# NYC Data wrangling using Python and Azure SQL Data Warehouse

#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
#                    License Information                     #
#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
# This sample IPython Notebook is shared by Microsoft under the MIT license.
# Please check the LICENSE.txt file in the directory where this Python script file is stored
# for license information and additional details.

#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
#                       Prerequisites                        #
#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
# Anaconda Python 2.7
# Or Python 2.7 and modules including pandas, numpy, matplotlib, time, pyodbc, tables
# Azure SQL Data Warehouse provisioned

#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
#                        Background                          #
#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
# This notebook demonstrates data exploration and feature generation
# using Python and SQL queries for data stored in Azure SQL Data Warehouse.
# We start with reading a sample of the data into a Pandas data frame and
# visualizing and exploring the data.
# We show how to use Python to execute SQL queries against the data
# and manipulate data directly within the Azure SQL Data Warehouse.

# This IPNB is accompanying material to the Azure Data Science in Action walkthrough document
# (https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-process-sqldw-walkthrough/)
# and uses the New York City Taxi dataset (http://www.andresmh.com/nyctaxitrips/).

#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
#    Step 1: Read data in Pandas frame for visualizations    #
#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-#
# We start with loading a sample of the data in a Pandas data frame and performing some explorations on the sample.
# We join the Trip and Fare data and select the top 10000 rows of the dataset in a Pandas dataframe.
# We assume that the Trip and Fare tables have been created and loaded to tables in SQL Data Warehouse.
# If you haven't done this already please refer to the 'Load the data to SQL Data Warehouse' section of this walkthrough.
| 39 | +# Step 1.1. Import required packages in this experiment (no output) |
| 40 | +import pandas as pd |
| 41 | +from pandas import Series, DataFrame |
| 42 | +import numpy as np |
| 43 | +import matplotlib.pyplot as plt |
| 44 | +from time import time |
| 45 | +import pyodbc |
| 46 | +import os |
| 47 | +import tables |
| 48 | +import time |
| 49 | + |
| 50 | +# Step 1.2. Initialize Database Credentials (no output) |
| 51 | +SERVER_NAME = '<server name>' |
| 52 | +DATABASE_NAME = '<database name>' |
| 53 | +USERID = '<user name>' |
| 54 | +PASSWORD = '<password>' |
| 55 | +DB_DRIVER = '<database driver>' |
| 56 | + |
| 57 | +# Step 1.3. Create Data Warehouse Connection (no output) |
| 58 | +CONNECTION_STRING = ';'.join([driver,server,database,uid,pwd, ';TDS_VERSION=7.3;Port=1433']) |
| 59 | +print CONNECTION_STRING |
| 60 | +conn = pyodbc.connect(CONNECTION_STRING) |
| 61 | + |
| 62 | +# Step 1.4. Report number of rows and columns in table <nyctaxi_trip> (outputs numbers of records and columns in trip table) |
| 63 | +nrows = pd.read_sql('''SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('<nyctaxi_trip>')''', conn) |
| 64 | +print 'Total number of rows = %d' % nrows.iloc[0,0] |
| 65 | + |
| 66 | +ncols = pd.read_sql('''SELECT count(*) FROM information_schema.columns WHERE table_name = ('<nyctaxi_trip>')''', conn) |
| 67 | +print 'Total number of columns = %d' % ncols.iloc[0,0] |
| 68 | + |
| 69 | +# Step 1.5. Report number of rows and columns in table <nyctaxi_fare> (outputs numbers of records and columns in fare table) |
| 70 | +nrows = pd.read_sql('''SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('<nyctaxi_fare>')''', conn) |
| 71 | +print 'Total number of rows = %d' % nrows.iloc[0,0] |
| 72 | + |
| 73 | +ncols = pd.read_sql('''SELECT count(*) FROM information_schema.columns WHERE table_name = ('<nyctaxi_fare>')''', conn) |
| 74 | +print 'Total number of columns = %d' % ncols.iloc[0,0] |
| 75 | + |
| 76 | +# Step 1.6 Read-in data from SQL Data Warehouse (outputs reading time and shape of data read in) |
| 77 | +t0 = time.time() |
| 78 | + |
| 79 | +#load only a small percentage of the joined data for some quick visuals |
| 80 | +df1 = pd.read_sql('''select top 10000 t.*, f.payment_type, f.fare_amount, f.surcharge, f.mta_tax, |
| 81 | + f.tolls_amount, f.total_amount, f.tip_amount |
| 82 | + from <nyctaxi_trip> t, <nyctaxi_fare> f where datepart("mi",t.pickup_datetime)=0 and t.medallion = f.medallion |
| 83 | + and t.hack_license = f.hack_license and t.pickup_datetime = f.pickup_datetime''', conn) |
| 84 | + |
| 85 | +t1 = time.time() |
| 86 | +print 'Time to read the sample table is %f seconds' % (t1-t0) |
| 87 | + |
| 88 | +print 'Number of rows and columns retrieved = (%d, %d)' % (df1.shape[0], df1.shape[1]) |
| 89 | + |
# Step 1.7. Descriptive statistics of the data (outputs statistics of data)
# Explore the sample, starting with descriptive statistics for trip distance:
trip_distance = df1['trip_distance']
trip_distance.describe()

# Step 1.8. Plot the box plot of trip_distance (outputs figures)
# Box plot of trip distance to visualize the quantiles
df1.boxplot(column='trip_distance', return_type='dict')

# Step 1.9. Plot the distribution of trip_distance (outputs figures)
# Kernel-density estimate on the left, 100-bin histogram on the right.
fig = plt.figure()
ax_kde = fig.add_subplot(1, 2, 1)
ax_hist = fig.add_subplot(1, 2, 2)
trip_distance.plot(ax=ax_kde, kind='kde', style='b-')
trip_distance.hist(ax=ax_hist, bins=100, color='k')
| 104 | + |
| 105 | +# Step 1.10. Put the trip_distance to bins |
| 106 | +trip_dist_bins = [0, 1, 2, 4, 10, 1000] |
| 107 | +df1['trip_distance'] |
| 108 | +trip_dist_bin_id = pd.cut(df1['trip_distance'], trip_dist_bins) |
| 109 | +trip_dist_bin_id |
| 110 | + |
| 111 | +# Step 1.11. Plot the bar and line charts of the trip_distance in bins (outputs figures) |
| 112 | +# The distribution of the trip distance values after binning looks like the following: |
| 113 | +pd.Series(trip_dist_bin_id).value_counts() |
| 114 | +# We can plot the above bin distribution in a bar or line plot as below |
| 115 | +pd.Series(trip_dist_bin_id).value_counts().plot(kind='bar') |
| 116 | +pd.Series(trip_dist_bin_id).value_counts().plot(kind='line') |
| 117 | +# We can also use bar plots for visualizing the sum of passengers for each vendor as follows |
| 118 | +vendor_passenger_sum = df1.groupby('vendor_id').passenger_count.sum() |
| 119 | +print vendor_passenger_sum |
| 120 | +vendor_passenger_sum.plot(kind='bar') |
| 121 | + |
# Step 1.12. Plot the Scatter plot between trip_time_in_secs and trip_distance (output figures)
# to see whether there is any correlation between them
plt.scatter(df1['trip_time_in_secs'], df1['trip_distance'])
# To further drill down on the relationship we can plot distribution side by side
# with the scatter plot (while flipping independent and dependent variables) as follows
df1_2col = df1[['trip_time_in_secs','trip_distance']]
# NOTE(review): pd.scatter_matrix was deprecated in later pandas releases in
# favor of pandas.plotting.scatter_matrix — fine for the Python 2.7 / pandas
# version this walkthrough targets.
pd.scatter_matrix(df1_2col, diagonal='hist', color='b', alpha=0.7, hist_kwds={'bins':100})
# Scatter plot of passenger_count vs. trip_distance
# (the original comment said rate_code, but the code plots passenger_count)
plt.scatter(df1['passenger_count'], df1['trip_distance'])

# Step 1.13. Calculate the correlation between trip_time_in_secs and trip_distance (outputs correlations between two columns)
# Pandas 'corr' function can be used to compute the correlation between trip_time_in_secs and trip_distance as follows:
df1[['trip_time_in_secs', 'trip_distance']].corr()
| 135 | + |
| 136 | +#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-# |
| 137 | +# Step 2: Exploring the Sampled Data in SQL Data Warehouse # |
| 138 | +#-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-# |
| 139 | +# In this section we used a sampled table we pregenerated by joining Trip and Fare data and taking a sub-sample of the full dataset. |
| 140 | +# The sample data table named '<nyctaxi_sample>' has been created and the data is loaded when you run the PowerShell script. |
| 141 | +# Step 2.1. Report number of rows and columns in the sampled table (outputs numbers of rows and columns in the sampled data table |
| 142 | +nrows = pd.read_sql('''SELECT SUM(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('<nyctaxi_sample>')''', conn) |
| 143 | +print 'Number of rows in sample = %d' % nrows.iloc[0,0] |
| 144 | + |
| 145 | +ncols = pd.read_sql('''SELECT count(*) FROM information_schema.columns WHERE table_name = ('<nyctaxi_sample>')''', conn) |
| 146 | +print 'Number of columns in sample = %d' % ncols.iloc[0,0] |
| 147 | + |
# Step 2.2. Check the tipped/not tipped distribution (outputs counts of trips in tipped/not tipped classes)
tipped_query = '''
    SELECT tipped, count(*) AS tip_freq
    FROM <nyctaxi_sample>
    GROUP BY tipped
    '''

pd.read_sql(tipped_query, conn)
| 156 | + |
# Step 2.3. Check the tip class (tip_amount) distribution (outputs counts of trips in tip classes)
tip_class_query = '''
    SELECT tip_class, count(*) AS tip_freq
    FROM <nyctaxi_sample>
    GROUP BY tip_class
'''

tip_class_dist = pd.read_sql(tip_class_query, conn)
tip_class_dist

# Step 2.4. Plot the tip distribution by class (outputs figures)
tip_class_dist['tip_freq'].plot(kind='bar')
| 169 | + |
# Step 2.5. Count the number of trips each day (outputs a data frame with count of trips in each day)
# Group by the dropoff date (datetime truncated to a date via CONVERT).
daily_trip_query = '''
    SELECT CONVERT(date, dropoff_datetime) as date, count(*) as c
    from <nyctaxi_sample>
    group by CONVERT(date, dropoff_datetime)
    '''
pd.read_sql(daily_trip_query, conn)
| 177 | + |
# Step 2.6. Count the number of trips per each medallion (outputs a data frame with count of trips by each medallion ID)
medallion_query = '''select medallion,count(*) as c from <nyctaxi_sample> group by medallion'''
pd.read_sql(medallion_query, conn)

# Step 2.7. Count the number of trips per each medallion and license (outputs a data frame)
medallion_license_query = '''select medallion, hack_license,count(*) from <nyctaxi_sample> group by medallion, hack_license'''
pd.read_sql(medallion_license_query, conn)

# Step 2.8. Count the number of trips by trip_time_in_secs (outputs a data frame)
trip_time_query = '''select trip_time_in_secs, count(*) from <nyctaxi_sample> group by trip_time_in_secs order by count(*) desc'''
pd.read_sql(trip_time_query, conn)

# Step 2.9. Count the number of trips by trip_distance (outputs a data frame)
# Distances are bucketed into 5-mile-wide bins via floor(trip_distance/5)*5.
trip_dist_query = '''select floor(trip_distance/5)*5 as tripbin, count(*) from <nyctaxi_sample> group by floor(trip_distance/5)*5 order by count(*) desc'''
pd.read_sql(trip_dist_query, conn)

# Step 2.10. Count the number of trips by payment type (outputs a data frame)
payment_type_query = '''select payment_type,count(*) from <nyctaxi_sample> group by payment_type'''
pd.read_sql(payment_type_query, conn)

# Step 2.11. Read the top 10 observations from the sample table (outputs a data frame)
top_ten_query = '''select TOP 10 * from <nyctaxi_sample>'''
pd.read_sql(top_ten_query, conn)