Getting NYC Taxi Data into Azure Data Lake

I wanted to get a meaningful dataset into Azure Data Lake so that I could test it out. I came across this article, which walks through using the NYC Taxi Dataset with Azure Data Lake:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-process-data-lake-walkthrough

The article skips over how to actually get the dataset into Azure. Here is how I did it:

  • Spin up a VM on Azure
  • In Server Manager, click Local Server, then click the On link next to IE Enhanced Security Configuration and set it to Off for at least Administrators (otherwise you will have to click OK a dozen times per web page)
  • Download the files from the NYC Taxi Trip website to your VM: http://www.andresmh.com/nyctaxitrips/
  • Install 7-Zip so that you can unzip the 7z files.
    • Once you install it from http://www.7-zip.org/download.html, go to the install folder (probably C:\Program Files\7-Zip) and right-click the 7z.exe file. Select the 7-Zip > Open archive option, then click the + sign and browse to your downloads folder
  • Because the files in the trip_data.7z file are larger than 2 GB, you cannot upload them using the portal, so you need to use PowerShell.
  • You need to install the Azure PowerShell cmdlets – look for the Windows Install link a bit down this page: https://azure.microsoft.com/en-us/downloads/
  • You will probably need to restart the VM for the Azure commands to be available in PowerShell
  • Go wild on Azure Data Lake Store using this doc https://github.com/Microsoft/azure-docs/blob/master/articles/data-lake-store/data-lake-store-get-started-powershell.md – here are the key steps:

# Log in to your Azure account
Login-AzureRmAccount

# List all the subscriptions associated with your account
Get-AzureRmSubscription

# Select a subscription
Set-AzureRmContext -SubscriptionId "xxx-xxx-xxx"

# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"

# Verify your ADL account name
Get-AzureRmDataLakeStoreAccount

# List the root folder to figure out where to put the files
Get-AzureRmDataLakeStoreChildItem -AccountName mlspike -Path "/"

NOTE: If you do not want to copy the files one by one, you can copy the whole folder using this format: Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path "C:\Users\Taxi\Desktop\files2\trip_data" -Destination $myrootdir\TaxiDataFiles
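
For anyone who would rather script the extract and the one-file-at-a-time upload, here is a rough sketch of how that could look. It is not exactly what I ran. The archive location is a made-up placeholder, the extract folder and the /TaxiDataFiles destination are borrowed from the note above, and the 7z.exe path assumes the default install folder. Adjust all of them to match your VM.

```powershell
# Assumptions: every path below is illustrative, not taken from a verified run.
$accountName = "mlspike"                                  # ADL Store account from the steps above
$sevenZip    = "C:\Program Files\7-Zip\7z.exe"            # default 7-Zip install location
$archive     = "C:\Users\Taxi\Downloads\trip_data.7z"     # hypothetical download location
$extractDir  = "C:\Users\Taxi\Desktop\files2\trip_data"   # local folder used in the note above
$destFolder  = "/TaxiDataFiles"                           # destination folder in the store

# Extract the archive; "x" keeps full paths, "-o" sets the output folder (no space after -o)
& $sevenZip x $archive "-o$extractDir"

# Create the destination folder (this may report an error if the folder already exists)
New-AzureRmDataLakeStoreItem -AccountName $accountName -Path $destFolder -Folder

# Upload each extracted file individually
Get-ChildItem -Path $extractDir -File | ForEach-Object {
    Import-AzureRmDataLakeStoreItem -AccountName $accountName `
        -Path $_.FullName `
        -Destination "$destFolder/$($_.Name)"
}

# List the destination folder to confirm the files arrived
Get-AzureRmDataLakeStoreChildItem -AccountName $accountName -Path $destFolder
```

Uploading file by file is slower to type than the folder form, but it makes it easier to retry a single file if one upload fails partway through.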

Once you have the files uploaded to Azure Data Lake, you can delete the VM.

If you know of a faster way of getting them there (without downloading them to your local machine), I would love to hear it!

Thanks.

Matt
