Setting up an Azure Data Lake and Azure Data Factory using PowerShell

Login-AzureRmAccount
#first ensure that you have an Azure Data Lake that you want to use for ODX
#$resourceGroups = Get-AzureRmResourceGroup
#$azureDataLakeNames = ""
#foreach ($resourceGroup in $resourceGroups) {
#    $azureDataLake = Get-AzureRmDataLakeStoreAccount -ResourceGroupName $resourceGroup.ResourceGroupName
#    $azureDataLake
#    $azureDataLakeName = $azureDataLake.Name
#    $azureDataLakeNameLength = $azureDataLakeName.Length
#    if ($azureDataLakeNameLength -gt 0) {
#        $azureDataLakeNames += " " + $azureDataLake.Name + " (resource group: " + $resourceGroup.ResourceGroupName + " & location: " + $resourceGroup.Location + ")"
#    }
#}
#"-----------"
#"DataLakeNames: " + $azureDataLakeNames
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
#REQUIRED: you must enter a unique appname, which will be used as the name of the service principal
$appname = "inventose"
#OPTIONAL: change the password for the service principal
$password = "Xyzpdq"
#run the commented-out script above and set $dataLakeStoreName to the name of your existing data lake store (matching its resource group/location), or keep a new name to have a data lake created
$dataLakeStoreName = $appname
$odxResourceGroup = $appname
$dataLakeLocation = "Central US" #Central US, East US 2, North Europe
#recommended to use the same resource group as the data lake for simplicity, but you can use any resource group or enter a new name to create one
$dataFactoryResourceGroup = $odxResourceGroup
#specify where you want your data factory - current options are East US, North Europe, West Central US, and West US
$dataFactoryLocation = "West US"
#-------------------------------------------------------------------------------
#-------------------------------------------------------------------------------
#create odxResourceGroup, if it does not exist
Get-AzureRmResourceGroup -Name $odxResourceGroup -ErrorVariable notPresent1 -ErrorAction 0
if ($notPresent1)
{
    New-AzureRmResourceGroup -Location $dataLakeLocation -Name $odxResourceGroup
}
#create data lake, if it does not exist
Get-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName -ErrorVariable notPresent2 -ErrorAction 0
if ($notPresent2)
{
    New-AzureRmDataLakeStoreAccount -Location $dataLakeLocation -Name $dataLakeStoreName -ResourceGroupName $odxResourceGroup
}
$homepage = "https://ODXPS.com/" + $appname
#create service principal (note: this does not check whether the app registration already exists)
$app = New-AzureRmADApplication -DisplayName $appname -HomePage $homepage -IdentifierUris $homepage -Password $password
$app = Get-AzureRmADApplication -DisplayName $appname
$servicePrincipal = New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId
Start-Sleep 10
New-AzureRmRoleAssignment -RoleDefinitionName "Contributor" -ObjectId $servicePrincipal.Id -ResourceGroupName $odxResourceGroup
New-AzureRmRoleAssignment -RoleDefinitionName "Data Factory Contributor" -ObjectId $servicePrincipal.Id -ResourceGroupName $odxResourceGroup
New-AzureRmRoleAssignment -RoleDefinitionName "Reader" -ObjectId $servicePrincipal.Id -ResourceGroupName $odxResourceGroup
#Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $app.ApplicationId -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $servicePrincipal.Id -Permissions All
Get-AzureRmDataLakeStoreItem -Account $dataLakeStoreName -Path /ODX -ErrorVariable notPresent3 -ErrorAction 0
if ($notPresent3)
{
    New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path /ODX
}
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path /ODX -AceType User -Id $servicePrincipal.Id -Permissions All
#Start-Sleep 60 #there seems to be a lag between when these permissions are added and when they are applied...trying 1 minute to start
$subscription = Get-AzureRmSubscription
$subscriptionId= ($subscription).Id
$tenantId = ($subscription).TenantId
#ensure there are permissions
#Get-AzureRmDataLakeStoreItemAclEntry -Account $dataLakeStoreName -Path /
#get information on datalake
$dataLake = Get-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName
#here is a printout
"---------------------------------------------------------------"
"---------------------------------------------------------------"
$text1= "Azure Data Lake Name: " + $dataLakeStoreName + "`r`n" +
"Tenant ID: " + $tenantId + "`r`n" +
"Client ID: " + $app.ApplicationId + "`r`n" +
"Client Secret: " + $password + "`r`n" +
"Subscription ID: " + $subscriptionId + "`r`n" +
"Resource Group Name: " + $odxResourceGroup + "`r`n" +
"Data Lake URL: adl://" + $dataLake.Endpoint + "`r`n" +
"Location: " + $dataFactoryLocation
"---------------------------------------------------------------"
"---------------------------------------------------------------"
Out-File C:\Users\MattDyor\Desktop\DataLake.ps1 -InputObject $text1
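
The script above grants the service principal Data Factory Contributor rights, but it never actually creates a data factory. As a minimal sketch of that last step, assuming the AzureRM.DataFactories (v1) module is installed and using a made-up factory name derived from $appname, you could add something like this to the end of the script:

#create data factory, if it does not exist (v1 data factory; $dataFactoryName is a made-up example name)
$dataFactoryName = $appname + "df"
Get-AzureRmDataFactory -ResourceGroupName $dataFactoryResourceGroup -Name $dataFactoryName -ErrorVariable notPresent4 -ErrorAction 0
if ($notPresent4)
{
    New-AzureRmDataFactory -ResourceGroupName $dataFactoryResourceGroup -Name $dataFactoryName -Location $dataFactoryLocation
}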


Getting NYC Taxi Data into Azure Data Lake

I wanted to get a meaningful dataset into Azure Data Lake so that I could test it out. I came across this article, which walks through using the NYC Taxi dataset with Azure Data Lake:

https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-process-data-lake-walkthrough

The article kind of skips over the whole part of getting the dataset into Azure. Here is how I did it:

  • Spin up a VM on Azure
  • In Server Manager, click on Local Server, click the On link next to IE Enhanced Security Configuration, and at least set Admin to Off (or else you will have to click OK a dozen times per web page)
  • Download the files from the NYC Taxi Trip website to your VM http://www.andresmh.com/nyctaxitrips/
  • Install 7-Zip so that you can unzip the 7z files.
    • Once you install it from http://www.7-zip.org/download.html, go to the install folder (probably C:\Program Files\7-Zip) and right click the 7z.exe file. Select the 7-Zip > Open archive option and then click the + sign and browse to your downloads folder
  • Because the files in the trip_data.7z file are larger than 2GB, you cannot upload them using the portal, and you need to use PowerShell.
  • You need to install the Azure PowerShell cmdlets – look for the Windows Install link a bit down this page https://azure.microsoft.com/en-us/downloads/
  • You will probably need to restart the VM for the Azure commands to be available in PowerShell
  • Go wild on Azure Data Lake Store using this doc https://github.com/Microsoft/azure-docs/blob/master/articles/data-lake-store/data-lake-store-get-started-powershell.md – here are the key steps:

# Log in to your Azure account
Login-AzureRmAccount

# List all the subscriptions associated to your account
Get-AzureRmSubscription

# Select a subscription
Set-AzureRmContext -SubscriptionId "xxx-xxx-xxx"

# Register for Azure Data Lake Store
Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"

#Verify your ADL account name
Get-AzureRmDataLakeStoreAccount

#Figure out what folder to put the files
Get-AzureRmDataLakeStoreChildItem -AccountName mlspike -Path "/"

NOTE: if you do not want to copy the files one-by-one, you can just copy the whole folder using this format: Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path "C:\Users\Taxi\Desktop\files2\trip_data" -Destination $myrootdir\TaxiDataFiles
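
And if you do want to upload the files one at a time (for example, to retry a single failed transfer), a minimal sketch looks like the following; the local folder and the /TaxiDataFiles destination are assumptions chosen to match the note above:

#upload each extracted trip_data CSV individually (local path and destination folder are assumed for illustration)
$localFolder = "C:\Users\Taxi\Desktop\files2\trip_data"
foreach ($file in Get-ChildItem $localFolder -Filter *.csv)
{
    Import-AzureRmDataLakeStoreItem -AccountName mlspike -Path $file.FullName -Destination ("/TaxiDataFiles/" + $file.Name)
}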

Once you have the files uploaded to Azure Data Lake, you can delete the VM.

If you know of a faster way of getting them there (without downloading them to your local machine), I would love to hear it!

Thanks.

Matt

Quick Hit: Common Ways to Interact with Hadoop

MapReduce: geniuses only. If you are on this page, read the next option!

Pig: You write scripts in a language called Pig Latin, which lets you transform and query data on Hadoop without hand-writing MapReduce. Developed by Yahoo. Easy to learn.

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

Hive: originally built by Facebook, a social networking site (you knew you would learn something). It has a SQL-like language called HiveQL. The queries are translated into MapReduce, Tez, or Spark jobs.

DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

Oozie: an orchestration framework that allows you to string together different MapReduce, Pig, and Hive jobs.

Setting Up Goals in Google Analytics and Google Tag Manager

Set Up a Destination Goal

If you want to start improving the business performance of your website, you have to have a goal, sometimes called a conversion event. If your goal is a destination (e.g., a “thank you” or confirmation page), Google Analytics makes it easy. Just click on the Admin tab along the top, scroll over to the View column, and select Goals. Select Create New Goal, give your goal a name (e.g., Purchase), select Destination, and then enter the part of the URL following your domain name (e.g., if your page is http://www.mysite.com/success, just enter /success).

To make sure everything is set up properly, you can click the Verify link. If you get this message:

This Goal would have a 0% conversion rate based on your data from the past 7 days. Try a different set up.

You may have misconfigured your goal, the goal may not have been reached yet (e.g., nobody has visited that page), or the data may still be working its way through the internets (it took about 12 hours between when I started firing an event and when the first completion was recorded).

Set Up an Event Goal using Google Tag Manager and Google Analytics

Occasionally, you may want to fire a goal on an event, such as a button click or a form submission that posts outside of the host domain (hello Paypal). This is a bit trickier. The best way to do it is to use a combination of Google Tag Manager and Google Analytics.

Create the Trigger

In Google Tag Manager, you first need to create a trigger for the event that you want to capture.

Triggers > New > Click > Just Links.

This is where I play it a little dangerous: I do encourage waiting for tags for up to 2 seconds (2000 in the milliseconds box). This gives Google up to 2 seconds to record the click before following the link…a bit of a delay for the user, but you will not be swallowing (as many) clicks through fast page transitions. You will want to validate that the form submission still works, because the delay may interfere with validations or other scripts that were built without expecting one.

The rest of the selections are basic, so we can now jump to the 2nd part of the process…

Create the Tag

Still in Google Tag Manager, create a New tag.

Tags > New > Google Analytics > Universal Analytics > Add your Google Analytics Tracking ID > Track Type = Event > Action = “BuyLinkClick” (or whatever) > Fire on Click > Select the Trigger you just created. Phew!

Now you are pushing the events to your GA account. If you head over to Google Analytics > Real-Time > Events, you should be able to see it as you click these links.

Create the Goal

Back in Google Analytics, go to Admin, scroll over to View, and click Goals. New Goal > Event > Action = “BuyLinkClick”. Odds are, if you click on Verify, you will get this message:

This Goal would have a 0% conversion rate based on your data from the past 7 days. Try a different set up.

You may have misconfigured your goal, the goal may not have been reached yet (e.g., nobody has visited that page), or the data may still be working its way through the internets (it took about 12 hours between when I started firing an event and when the first completion was recorded).

Come back tomorrow to see if this goal is being tracked properly.

From Text File to Map in 30 Minutes

I wanted to see where United States patent attorneys reside. Using data from the United States Patent and Trademark Office and a free account from Microsoft’s Power BI, I was able to create the two visualizations below in about 30 minutes. Pretty amazing. Let me know if you want any details on how to do this – I sketched out the process below.

https://app.powerbi.com/view?r=eyJrIjoiMmI4N2Y0NWUtY2MyMi00NjQwLWE5MTMtNGRmMzdlMDVlMjBhIiwidCI6ImRmM2ZmMTE5LWU4N2MtNDlmNy1hM2IxLTY2NzdkMzAzOGZlMSIsImMiOjZ9

Steps to Create Power BI Map Visualization

  • Grab data: text file of all active US Patent Attorneys can be found here: https://oedci.uspto.gov/OEDCI/practitionerRoster.jsp
  • Visit Power BI Getting Started Page https://powerbi.microsoft.com/en-us/get-started/ so you can:
    • Download the Software (this allows you to author your visualizations); AND
    • Sign Up for a Power BI account (this allows you to publish your visualizations to the web)
  • Once you have downloaded the Desktop Power BI software, create a new report (File > New).
  • Click on Get Data > File > Text, and point to the file you downloaded (you need to extract it from the zip file if you have not already done that).
  • There are two types of maps in Power BI: maps and filled maps; I used the map type, which represents each data point as a bubble.
  • For the top left map, just drag the City on to the Location box.
  • For the bottom left map, drag State on to the Location box, and drag State on to Size; click on the State under Size, and make sure Count is selected (not Count Distinct).
  • Finally, add a couple of tables to the right. For the top map, drag over City and Firm Name under Values, and then click on the down arrow for Firm Name and select Count (this will aggregate by city instead of showing all of the Firm Names along with their city).
    • I should have used registration number, instead of Firm Name, because this is actually a count of FIRMS by city, not a count of practitioners. Alas, if that is the worst mistake I make today I am doing all right.
  • On the desktop, you can now click on the Publish icon in the top right.
  • Now navigate to the web version of Power BI and open your report.
  • Under File select “Publish to Web”
  • Grab the “HTML you can add to your website” and paste this into a text view of your blog, and you are done. Super easy.

If you have published or stumbled upon some nice Power BI visualizations, drop them in a comment. I am a bit surprised this has not become more common. My prediction: data visualizations will become the norm in 2017, because all of these visualization tools are racing to become the standard, and are breaking down the barriers to adoption that have historically prevented people from jumping in (primarily complexity and cost). Exciting times.

Thanks.

Matt

Visualizing Sports League Standings

I have been thinking about a new visualization for sports league standings. Something that you could use to see teams on the ascent, and teams on their way down. Ideally, this would include all of the teams in a league, but since it takes a bit of time for me to prepare the data, I figured I would start by sharing a micro-set of data, seeing whether this visualization already exists somewhere, and, if not, whether you find the visualization intriguing.
MLS Standings

If you click on the image, you can download the Excel spreadsheet that I used to create the chart.

What you are seeing is the 2016 season of wins and losses for the Seattle Sounders (red) and the Portland Timbers (orange). A win is worth +1, a loss is worth -1, and a tie is worth 0. For the first game, you see that the Sounders lose and the Timbers win.
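
For anyone who would rather script the running total than build it in Excel, here is a minimal sketch of the calculation; the result sequence is made up for illustration rather than either team's actual 2016 record:

#cumulative standing: a win is +1, a loss is -1, a tie is 0 (results below are made up)
$results = "W","L","T","W","W","L","T","L","W"
$score = 0
$standing = foreach ($result in $results)
{
    switch ($result) { "W" { $score++ } "L" { $score-- } }
    $score
}
$standing -join ", "   #prints: 1, 0, 0, 1, 2, 1, 1, 0, 1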

Some interesting things jump out from this visualization: the teams are in a dead heat from games 8 to 13, the Timbers are on top for most of the rest of the season, and around game 25 the Sounders are in the toilet. But, you have to get hot at the right time, and that is exactly what the Sounders did: advancing from -4 to +1 over the course of the last 9 games.

Let me know what you think.

Thanks.

Matt