This repository was archived by the owner on Jun 29, 2019. It is now read-only.

Commit 634eb45 (merge of 2 parents: b5a6ca6 + 226ecf7)

39 files changed
Lines changed: 14771 additions & 6255 deletions
Lines changed: 1 addition & 0 deletions
### Welcome to the Linux Data Science Virtual Machine
Lines changed: 36 additions & 0 deletions
Anaconda is a popular Python distribution containing several pre-built and tested scientific and analytic Python packages, including NumPy, Pandas, SciPy, Matplotlib, and IPython. It also provides a package manager that lets you download many more Python libraries just by entering "conda install <package name>".

This installation contains both Python 2.7 and Python 3.5. Python 3.5 is the instance on the default PATH. You can change this in the profile file (/etc/profile.d/dsvm.sh).

To activate Python 2.7, run the following from the shell:

source /anaconda/bin/activate root

Python 2.7 is installed at /anaconda/bin.

You can list the installed packages by running:

pip list

To activate Python 3.5, run the following from the shell:

source /anaconda/bin/activate py35

Python 3.5 is installed at /anaconda/envs/py35/bin.

You can list the packages installed in Python 3.5 by running:

pip list

To install any package, try the following, in order:

conda install <package name>
pip install <package name>

The package will be installed in the currently activated environment.

To run Python interactively, just type "python" in the shell to run the currently activated version. If you are on a graphical (X2Go) client, you can also use the Spyder IDE by typing "spyder". In addition, you can use text editors like Vim, Emacs, or gedit.

More info: https://docs.continuum.io/
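The conda-then-pip install order can be sketched as a small shell helper (hypothetical; install_pkg is not a command shipped on the VM, it just illustrates the recommended order):

```shell
# Hypothetical helper illustrating the recommended install order:
# try conda first, fall back to pip in the same activated environment.
install_pkg() {
    pkg="$1"
    if command -v conda >/dev/null 2>&1 && conda install -y "$pkg"; then
        echo "installed $pkg via conda"
    else
        pip install "$pkg" && echo "installed $pkg via pip"
    fi
}
```

For example, "install_pkg seaborn" first asks conda for the package and only falls back to pip if conda cannot provide it.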
Lines changed: 14 additions & 0 deletions
The Data Science VM comes with tools and libraries to access various services in Azure. The following are the key tools and libraries:

* Azure Command Line Interface: The Azure command line interface (CLI) allows you to create and manage Azure resources through the shell. To invoke the Azure tools, just type "azure help". For more information, please refer to the Azure CLI documentation page found at: https://azure.microsoft.com/documentation/articles/virtual-machines-command-line-tools/.

* Microsoft Azure Storage Explorer: The Microsoft Azure Storage Explorer is a graphical tool used to browse through the objects stored in your Azure Storage account and to upload or download data in Azure blobs. You can access the Storage Explorer from the desktop shortcut icon, or invoke it from a shell prompt by typing "StorageExplorer". You need to be logged in from the X2Go client or have X11 forwarding set up.

* Azure Libraries: We have installed several libraries. The following are some of the libraries available to you:

- Python: The Azure-related libraries installed in Python are "azure", "azureml", "pydocumentdb", and "pyodbc". These libraries allow you to access Azure storage services, Azure Machine Learning, and Azure DocumentDB (a NoSQL database on Azure). pyodbc, along with the Microsoft ODBC driver for SQL Server, enables access to Microsoft SQL Server, Azure SQL Database, and Azure SQL Data Warehouse from Python using the ODBC interface. Please enter "pip list" to see all the installed libraries. Be sure to run this command in both the Python 2.7 and 3.5 environments.

- R: The Azure-related library installed in R is "AzureML". The "RODBC" library provides access to ODBC data sources like Azure SQL Data Warehouse and Azure SQL Database.

- Java: The list of Azure Java libraries can be found in the directory "/dsvm/sdk/AzureSDKJava" on the VM. The key libraries are the Azure storage and management APIs, DocumentDB, and JDBC drivers for SQL Server.

You can access the Azure portal (https://portal.azure.com) from the pre-installed Firefox browser. You can install other browsers. On the Azure portal you can create, manage and monitor Azure resources.
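As a sketch of CLI usage (the container name, files, and the helper itself are placeholders; this assumes the classic "azure" CLI mentioned above and a storage account already configured, e.g. via the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_ACCESS_KEY environment variables):

```shell
# Hypothetical helper: upload every file passed on the command line to an
# Azure blob container using the classic "azure" CLI.
upload_to_container() {
    container="$1"; shift
    for f in "$@"; do
        azure storage blob upload "$f" "$container" "$(basename "$f")"
    done
}
```

For example, "upload_to_container mydata results.csv model.bin" uploads both files, naming each blob after its file.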
Lines changed: 14 additions & 0 deletions
Azure Machine Learning (ML) is a fully managed cloud service that enables you to easily build, deploy, and share predictive analytics solutions. You build your experiments and models in the Azure Machine Learning Studio. It can be accessed from a web browser on the Data Science Virtual Machine by visiting https://studio.azureml.net.

Once you log in to the Azure Machine Learning Studio, you have access to an experimentation canvas where you can build your ML workflows, as well as a Jupyter notebook hosted on Azure ML that works seamlessly with the experimentation canvas. Azure ML lets you build your ML models and wrap them in a web service interface for operationalization. This enables clients written in any language to invoke predictions from the ML models. You can find more information about Azure ML on the documentation page:

https://azure.microsoft.com/documentation/services/machine-learning/

You can also build your models in R or Python on the VM and deploy them in production on Azure ML. We have installed libraries in R and Python to enable this functionality.

The library in R is called "AzureML". In Python it is called "azureml".

For information on how to deploy models in R and Python into Azure ML, please refer to the following article (written for the Windows version of the Data Science VM, but applicable to the Linux VM too):

https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-vm-do-ten-things/#3-build-models-using-r-or-python-and-operationalize-them-using-azure-machine-learning
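A deployed web service can be invoked from any HTTP client. Here is a minimal sketch with curl (the endpoint URL, API key, and JSON body below are placeholders; the exact request format for your service is shown on its API help page in Azure ML Studio):

```shell
# Hypothetical helper: POST a scoring request to a published Azure ML
# web service endpoint and print the JSON response.
call_ml_service() {
    url="$1"; api_key="$2"; body="$3"
    curl -s -X POST "$url" \
        -H "Authorization: Bearer $api_key" \
        -H "Content-Type: application/json" \
        -d "$body"
}
```

For example: call_ml_service "https://<region>.services.azureml.net/.../execute?api-version=2.0" "$API_KEY" '{"Inputs": {...}}'.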
Lines changed: 41 additions & 0 deletions
The open source database Postgres is available on the VM with the service running and initdb already completed. You still need to create databases and users. Please refer to the Postgres documentation found at http://www.postgresql.org/docs/9.2/static/index.html.

In order to access SQL Server databases on-premises or in the cloud, as well as Azure SQL Data Warehouse, we have provided ODBC and JDBC drivers.

SQuirreL SQL Client:
====================

A graphical SQL client, SQuirreL SQL, has been provided to connect to different databases (Microsoft SQL Server, Postgres, MySQL, etc.) and run SQL queries. You can run this from a graphical desktop session (using the X2Go client) or an SSH session with X11 forwarding. To invoke SQuirreL SQL, either launch it from the icon on the desktop or run the following command in the shell:

/usr/local/squirrel-sql-3.7/squirrel-sql.sh

The first time, you need to set up your drivers and database aliases. The JDBC drivers are located at:

/usr/share/java/jdbcdrivers

More information on SQuirreL SQL can be found at: http://squirrel-sql.sourceforge.net/index.php?page=screenshots

Command Line tools for accessing Microsoft SQL Server:
======================================================

The ODBC driver package for Microsoft SQL Server also comes with two command line tools:

bcp - The bcp utility bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files. To import data into a table, you must either use a format file created for that table or understand the structure of the table and the types of data that are valid for its columns.

More info: https://msdn.microsoft.com/en-us/library/hh568446(v=sql.110).aspx

sqlcmd - The sqlcmd utility lets you enter Transact-SQL statements, system procedures, and script files at the command prompt. This utility uses ODBC to execute Transact-SQL batches.

More info: https://msdn.microsoft.com/en-us/library/hh568447(v=sql.110).aspx

Note: There are some differences in these utilities between the Linux and Windows platforms. Please see the documentation pages above for details.

Database Access Libraries:
==========================

There are libraries available in Python and R to access databases. In R, the RODBC package or the dplyr package allows you to query or execute SQL statements on the database server. In Python, the pyodbc library provides database access with ODBC as the underlying layer. To access Postgres from Python you can also use the "psycopg2" library. For accessing Postgres from R you can also use the "RPostgreSQL" package.
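A minimal sketch of the two command line tools together (the server name, credentials, database, and table are placeholders, and the helper itself is hypothetical; sqlcmd's -S/-U/-P/-Q flags and bcp's "out ... -c" character-format export are the documented options):

```shell
# Hypothetical helper: run an ad-hoc query with sqlcmd, then bulk-export a
# table to a character-format data file with bcp.
query_and_export() {
    server="$1"; user="$2"; password="$3"
    # List a few tables via an ad-hoc Transact-SQL query
    sqlcmd -S "$server" -d mydb -U "$user" -P "$password" \
        -Q "SELECT TOP 5 name FROM sys.tables"
    # Export a table to mytable.dat in character format
    bcp mydb.dbo.mytable out mytable.dat -c \
        -S "$server" -U "$user" -P "$password"
}
```

For an Azure SQL Database, the server would look like "myserver.database.windows.net".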
Lines changed: 14 additions & 0 deletions
Software Development tools on the Data Science Virtual Machine
==============================================================

You have a choice of several code editors, including vi/Vim, Emacs, gedit, and Eclipse. gedit and Eclipse are graphical editors and require you to be logged in to a graphical desktop. These editors have desktop and application menu shortcuts to launch them.

Vim and Emacs are text-based editors. Emacs has an add-on package called ESS (Emacs Speaks Statistics) that makes working with R easier within Emacs. More info can be found at: http://ess.r-project.org/.

Eclipse is an open source, extensible IDE supporting multiple languages. This instance is the Java developers edition. There are plugins available for several popular languages that you can install to extend the Eclipse environment. We have also installed an Eclipse plugin called Azure Toolkit for Eclipse, which allows you to easily create, develop, test, and deploy Azure applications from the Eclipse development environment in supported languages like Java. There is also the Azure SDK for Java, which provides access to different Azure services from a Java environment. More information on the Azure Toolkit for Eclipse can be found at: https://azure.microsoft.com/documentation/articles/azure-toolkit-for-eclipse/.

LaTeX is installed through the texlive package along with an Emacs add-on package, auctex (https://www.gnu.org/software/auctex/manual/auctex/auctex.html), which simplifies authoring LaTeX documents within Emacs.

In addition to Python, R, and Java, you can also use Node.js, Perl, PHP, and Ruby to build your applications on the VM. Azure SDKs for Node.js, PHP, and Ruby are installed. You can find more information about the Azure SDK for all supported platforms and languages at: https://azure.microsoft.com/en-us/downloads/.

For other platforms or languages, you can explore the REST APIs provided by Azure. You can find pointers to the REST APIs of various Azure services at: https://msdn.microsoft.com/en-us/library/azure/mt420159.aspx
Lines changed: 84 additions & 0 deletions
#!/bin/bash

# Text menu (whiptail) that shows info pages about the tools pre-installed
# on the Linux Data Science Virtual Machine.

calculate_dimensions() {
    # Size the whiptail dialog to the current terminal.
    WT_HEIGHT=20
    WT_WIDTH=$(tput cols)

    if [ -z "$WT_WIDTH" ] || [ "$WT_WIDTH" -lt 60 ]; then
        WT_WIDTH=80
    fi
    if [ "$WT_WIDTH" -gt 178 ]; then
        WT_WIDTH=160
    else
        WT_WIDTH=$(($WT_WIDTH-4))
    fi

    WT_H=$(tput lines)

    if [ "$WT_H" -gt 35 ]; then
        WT_HEIGHT=35
    fi

    WT_MENU_HEIGHT=10
}

show_info() {
    # Display a scrollable text box with the given title and file.
    whiptail --title "$1" --scrolltext --textbox "$2" "$WT_HEIGHT" "$WT_WIDTH"
}

while true; do
    calculate_dimensions
    CHOICE=$(whiptail --title "Linux Data Science Virtual Machine" \
        --menu "Get more info on data science tools pre-installed on this machine." "$WT_HEIGHT" "$WT_WIDTH" "$WT_MENU_HEIGHT" \
        --ok-button "More Info" \
        --cancel-button "Close" \
        "Setup graphical desktop" "Connect to the VM using X2Go graphical desktop client" \
        "Anaconda Python Distribution" "Enterprise ready Python distribution with several data analytics libraries" \
        "Azure Machine Learning Libraries" "Libraries to work with Azure Machine Learning in the cloud" \
        "Azure Tools" "Tools and libraries to work with Azure services" \
        "Database Tools" "Manage and query relational databases" \
        "Development Tools" "IDEs and code editors" \
        "Jupyter Notebook Server(R, Python)" "Browser based interactive data science and scientific computing" \
        "Machine Learning Tools" "Tools to run ML algorithms locally (CNTK, R, Scikit-learn, Vowpal Wabbit, xgboost and more)" \
        "Microsoft R Open" "Open source, enhanced distribution of R with Math Kernel Library" \
        3>&1 1>&2 2>&3)
    RET=$?
    if [ "$RET" -eq 1 ]; then
        # Close button pressed
        exit 1
    elif [ "$RET" -eq 0 ]; then
        # Map the selected menu entry to its info page.
        case "$CHOICE" in
            Setup*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/x2gosetup"
                ;;
            Anaconda*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/anaconda"
                ;;
            Azure\ Machine\ Learning*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/azureml"
                ;;
            Azure\ Tools*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/azure"
                ;;
            Database*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/dbtools"
                ;;
            Development*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/devtools"
                ;;
            Jupyter*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/jupyter"
                ;;
            Machine\ Learning*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/mltools"
                ;;
            Microsoft\ R*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/microsoftR"
                ;;
            More*)
                show_info "$CHOICE" "/usr/share/dsvm-more-info/more"
                ;;
        esac
    fi
done
Lines changed: 27 additions & 0 deletions
Jupyter Notebook Server
=======================

The Anaconda Python distribution comes with Jupyter notebooks, a browser-based environment to explore data and share code and analysis.

Setting up a password for the Notebook Server
=============================================

The notebook server comes with a default password that you need to update as a one-time task. Here are the steps to set a strong password for the Jupyter notebook server.

Run the following command from the shell on the Data Science Virtual Machine to create your own strong password for the Jupyter notebook server installed on the machine:

python -c "import IPython; print(IPython.lib.passwd())"

Choose a strong password when prompted.

You will see the password hash in the format "sha1:xxxxxx" in the output. Copy this password hash and replace the existing hash in your notebook config file, located at "/usr/local/etc/jupyter/jupyter_notebook_config.py", for the parameter named "c.NotebookApp.password". You will need to edit this file as root.

You should only replace the existing hash value that is within the quotes. The quotes and the "sha1:" prefix of the parameter value need to be retained.

Finally, you need to stop and restart the Jupyter service that is installed in /etc/init.d/jupyter. If your new password is not accepted after restarting Jupyter, or you have issues stopping Jupyter, try restarting the virtual machine.

Accessing the notebook:

A Jupyter notebook server has been pre-configured with Python 2, Python 3, and R kernels. If you are on the VM through the X2Go client, you can click the desktop icon named "Jupyter" to launch a browser and access the notebook server. You can also visit https://localhost:9999/ from a web browser (note: continue past any certificate warnings). You can access the Jupyter notebook server from any remote host by entering the URL https://<VM DNS name or IP Address>:9999/. We have packaged a few sample notebooks; you can see the link to the samples on the notebook home page after you authenticate to Jupyter using the password you created in the earlier step. You can create a new notebook by selecting "New" and then the language kernel. If you don't see the "New" button, click the Jupyter icon on the top left to go to the home page of the notebook server.

The Jupyter notebook server listening on port 9999 is set up to run from /etc/init.d/jupyter and is started automatically when you boot the VM. If you want to reconfigure the notebook server, you can edit the file "/usr/local/etc/jupyter/jupyter_notebook_config.py" as root/sudo.
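The edit-and-restart steps for changing the password can be sketched as a small helper (hypothetical script, not shipped on the VM; it assumes the config file contains a line starting with c.NotebookApp.password, as described above):

```shell
# Hypothetical helper: write a new password hash into the Jupyter config and
# restart the service. Pass the "sha1:..." hash printed by IPython.lib.passwd().
update_jupyter_password() {
    hash="$1"
    config="${2:-/usr/local/etc/jupyter/jupyter_notebook_config.py}"
    # Replace only the value; keep the parameter name and the quotes.
    sudo sed -i "s|^c\.NotebookApp\.password = .*|c.NotebookApp.password = u'$hash'|" "$config"
    sudo /etc/init.d/jupyter restart
}
```

For example: update_jupyter_password "sha1:xxxxxx" after generating the hash as shown earlier in this section.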
Lines changed: 8 additions & 0 deletions
Microsoft R Open
================

R is one of the most popular languages for data analysis and machine learning. If you wish to use R for your analytics, the VM has Microsoft R Open (MRO) with the Math Kernel Library (MKL), which optimizes the math operations common in analytical algorithms. MRO is 100% compatible with CRAN R, and you can install any of the R libraries published on CRAN on the MRO installation. You can edit your R programs in one of the default editors like vi, Emacs, or gedit. You can also download and use other IDEs such as RStudio (http://www.rstudio.com); for your convenience, a simple script (installRStudio.sh) is provided in the "/dsvm/tools" directory to let you install RStudio. In addition, we have installed the Emacs package ESS (Emacs Speaks Statistics), which simplifies working with R files within the Emacs editor.

To launch R, just type R in the shell; you will be taken to an interactive environment. To develop your R programs, you will typically use an editor like Emacs, vi, or gedit and then run the scripts within R. If you install RStudio, you will have a full graphical IDE environment in which to develop your R programs.

There is also an R script to install the top 20 R packages (from http://www.kdnuggets.com/2015/06/top-20-r-packages.html). This script can be run once you are in the R interactive interface, which can be entered by typing R in the shell.
Lines changed: 73 additions & 0 deletions
The VM comes with a few ML tools/algorithms pre-compiled and pre-installed. These include:

* CNTK (Computational Network Toolkit from Microsoft Research) - a deep learning toolkit
* Vowpal Wabbit - a fast online learning system
* xgboost - a tool providing optimized boosted tree algorithms
* Python - Anaconda Python comes bundled with ML libraries like Scikit-learn. You can install other libraries by running "pip install <package name>"
* R - R comes with a rich library of ML functions. Some of the pre-installed functions and packages are lm, glm, randomForest, and rpart. Other packages can be installed by running install.packages(<lib name>)

Here is more detail on CNTK, Vowpal Wabbit, and xgboost.

CNTK:
This is an open source deep learning toolkit. It is a command line tool (cntk) and is already on the PATH.

To run a basic sample, do the following in the shell:

# Copy samples to your home directory and execute cntk
cp -r /dsvm/tools/CNTK-2016-02-08-Linux-64bit-CPU-Only/Examples/Other/Simple2d cntkdemo
cd cntkdemo/Data
cntk configFile=../Config/Simple.cntk

You will find the model output in ~/cntkdemo/Output/Models.

More info on CNTK: https://github.com/Microsoft/CNTK
https://github.com/Microsoft/CNTK/wiki

Vowpal Wabbit (vw):

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

To run the tool on a very basic example, do the following:

cp -r /dsvm/tools/VowpalWabbit/demo vwdemo
cd vwdemo
vw house_dataset

There are other, larger demos in that directory. Please refer to the VW documentation below for more info.

More info on vw: https://github.com/JohnLangford/vowpal_wabbit

xgboost:
This is a library designed and optimized for boosted tree algorithms. Its goal is to push the computation limits of machines to provide scalable, portable, and accurate large-scale tree boosting.

It is provided as a command line tool as well as an R library.

To use this library in R, start an interactive R session (just by typing R in the shell) and load the library.

Here is a simple example you can run at the R prompt:

library(xgboost)

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)

To run the xgboost command line, here are the steps to execute in the shell:

cp -r /dsvm/tools/xgboost/demo/binary_classification/ xgboostdemo
cd xgboostdemo
xgboost mushroom.conf

A .model file is written to that directory. Info about this demo example can be found at: https://github.com/dmlc/xgboost/tree/master/demo/binary_classification

More info on xgboost: https://xgboost.readthedocs.org/en/latest/
https://github.com/dmlc/xgboost
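Building on the basic vw run, a trained model can be saved and reused for scoring. A minimal sketch (the helper name is hypothetical; -f, -i, -t, and -p are vw's standard save-model, load-model, test-only, and predictions flags):

```shell
# Hypothetical helper: train on a dataset, save the model, then score the
# same file and write predictions, using Vowpal Wabbit's standard flags.
train_and_score() {
    data="$1"
    vw "$data" -f house.model                          # train and save the model
    vw -i house.model -t "$data" -p predictions.txt    # load model, predict only
}
```

For example, from the vwdemo directory above, "train_and_score house_dataset" leaves one prediction per example in predictions.txt.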
