Sunday, May 31, 2020

Quick Git command line tutorial

Git is a version control system that facilitates the creation and maintenance of versions in a project. It's usually used for software code, but it can also be used for things like documents. It's most useful for anything that is text based, as it lets you see which lines have changed across versions.

Assuming you've installed git, create a folder that will store your project, open your terminal (cmd in Windows), navigate to the folder, and turn it into a repository by entering:

git init .

This will create a hidden folder called ".git" in your current directory, which is where git stores everything it needs to treat the folder as a repository. From now on, whenever you want to perform a repository operation, just open your terminal and navigate back to this folder. To see this in action, we can ask for a status report of the repo by entering:

git status

This will output something like the following (the exact wording varies slightly between git versions):

On branch master

Initial commit

nothing to commit (create/copy files and use "git add" to track)
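Everything up to this point can be reproduced as a short script. The directory name 'demo-init' is just an example, and "git -C" is used so we don't have to change directory:

```shell
set -e
rm -rf demo-init                 # start from scratch
git init demo-init               # same as: mkdir demo-init; cd demo-init; git init .
ls -a demo-init                  # the hidden .git folder holds all repository data
git -C demo-init status          # reports an empty repo (wording varies by git version)
```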

This is basically telling us that the repo is empty. Now we can start putting stuff in it. Let's create a readme file using markdown, with the file name readme.md:

My first git repo
=================

This is my readme file.


After saving, if we go back to the terminal and enter:

git status

we will now see:

On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        readme.md

nothing added to commit but untracked files present (use "git add" to track)

This is saying that readme.md is a file in the repo folder that is not being kept track of by git. We can add this file to the git index by entering:

git add readme.md

After asking for the status again, we will now see:

On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   readme.md

Now the file is in the index and is 'staged'. If we update the file and save it again:

My first git repo
=================

This is my updated readme file.


the status will now say:

On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   readme.md

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   readme.md

This is saying that we have a staged file that has also been modified. We can check what has been changed in the file since we last staged it by entering:

git diff

diff --git a/readme.md b/readme.md
index 75db517..5bdd78c 100644
--- a/readme.md
+++ b/readme.md
@@ -1,4 +1,4 @@
 My first git repo
 =================

-This is my readme file.
+This is my updated readme file.

Diff shows you all the lines that were changed, together with some unchanged lines around them for context. The '-' line was replaced by the '+' line. The hunk header "@@ -1,4 +1,4 @@" tells us that the displayed lines start at line 1 and span 4 lines in both the old and the new version of the file; in this case the actual change is on line 4 (one line removed and one line added there).
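The diff above can be reproduced in a throwaway repo; 'demo-diff' and the identity settings are just placeholders:

```shell
set -e
rm -rf demo-diff
git init demo-diff
git -C demo-diff config user.email "me@example.com"   # placeholder identity
git -C demo-diff config user.name  "Me"
# Stage the original file contents.
printf 'My first git repo\n=================\n\nThis is my readme file.\n' > demo-diff/readme.md
git -C demo-diff add readme.md
# Modify the working copy without re-staging it.
printf 'My first git repo\n=================\n\nThis is my updated readme file.\n' > demo-diff/readme.md
# diff compares the working copy against what was last staged.
git -C demo-diff diff
```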

Now we stage this modification so that it is also kept track of:

git add readme.md

On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   readme.md

Staging changes is not the point of repositories. The point is committing the staged changes. A commit is a backup of the currently indexed files. When you take a backup, you make a copy of your project folder and give it a name so that you can go back to it. This is what a commit does. Enter:

git commit

This will open up a text editor so you can enter a description of what you're committing.
  • If the text editor is vim, which you will know because it is not helpful at all, you need to first press 'i' for 'insert' before you can type anything. When you're done, press ESC, then type ':wq' and press enter to save and quit (':q!' quits without saving).
  • If it's nano then just follow the on screen commands, noting that '^' means CTRL.

The commit message you write can later be searched in order to find a particular commit. Note that commits are generally thought of as being unchangeable after they are made, so make sure you write everything you need to. The first line of the commit message has special importance and is considered a summary of what has changed. Keep it under 50 characters so that it can be easily displayed in a table, and make it direct and concise (no full stop at the end, for example). The rest of the lines under it form a more detailed description that makes up for any information missing from the first line. A blank line needs to separate the first line from the rest. Note that it's fine to have only the first line if it's enough.

added the word 'updated' to readme

readme.md was updated so that it says that it is my updated readme file.
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# On branch master
#
# Initial commit
#
# Changes to be committed:
#       new file:  readme.md

The lines with the # in front were written by git and are ignored, so you can just leave them there. If we check the status again we will now see:

On branch master
nothing to commit, working tree clean

This is saying that there were no changes since the last commit. We can now see all of our commits by entering:

git log

Author: mtanti 
Date:   Sat May 30 10:46:17 2020 +0200

    added the word 'updated' to readme

    readme.md was updated so that it says that it is my updated readme file.

We can see who made the commit, when, and what message was written. It's important that commits are done frequently but on complete changes. A commit is not a save (you do not commit each time you save), it is a backup, and the state of your project in each backup should make sense. Think of it as if you're crossing out items in a todo list and with each item crossed out you're taking a backup. You should not take a backup in between items. On the other hand, your items should be broken down into many simple tasks in order to be able to finish each one quickly.
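As an aside, you can skip the editor entirely by passing the message on the command line: one -m for the summary line and an optional second -m for the body. A sketch in a throwaway repo ('demo-msg' and the identity are placeholders):

```shell
set -e
rm -rf demo-msg
git init demo-msg
git -C demo-msg config user.email "me@example.com"   # placeholder identity
git -C demo-msg config user.name  "Me"
echo "My first git repo" > demo-msg/readme.md
git -C demo-msg add readme.md
# First -m is the summary line, second -m is the detailed body.
git -C demo-msg commit \
    -m "added the word 'updated' to readme" \
    -m "readme.md was updated so that it says that it is my updated readme file."
git -C demo-msg log --oneline
```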

Now, let's add a folder called 'src' to our repo and then check the status.

On branch master
nothing to commit, working tree clean

In git's eyes, nothing has changed, because git does not have a concept of a folder, only of files and their paths. We need to put a file in the folder in order to be able to add anything to git's index. Let's add 2 text files to src, 'a.txt' and 'b.txt', each with the following content:

line 1
line 2
line 3


The status now shows:

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        src/

nothing added to commit but untracked files present (use "git add" to track)

This is saying that we have a new folder called 'src' with some untracked files in it. We can add the folder's contents by using the add command. If you want, you can avoid having to type the file names by just using a '.', which means "all unstaged modified or new files":

git add .

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   src/a.txt
        new file:   src/b.txt

Let's commit these two files.

git commit

added source files

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# On branch master
# Changes to be committed:
#       new file:   src/a.txt
#       new file:   src/b.txt

Note that I didn't add anything else to the message other than the first line. There is no need to specify what files were added as they can be seen in the commit. Now let's look at the log of commits:

git log

commit 59cfc3f057bf1f19038ab15c4357d97bc84ac30e (HEAD -> master)
Author: mtanti 
Date:   Sat May 30 11:17:14 2020 +0200

    added source files

commit f71f17b63c6b3ddb7506000cbc422e8f1b173958
Author: mtanti 
Date:   Sat May 30 10:46:17 2020 +0200

    added the word 'updated' to readme

    readme.md was updated so that it says that it is my updated readme file.

We can see how all the commits are shown in descending order of when they were made. You might be wondering what 'HEAD' is referring to. HEAD is the commit we are currently working on. We can move between commits, effectively moving back and forth in time. This is very useful if you start working on something and realise that there was a better way to do it and need to undo all your work back to a particular point in the commit history. When we do this, we are moving the HEAD to a different commit. The HEAD can be moved by using the checkout command:

git checkout HEAD~1

This is saying "go back one commit behind the HEAD". The command gives the following output:

Note: checking out 'HEAD~1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at f71f17b... added the word 'updated' to readme

This is saying that we are now at the commit with the message "added the word 'updated' to readme". It is also possible to use a hash as a commit identifier in order to jump to a commit directly, rather than relative to the HEAD. The hash is the 40-character hexadecimal number next to each commit in the log. For example, the first commit had a hash of 'f71f17b63c6b3ddb7506000cbc422e8f1b173958', so we could have entered "git checkout f71f17b63c6b3ddb7506000cbc422e8f1b173958". We can also use just the first 7 characters to avoid typing the whole thing, at the small risk of a collision with another hash.
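Since the hashes in your own repo will be different, a script has to capture them first; here rev-parse turns 'HEAD~1' into the full 40-character hash (the repo name is illustrative):

```shell
set -e
rm -rf demo-head
git init demo-head
git -C demo-head config user.email "me@example.com"   # placeholder identity
git -C demo-head config user.name  "Me"
echo "v1" > demo-head/file.txt
git -C demo-head add . && git -C demo-head commit -m "first commit"
echo "v2" > demo-head/file.txt
git -C demo-head add . && git -C demo-head commit -m "second commit"
first=$(git -C demo-head rev-parse HEAD~1)   # full hash of the first commit
git -C demo-head checkout "$first"           # detached HEAD at that commit
cat demo-head/file.txt                       # back to the old contents: v1
```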

If we look at the log now, we'll see:

commit f71f17b63c6b3ddb7506000cbc422e8f1b173958 (HEAD)
Author: mtanti 
Date:   Sat May 30 10:46:17 2020 +0200

    added the word 'updated' to readme

    readme.md was updated so that it says that it is my updated readme file.

which shows that the HEAD has been moved one commit behind in the timeline.

Now if we look at the project folder (and refresh), we'll see that the folder 'src' has been physically removed. We can restore it by moving forward in time and go to the latest commit which includes the 'src' folder.

Unfortunately, there is no direct notation for moving forward in time, as what we just did is not normal usage of git. Note that at the moment the HEAD is said to be 'detached', which means that it is not in its proper place (the end of the timeline). We can get back to the proper place by checking out 'master'.

git checkout master

Checking the log, we now see:

commit 59cfc3f057bf1f19038ab15c4357d97bc84ac30e (HEAD, master)
Author: mtanti 
Date:   Sat May 30 11:17:14 2020 +0200

    added source files

commit f71f17b63c6b3ddb7506000cbc422e8f1b173958
Author: mtanti 
Date:   Sat May 30 10:46:17 2020 +0200

    added the word 'updated' to readme

    readme.md was updated so that it says that it is my updated readme file.

So what is this 'master' business? What we were calling timelines are actually called 'branches' in git, and branches are one of the most important things in git. Imagine you've started working on a new feature in your program. Suddenly you are told to drop everything you're doing, fix a bug, and quickly release an update right away. The feature you're working on is halfway done and you can't release an updated version of the program with a half-finished feature; but there's no way you'll finish the feature quickly enough. Do you undo all the work you did on the feature so that you're back at a stable version of the program and able to fix the bug? Of course not. That's what branches are for.

With branches you can keep several versions of your code available and switch from one to the other with checkout. The master branch is the one you start with. Ideally, the master's last commit should always be in a publishable state (no 'work in progress'). Of course, if you're only working on the master branch then this would not be possible without committing very rarely, which is bad practice. The solution is to have several development branches on which you modify the project bit by bit. Every time you start working on a new publishable version of your project, you start a new branch and work on modifying your project to create the new version. Once you finish what you're doing and are happy with the result, you merge the development branch into the master branch. If you're in the middle of something and need to fix a bug, you switch to the master branch, create a new branch for fixing the bug, fix it, and merge it into master. Then you merge the new master code into your earlier development branch so that you can continue working as if nothing happened.

Let's make a development branch. Enter the following:

git branch mydevbranch

This will create a branch called 'mydevbranch' that sticks out from the current branch at the current HEAD; that is, it will create a branch that sticks out from master's last commit. By 'sticks out' I mean that when you switch to mydevbranch, the project will be changed to look like it did at the commit the branch starts from. Alternatively, you can include a commit hash after the name of the branch in order to make it stick out from an earlier point in the current branch. For example, "git branch mydevbranch f71f17b63c6b3ddb7506000cbc422e8f1b173958" will create a branch from the first commit.
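A sketch of both forms, using rev-parse to fetch the older commit's hash instead of typing it ('demo-branch' and the other names are placeholders):

```shell
set -e
rm -rf demo-branch
git init demo-branch
git -C demo-branch config user.email "me@example.com"   # placeholder identity
git -C demo-branch config user.name  "Me"
echo one > demo-branch/f.txt
git -C demo-branch add . && git -C demo-branch commit -m "first"
echo two >> demo-branch/f.txt
git -C demo-branch add . && git -C demo-branch commit -m "second"
git -C demo-branch branch mydevbranch        # branch from the current HEAD
git -C demo-branch branch oldbranch "$(git -C demo-branch rev-parse HEAD~1)"  # branch from the first commit
git -C demo-branch branch --list
```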

We can see a list of branches by entering:

git branch --list

* master
  mydevbranch

The asterisk shows which branch is currently active (has the HEAD).

Now switch to the new branch using checkout:

git checkout mydevbranch

and check the status:

On branch mydevbranch
nothing to commit, working tree clean

It now says that we're on mydevbranch instead of on master. Note that whilst we're on the new branch, any modifications we make to the project will be stored on the branch whilst master will remain as it is. Let's modify src/a.txt to look like this:

line 1
line 2 changed
line 3


And now add and commit this file and then check the log:

commit 9bc4488ac847bceccb746eeafb1a8c239de350f2 (HEAD -> mydevbranch)
Author: mtanti 
Date:   Sun May 31 10:24:23 2020 +0200

    changed line in a.txt

commit 59cfc3f057bf1f19038ab15c4357d97bc84ac30e (master)
Author: mtanti 
Date:   Sat May 30 11:17:14 2020 +0200

    added source files

commit f71f17b63c6b3ddb7506000cbc422e8f1b173958
Author: mtanti 
Date:   Sat May 30 10:46:17 2020 +0200

    added the word 'updated' to readme

    readme.md was updated so that it says that it is my updated readme file.

Note how the log shows us where each branch is and where the HEAD is. In this case the HEAD is at mydevbranch's last commit and master is one commit behind it. Let's add another commit after changing src/b.txt:

line 1
line 2
line 3
line 4


Now that we're done with the changes in this new version, we can merge the development branch into master by first switching to master and then merging mydevbranch into it:

git checkout master
git merge mydevbranch

Updating 59cfc3f..f962efb
Fast-forward
 src/a.txt | 2 +-
 src/b.txt | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

Note that this was a fast forward merge, which is the best kind of merge you can have. It happens when all the changes were made in a neat sequence which can be simply reapplied on the master, as opposed to having multiple branches changing stuff at the same time.
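The fast-forward scenario condensed into one script; note that newer git versions may call the default branch 'main', so it is renamed to 'master' explicitly here:

```shell
set -e
rm -rf demo-ff
git init demo-ff
git -C demo-ff config user.email "me@example.com"   # placeholder identity
git -C demo-ff config user.name  "Me"
echo base > demo-ff/a.txt
git -C demo-ff add . && git -C demo-ff commit -m "base commit"
git -C demo-ff branch -M master              # force the branch name used in this tutorial
git -C demo-ff checkout -b mydevbranch       # create and switch in one step
echo change > demo-ff/a.txt
git -C demo-ff add . && git -C demo-ff commit -m "dev change"
git -C demo-ff checkout master
git -C demo-ff merge mydevbranch             # fast-forward: master just catches up
git -C demo-ff log --oneline --graph
```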

Before seeing what a merge conflict looks like, let's look at the log again now that we've made the merge, but this time as a graph:

git log --graph

* commit f962efb409e4f08f94d717dec866519bc2848e8f (HEAD -> master, mydevbranch)
| Author: mtanti 
| Date:   Sun May 31 10:30:42 2020 +0200
|
|     added a new line to b.txt
|
* commit 9bc4488ac847bceccb746eeafb1a8c239de350f2
| Author: mtanti 
| Date:   Sun May 31 10:24:23 2020 +0200
|
|     changed line in a.txt
|
* commit 59cfc3f057bf1f19038ab15c4357d97bc84ac30e
| Author: mtanti 
| Date:   Sat May 30 11:17:14 2020 +0200
|
|     added source files
|
* commit f71f17b63c6b3ddb7506000cbc422e8f1b173958
  Author: mtanti 
  Date:   Sat May 30 10:46:17 2020 +0200

      added the word 'updated' to readme

      readme.md was updated so that it says that it is my updated readme file.

Here we can see a neat straight line moving through a timeline of commits, from the first to the last with no splits along the way. The master and development branch are together on the last commit in the timeline. Now let's see how this can be different.

Switch back to the development branch so that you can start working on the next version of the project:

git checkout mydevbranch

and change src/a.txt to have a new line:

line 1
line 2 changed
line 3
line 4


As you're working on the file, you get a call from your boss who tells you to immediately change all the lines to start with a capital letter in both src/a.txt and src/b.txt. Now what? You need to stop working on mydevbranch, go back to master, and create a new branch for working on the new request.

Before switching branches it's important that you do not have any uncommitted changes in your project, otherwise they will be carried over to the master branch and that would make things confusing. If you're not at a commitable point in your change then you can save the current state of the branch files by stashing them instead:

git stash push

This will add a stash entry on top of the current commit of the current branch, which can later be retrieved and removed. Next, check out master:

git checkout master

Create and switch to a new branch.

git branch myurgentbranch
git checkout myurgentbranch

and apply the urgent modifications.

src/a.txt
Line 1
Line 2 changed
Line 3


src/b.txt
Line 1
Line 2
Line 3
Line 4


Commit these changes and merge to master:

git add .
git commit
git checkout master
git merge myurgentbranch

The merge should be a fast forward since we started working on the last commit of master with no interruptions. The log will show us the current situation:

git log --graph

* commit 5dcd492113d3942550a58efdc7b90e15bd36d537 (HEAD -> master, myurgentbranch)
| Author: mtanti 
| Date:   Sun May 31 11:06:39 2020 +0200
|
|     capitalised each line in source files
|
* commit f962efb409e4f08f94d717dec866519bc2848e8f (mydevbranch)
| Author: mtanti 
| Date:   Sun May 31 10:30:42 2020 +0200
|
|     added a new line to b.txt
|
* commit 9bc4488ac847bceccb746eeafb1a8c239de350f2
| Author: mtanti 
| Date:   Sun May 31 10:24:23 2020 +0200
|
|     changed line in a.txt
|
* commit 59cfc3f057bf1f19038ab15c4357d97bc84ac30e
| Author: mtanti 
| Date:   Sat May 30 11:17:14 2020 +0200
|
|     added source files
|
* commit f71f17b63c6b3ddb7506000cbc422e8f1b173958
  Author: mtanti 
  Date:   Sat May 30 10:46:17 2020 +0200

      added the word 'updated' to readme

      readme.md was updated so that it says that it is my updated readme file.

You can see that when we commit the changes in mydevbranch, we'll get a fork in the timeline instead of the straight line that we currently have. Let's see what happens then.

Now that we're done with the urgent request we can go back to the normal development branch:

git checkout mydevbranch

and pop back the changes we stashed:

git stash pop

Note that "git stash list" shows what is in the stash. We were working on src/a.txt where we were adding a new line to the file:

line 1
line 2 changed
line 3
line 4

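The stash round-trip on its own, in a throwaway repo (names are placeholders):

```shell
set -e
rm -rf demo-stash
git init demo-stash
git -C demo-stash config user.email "me@example.com"   # placeholder identity
git -C demo-stash config user.name  "Me"
echo "line 1" > demo-stash/a.txt
git -C demo-stash add . && git -C demo-stash commit -m "initial"
echo "line 2" >> demo-stash/a.txt    # uncommitted work in progress
git -C demo-stash stash push         # shelve it; the working tree is clean again
git -C demo-stash stash list         # shows one entry: stash@{0}
git -C demo-stash stash pop          # bring the shelved work back
cat demo-stash/a.txt                 # "line 2" is restored
```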

Now let's imagine that we finished the changes we were making and can commit them:

git add .
git commit

Now a.txt is supposed to have two changes: the new line and each line starting with a capital letter. Each of these changes is on a different branch. Let's start by completing the development branch by merging the changes we applied to master into the development branch (note that we're reversing the direction of the merge now because we want the development branch to be up to date).

git merge master

Auto-merging src/a.txt
CONFLICT (content): Merge conflict in src/a.txt
Automatic merge failed; fix conflicts and then commit the result.

This is when things start getting hairy as you'll need to manually fix your conflicting changes. The status will tell us which files need to be fixed:

On branch mydevbranch
You have unmerged paths.
  (fix conflicts and run "git commit")
  (use "git merge --abort" to abort the merge)

Changes to be committed:

        modified:   src/b.txt

Unmerged paths:
  (use "git add <file>..." to mark resolution)

        both modified:   src/a.txt

This is saying that during the merge, src/b.txt was updated with no conflicts, but src/a.txt requires manual intervention. If we open src/a.txt we'll see the following:

<<<<<<< HEAD
line 1
line 2 changed
line 3
line 4
=======
Line 1
Line 2 changed
Line 3
>>>>>>> master


The file has been modified by git to show the conflicting changes. Rows of 7 arrows and equals signs are used to mark the sections that need to be resolved: the part between '<<<<<<< HEAD' and '=======' is our branch's version, and the part between '=======' and '>>>>>>> master' is master's version. Since all the lines were changed here, the section covers the whole file. Now we can either fix the file directly or use "git mergetool" to help us by showing all the changes. In this case we can modify the file directly:

Line 1
Line 2 changed
Line 3
Line 4


Make sure to remove all the arrows and equals signs. Status will now output:

All conflicts fixed but you are still merging.
  (use "git commit" to conclude merge)

Changes to be committed:

        modified:   src/a.txt
        modified:   src/b.txt

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   src/a.txt

It's now saying that conflicts were fixed. All we need to do is add the fixed file and continue the merge.

git add src/a.txt
git merge --continue

At the end of the merge, you will be asked to enter a commit message in order to automatically commit the merge change. Git automatically puts the message "Merge branch 'master' into mydevbranch", which is good enough.
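The whole conflict scenario can be reproduced and resolved in one script; the resolution step is just an echo here in place of hand-editing the file, and all names are illustrative:

```shell
set -e
rm -rf demo-conflict
git init demo-conflict
git -C demo-conflict config user.email "me@example.com"   # placeholder identity
git -C demo-conflict config user.name  "Me"
echo "line 1" > demo-conflict/a.txt
git -C demo-conflict add . && git -C demo-conflict commit -m "base"
git -C demo-conflict branch -M master
git -C demo-conflict checkout -b mydevbranch
echo "line 1 plus more" > demo-conflict/a.txt
git -C demo-conflict add . && git -C demo-conflict commit -m "dev edit"
git -C demo-conflict checkout master
echo "Line 1" > demo-conflict/a.txt
git -C demo-conflict add . && git -C demo-conflict commit -m "master edit"
git -C demo-conflict checkout mydevbranch
git -C demo-conflict merge master || true      # fails: both branches changed the same line
grep '<<<<<<<' demo-conflict/a.txt             # conflict markers are now in the file
echo "Line 1 plus more" > demo-conflict/a.txt  # resolve by writing the combined version
git -C demo-conflict add a.txt
git -C demo-conflict commit -m "Merge branch 'master' into mydevbranch"
```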

The log now shows the new timeline:

*   commit 49abf32812ec7dbeeab792729098a61fd3446a45 (HEAD -> mydevbranch)
|\  Merge: b6cc3c8 5dcd492
| | Author: mtanti 
| | Date:   Sun May 31 11:36:50 2020 +0200
| |
| |     Merge branch 'master' into mydevbranch
| |
| * commit 5dcd492113d3942550a58efdc7b90e15bd36d537 (myurgentbranch, master)
| | Author: mtanti 
| | Date:   Sun May 31 11:06:39 2020 +0200
| |
| |     capitalised each line in source files
| |
* | commit b6cc3c887533a995e749589b6cdbfaaad530b03e
|/  Author: mtanti 
|   Date:   Sun May 31 11:17:55 2020 +0200
|
|       added new line to a.txt
|
* commit f962efb409e4f08f94d717dec866519bc2848e8f
| Author: mtanti 
| Date:   Sun May 31 10:30:42 2020 +0200
|
|     added a new line to b.txt
|
* commit 9bc4488ac847bceccb746eeafb1a8c239de350f2
| Author: mtanti 
| Date:   Sun May 31 10:24:23 2020 +0200
|
|     changed line in a.txt
|
* commit 59cfc3f057bf1f19038ab15c4357d97bc84ac30e
| Author: mtanti 
| Date:   Sat May 30 11:17:14 2020 +0200
|
|     added source files
|
* commit f71f17b63c6b3ddb7506000cbc422e8f1b173958
  Author: mtanti 
  Date:   Sat May 30 10:46:17 2020 +0200

      added the word 'updated' to readme

      readme.md was updated so that it says that it is my updated readme file.

We can now continue modifying our development branch with anything left to add in the new version, and then switch to master and merge, which will be a fast forward since we have already resolved all conflicts. Note that if you view the log before merging, you will not see the development branch since it is not in master's timeline. You can see all timelines by entering "git log --graph --all".

git checkout master
git merge mydevbranch

If you want to delete the urgent branch, just enter "git branch --delete myurgentbranch".

This concludes our quick tutorial. I didn't mention anything about remote repositories and pushing and pulling to and from them, but basically, if you set up a GitHub account, you can keep a backup of your repository online for multiple developers to work on together: you push the local repository to the online one and pull from the online repository when it is changed by someone else. The online repository is referred to as 'origin' in git.
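You can even simulate an 'origin' locally with a bare repository, which plays the same role that GitHub would; all the paths and names here are illustrative:

```shell
set -e
rm -rf origin.git demo-remote clone-remote
git init --bare origin.git                    # a bare repo acts as the shared 'origin'
git init demo-remote
git -C demo-remote config user.email "me@example.com"   # placeholder identity
git -C demo-remote config user.name  "Me"
echo hello > demo-remote/readme.md
git -C demo-remote add . && git -C demo-remote commit -m "first commit"
git -C demo-remote branch -M master
git -C demo-remote remote add origin ../origin.git      # normally a GitHub URL
git -C demo-remote push -u origin master                # upload the local history
git -C origin.git symbolic-ref HEAD refs/heads/master   # point the bare repo at master
git clone origin.git clone-remote                       # another developer pulls a copy
```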

Tuesday, January 30, 2018

A quick guide to running deep learning applications on Amazon AWS virtual server supercomputer

Deep learning is serious business and if you want to work on sizeable problems you're going to need more hardware than you probably have at home. Here is how to use Amazon Web Services in order to be able to upload your code and datasets to an online server which will run it for you and then let you download the results. It's basically a virtual computer that you use over the Internet by sending commands.

First of all, when I got started I was initially following this video on how to use AWS:
https://youtu.be/vTAMSm06baA?t=1476

AWS Educate

If you're a student then you can take advantage of AWS Educate: if you are accepted, you will receive 100 free hours of runtime to get started. 100 hours of deep learning will not last long, but it will allow you to experiment and get your bearings with AWS without worrying about the bill too much. Note that it may take a few days to receive a reply. Here's the link:
https://aws.amazon.com/education/awseducate/

Get the applications you need

If you're on Windows, then before creating your server you should install WinSCP and PuTTY in order to be able to upload your data and run your programs. If you're not on Windows then you can use the terminal with ssh to do this.

Create an account

Start by visiting this link and making an account:
https://aws.amazon.com/

You need to provide your credit card details before starting. You will be charged automatically every month, so make sure to keep an eye on your usage, as you pay per hour for the time you let the server run.

Enter the AWS console

Next go into the AWS console which is where you get to manage everything that has to do with virtual servers:
https://console.aws.amazon.com/

Note that there is the name of a city or state at the top, such as "Ohio". This indicates the region where your servers will be hosted. Amazon has servers all around the world and you might prefer one region over another, mostly to reduce latency. Different regions also have different prices, so that might take priority in your decision. Ohio and W. Virginia seem to be the cheapest. See here for more information:
https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/

Keep in mind that each region has its own interface, so if you reserve a server in one region, you can only see and configure it while that region is selected. There is no list of all your servers across all regions, so make sure you remember which regions you use.

After having chosen a region, next go to Services and click on EC2:
https://console.aws.amazon.com/ec2/

Create a virtual server

Click on the big blue button called "Launch Instance" in order to create your virtual server. You can now see a list of AMIs which are preconfigured virtual servers that are specialized for some kind of task, such as deep learning. You're going to copy an instance of one of these and then upload your files to it. Look for the AMI called "Deep Learning AMI (Ubuntu)" which contains a bunch of deep learning libraries together with CUDA drivers. Click on "Select".

This is where you choose the computing power you want. The better it is, the more expensive it is. If you just want to see how it works then choose a free one which says "Free tier eligible". If you want to get down to business then choose one of the "GPU Compute" instances such as "p2.xlarge", which has 12GB of GPU memory. The "p2.xlarge" costs about 90 cents per hour (billing starts from the moment you launch the instance, not when you start running your programs, so it also includes the time it takes to upload your data).

If this is your first time creating a powerful instance then you will first need to ask Amazon to let you use it (this is something they do to avoid people hogging all the resources). Under "Regarding" choose "Service Limit Increase", under "Limit Type" choose "EC2 Instances", and under "Region" choose the region you selected previously. You also have to say something about how you'll use it. After being granted access you can then continue from where we left off.

After ticking the check box next to your selected computer power, click on "Next: Configure Instance Details".

Leave this next step with default settings. Click on "Next: Add Storage".

This is where you choose your storage space. You will need at least 50GB of space in order to store the deep learning AMI but you will need additional space for your datasets and results. Keep in mind that you have to pay per GB per month for storage. If you need frequent access to the hard drive (such as loading minibatches from disk) then you'll need to use an SSD drive which costs about 10 cents per GB per month. Otherwise you can use a magnetic drive which costs about 5 cents per GB per month.
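
As a rough sanity check on how these numbers add up, the cost arithmetic can be sketched in a few lines of Python (the rates are the approximate figures quoted above, not official AWS pricing, and the workload is made up):

```python
# Rough cost estimate using the approximate rates quoted in the text.
# Actual AWS pricing varies by region and changes over time.

def monthly_storage_cost(gb, ssd=True):
    """Approximate storage cost in USD per month."""
    rate_per_gb = 0.10 if ssd else 0.05
    return gb * rate_per_gb

def compute_cost(hours, rate_per_hour=0.90):
    """Approximate p2.xlarge compute cost in USD."""
    return hours * rate_per_hour

# Hypothetical workload: 100GB of SSD storage for a month
# plus 20 hours of p2.xlarge time.
total = monthly_storage_cost(100, ssd=True) + compute_cost(20)
print(round(total, 2))  # about 28 USD
```

Storage is billed whether the instance is running or not, which is why it pays to delete large datasets you no longer need.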

Next click on "Next: Add Tags". This is where you make up some informative tags for your virtual server in order to differentiate it from other virtual servers. You can do something like "Name=Deep learning". If you only have one server then this is not important. Click on "Next: Configure Security Group".

This is where you create some firewall rules to make sure that only your computer has access to the virtual server, even if someone else has the key. If this doesn't work for you and you can't connect at all, even from your own IP address, then choose "Anywhere" as the source, which places no restrictions based on IP. Click "Review and Launch".

As soon as you click on "Launch" at the bottom the clock starts ticking and you will start being billed. If you've got the AWS Educate package then you will have 100 free hours but they will start being used up. You can stop an instance any time you want but you will be billed for one hour as soon as it starts, even if you stop it immediately.

If this is your first time using a server in the selected region (the place you want your server to be in), you will be asked to create a cryptographic private key: a file that is used instead of a password to connect to the virtual server. If you're on Windows, you need PuTTYgen, a program that comes with PuTTY, to convert the .pem file that Amazon gives you into a .ppk file that PuTTY can use. Follow the section called "To convert your private key" in the following link to do this:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console

Connecting to your virtual server

Now that we have our server ready we need to connect to it, upload our stuff, and get it to run. We're also on the clock so we need to hurry. Start by visiting your list of server instances, which can be found in the side bar when you're in the EC2 dashboard:
https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId

Your server should be running. You can stop it by right clicking it and, under "Instance state", choosing "Stop". This stops the hourly charges, but remember that you are billed for at least one hour every time you start it.

Clicking on a server and then clicking on "Connect" will give you information to access the server using PuTTY or ssh.

If you're using Windows, connect to your server using PuTTY; this guide explains how:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console
After connecting using PuTTY you can transfer the configuration to WinSCP by opening WinSCP, going on Tools, and choosing "Import sites". Now you can connect using WinSCP in order to upload and download files as well.

If you're not using Windows then you should use the terminal and follow this guide instead:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
From Linux you can upload files using FileZilla, or with the normal file manager by pressing Ctrl+L and entering the SFTP location.

Upload all the files you need, including datasets and scripts. You can zip the files before uploading them and then unzip them on the server using the "unzip" Linux command.
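
If you prefer to script the zipping step, here's a minimal sketch using Python's standard zipfile module (the folder and archive names are just examples):

```python
# Sketch: bundle a project folder into a single zip file before uploading.
# The folder name "my_experiment" is just an example.
import os
import zipfile

def zip_folder(folder, zip_path):
    """Recursively add every file under `folder` to a zip archive."""
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # Store each file under its path relative to the folder.
                zf.write(full_path, os.path.relpath(full_path, folder))

if os.path.isdir('my_experiment'):
    zip_folder('my_experiment', 'my_experiment.zip')
```

On the server side, "unzip my_experiment.zip" (or Python's zipfile again) unpacks it.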

DO NOT INSTALL LIBRARIES WITH PIP YET! See the next section first.

Using your virtual server

Finally we are close to running our scripts, but we still need to do two more things first.

First of all, look at the message at the top of the PuTTY or terminal window, which tells you how to activate different Anaconda environments. These are Python environments with different libraries available, already hooked up to CUDA so that you will be able to run on the GPU. In PuTTY or the terminal enter "source activate <environment name>". Remember this line; you'll need it again later.

Now you can install anything you want using pip.

Secondly, as soon as you close PuTTY or the terminal, all running processes will stop (but the instance will still be running, so you'll still be paying). What you need to do is use the application called "screen", which keeps everything running even after you disconnect. See this link:
https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/

Now you'll need to activate the environment again because screen creates a new session which is disconnected from the previous one.

Hooray!

You're done! You can now start using your server. When you're finished, make sure to stop it. You can even terminate it, but that completely deletes the server along with all its data, so only do it when you're really done; otherwise you'll waste money reuploading and rerunning everything.

You can see your current bill by clicking on Services and going to Billing. This will show you your current month's bill as well as previous ones, and you can even get a daily breakdown and forecasts.
https://console.aws.amazon.com/billing/home

Saturday, October 21, 2017

Hyperparameter tuning using Hyperopt

One of the most tedious but important tasks in machine learning is tuning the hyperparameters of your learning algorithm, such as the learning rate and initial parameters in gradient descent. For example, you might want to check how your gradient descent algorithm performs when the learning rate is 1.0, 0.1, or 0.01 and the initial parameters are chosen randomly between -10.0 and 10.0 or between -1.0 and 1.0.

The problem is that, unless you have several supercomputers at your disposal, testing the learning algorithm is slow and there are many different ways to configure your hyperparameters. In the above example you'd have to try six different configurations: (1.0,(-10.0,10.0)), (1.0,(-1.0,1.0)), (0.1,(-10.0,10.0)), etc. An exhaustive search (grid search) takes too long when each test is slow, and in practice the search space is much bigger than six configurations. We can test a random sample of the configurations, but ideally the sample would be chosen intelligently rather than randomly: start out by trying some randomly chosen configurations and then home in on the more promising hyperparameter choices.
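
For instance, the six configurations from the example above can be enumerated with itertools.product — this sketch is what a naive grid search would loop over:

```python
# Enumerate every hyperparameter combination for an exhaustive grid search.
import itertools

learning_rates = [1.0, 0.1, 0.01]
init_ranges = [(-10.0, 10.0), (-1.0, 1.0)]

configurations = list(itertools.product(learning_rates, init_ranges))
print(len(configurations))  # 6

for learning_rate, init_range in configurations:
    # In a real grid search you would train and evaluate the model here,
    # which is exactly what makes exhaustive search so expensive.
    pass
```

With more hyperparameters the number of combinations multiplies, which is why sampling them intelligently matters.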

This is where the Python library Hyperopt comes in. It's a simple library that searches a space of hyperparameters in a rational way, so that after trying 100 configurations you'll end up with a better result than if you had just randomly sampled 100 configurations and picked the best performing one. It does this using a Tree-structured Parzen Estimator (TPE).

Let's start with a simple gradient descent example that finds the minimum of "y = x^2". We'll write a function that takes in a learning rate and a range within which to initialize "x" (we'll only ask for the maximum of this range and assume that its negative is the minimum). We'll then apply gradient descent to the initial "x" value for 10 epochs. The function finally returns the value of "x" it arrives at, which should be near the minimum.

import random

def loss(x):
    # The loss function to minimize: y = x^2.
    return x**2

def loss_grad(x):
    # Derivative of the loss with respect to x.
    return 2*x

def graddesc(learning_rate, max_init_x):
    # Start from a random x in [-max_init_x, max_init_x] and
    # take 10 gradient descent steps towards the minimum.
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x
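
As a quick sanity check (a sketch with the random generator seeded so the run is reproducible, and with the definitions repeated to keep it self-contained): with a learning rate of 0.1, each step multiplies "x" by 1 - 0.1×2 = 0.8, so ten steps shrink it to about 0.8^10 ≈ 0.11 of its starting value:

```python
import random

def loss_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x):
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x

random.seed(0)  # seeded only to make this check reproducible
result = graddesc(0.1, 1.0)
# Each step scales x by 0.8, so |result| <= 0.8**10 * 1.0, i.e. about 0.107.
print(abs(result) <= 0.8**10)  # True
```

A learning rate of 1.0, by contrast, makes each step flip x to its negative (x - 1.0×2x = -x), so it never converges — exactly the kind of thing hyperparameter search should discover.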

Great, now we want to find the best hyperparameters to use. In practice we'd have a validation set for our machine learning model to perform well on; once the hyperparameters that give the best performance on the validation set are found, we'd evaluate the resulting model on a separate test set, and it is this performance that is used to judge the quality of the learning algorithm. For this blog post, however, we'll focus on the simpler mathematical optimization problem of finding the minimum of "y = x^2".

This is how you use Hyperopt to find the best hyperparameter combination. First you create a function called an objective function that takes in a tuple with the chosen hyperparameters and returns a dictionary that is read by hyperopt to assess the fitness of the chosen hyperparameters. Then you take this function and the possible hyperparameter choices you want to allow and pass them to the hyperopt function called "fmin" which finds the hyperparameters that give the smallest loss.

import hyperopt

def hyperopt_objective(hyperparams):
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,
        }

best = hyperopt.fmin(
        hyperopt_objective,
        space     = [
                hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
                hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
            ],
        algo      = hyperopt.tpe.suggest,
        max_evals = 10
    )
print(best)

The output of this program is this:
{'learning_rate': 1, 'max_init_x': 1}
This is saying that the best loss is obtained when the learning rate is the item at index 1 in the given list (0.1) and the maximum initial value is the item at index 1 in the given list (1.0). The "space" parameter in fmin is there to say how to construct a combination of hyperparameters. We specified that we want a list of two things: a learning rate that can be either 1.0, or 0.1, or 0.01, and a maximum initial value that can be either 10.0 or 1.0. We use "hp.choice" to let fmin choose among a list. We could instead use "hp.uniform" in order to allow any number within a range. I prefer to use a list of human friendly numbers instead of allowing any number so I use the choice function instead. We have also said that we want to allow exactly 10 evaluations of the objective function in order to find the best hyperparameter combination.
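
Since the result only contains indices, a small helper can map it back to the actual values (a sketch; Hyperopt itself also provides a space_eval function for this purpose):

```python
# Map the index-based result of hp.choice back to the underlying values.
# The option lists are repeated here from the search space above.
choices = {
    'learning_rate': [1.0, 0.1, 0.01],
    'max_init_x':    [10.0, 1.0],
}

def resolve(best):
    """Replace each hp.choice index with the value at that index."""
    return {name: choices[name][index] for (name, index) in best.items()}

print(resolve({'learning_rate': 1, 'max_init_x': 1}))
# {'learning_rate': 0.1, 'max_init_x': 1.0}
```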

Although this is how we are expected to use this library, it is not very user friendly in general. For example there is no feedback given throughout the search process so if each evaluation of a hyperparameter combination takes hours to complete then we could end up waiting for several days without seeing anything, just waiting for the function to return a value. The return value also requires further processing as it's just indexes. We can make this better by adding some extra stuff to the objective function:

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
    global eval_num
    global best_loss
    global best_hyperparams

    eval_num += 1
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    print(eval_num, l, hyperparams)

    if best_loss is None or l < best_loss:
        best_loss = l
        best_hyperparams = hyperparams

    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,
        }

We can now see each hyperparameter combination being evaluated. This is the output we'll see:
1 1.6868761146697238 (1.0, 10.0)
2 0.34976768426779775 (0.01, 1.0)
3 0.006508209785146999 (0.1, 1.0)
4 1.5999357079405185 (0.01, 10.0)
5 0.2646974732349057 (0.01, 1.0)
6 0.5182259594937579 (1.0, 10.0)
7 53.61565213613977 (1.0, 10.0)
8 1.8239879002601682 (1.0, 10.0)
9 0.15820396975495435 (0.01, 1.0)
10 0.784445725853568 (1.0, 1.0)

Also, the variable best_hyperparams will contain the tuple with the best hyperparameters found. Printing best_hyperparams will show "(0.1, 1.0)". We can even save the best hyperparameters found so far in a file, so that we can stop the search early if we run out of patience.
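
The file-saving idea can be sketched like this (the helper and filename are made up for illustration; you would call something like it from inside the objective function whenever the loss improves):

```python
# Sketch: checkpoint the best hyperparameters found so far to a JSON file,
# so the search can be abandoned early without losing the best result.
# The filename "best_hyperparams.json" is just an example.
import json

def save_if_best(l, hyperparams, best_loss, path='best_hyperparams.json'):
    """Write hyperparams to disk when the loss improves; return the new best loss."""
    if best_loss is None or l < best_loss:
        with open(path, 'w') as f:
            json.dump({'loss': l, 'hyperparams': list(hyperparams)}, f)
        return l
    return best_loss

best = None
best = save_if_best(0.5, (0.1, 1.0), best)   # improvement: file written
best = save_if_best(0.9, (1.0, 10.0), best)  # no improvement: file untouched
print(best)  # 0.5
```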

Here is the full code in one place:

import random
import hyperopt

def loss(x):
    return x**2

def loss_grad(x):
    return 2*x

def graddesc(learning_rate, max_init_x):
    x = random.uniform(-max_init_x, max_init_x)
    for _ in range(10):
        x = x - learning_rate*loss_grad(x)
    return x

eval_num = 0
best_loss = None
best_hyperparams = None
def hyperopt_objective(hyperparams):
    global eval_num
    global best_loss
    global best_hyperparams

    eval_num += 1
    (learning_rate, max_init_x) = hyperparams
    l = loss(graddesc(learning_rate, max_init_x))
    print(eval_num, l, hyperparams)

    if best_loss is None or l < best_loss:
        best_loss = l
        best_hyperparams = hyperparams

    return {
            'loss':   l,
            'status': hyperopt.STATUS_OK,
        }

hyperopt.fmin(
        hyperopt_objective,
        space     = [
                hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),
                hyperopt.hp.choice('max_init_x',    [ 10.0, 1.0 ]),
            ],
        algo      = hyperopt.tpe.suggest,
        max_evals = 10
    )

print(best_hyperparams)

To find out more about Hyperopt see this documentation page and the GitHub repository.

Sunday, May 22, 2016

Making your own deep learning workstation: From Windows to Ubuntu dual boot with CUDA and Theano

I recently managed to turn my Windows 7 gaming PC into a dual boot GPU enabled deep learning workstation which I use for deep learning experiments. Here is how I did it.

First of all, you'll want to use Theano as a deep learning framework. Theano automatically optimises your code to use the GPU, the fast parallel processor on your graphics card, which makes deep learning faster (twice as fast on my computer). Unfortunately this seems to be difficult to do on Windows but easy on Linux, so I created a partition for Ubuntu Linux (splitting a hard disk into two separate spaces which can contain two different operating systems) and made my computer dual boot (so that I can choose whether to use Windows or Ubuntu when my computer is switched on). The most compatible Ubuntu version, which most tutorials focus on, seems to be Ubuntu 14.04, which is what I installed. I then installed CUDA, NVIDIA's platform that lets you write programs which use the GPU. After that, I installed Theano together with Keras, an even easier framework built on top of it that lets you do deep learning with only a few lines of Python code.

Be warned that I had to reinstall Ubuntu multiple times because I followed different tutorials which didn't work. Here I will link to the tutorials which worked for me.

Create a partition for Ubuntu
Tutorial
The first step is to create a partition in your C: drive in which to install Ubuntu. I made a partition of 100GB using the Windows Disk Management tool. Follow the instructions in the tutorial.

Install Ubuntu 14.04 into the partition (yes, it is recommended to use 14.04), together with making the computer dual boot
Tutorial
Follow the instructions in the tutorial to use BCDEdit, the Windows boot manager tool, to make it possible to choose whether to use Windows or Ubuntu. Stop at step 5.

To continue past step 5, download the Ubuntu 14.04 ISO file from http://releases.ubuntu.com/trusty/. I chose the 64-bit PC (AMD64) desktop image but you might need the 32-bit one instead. After downloading it, burn it onto an empty DVD, restart your computer, and pop in the DVD before it starts booting. Assuming your boot priority list has the DVD drive before the hard disk (changing this, if it's not the case, depends on your BIOS and motherboard, but it usually requires pressing F12 when the computer is starting up), you should be asked whether to install or try Ubuntu. Choose to try Ubuntu and, once the desktop is loaded, start the installation application.

At this point you're on step 6 of the tutorial. Be sure to choose "something else" when asked how to install Ubuntu (whether it's alongside your other operating system or replacing whatever you already have on your computer). Create a 1GB partition for swap space and use the remaining unformatted space for the Ubuntu partition. You can find instructions on this at http://askubuntu.com/questions/343268/how-to-use-manual-partitioning-during-installation/521195#521195. It's very important that the device for the boot loader is set to the same partition where Ubuntu will be installed, otherwise Windows will not show when you turn on your computer. Do not restart; choose to continue testing the Live CD instead.

You are now in step 8 in the tutorial. This is where you make it possible to select Ubuntu when you switch on your computer. Here is a list of steps you need:
  1. First, you need to mount the Windows and Ubuntu partitions to make them accessible. In Ubuntu there is no "My Computer" like in Windows where you can see all the partitions; instead the drives and partitions are shown in the sidebar. Click on the ones whose sizes match the Windows C: drive and the new Ubuntu partition (clicking on them also opens them so you can check their contents). This will mount the partitions.
  2. Next you need to know the partition name of the Ubuntu partition. Click on the Ubuntu icon on the side and search for a program called GParted. Open the program and use the drop down list in the top corner to look for the partition you created for Ubuntu (selecting different choices in the drop down list will show you partitions in different physical hard disks you have). Take note of the name of the Ubuntu partition such as "/dev/sda2". Call this name [Ubuntu partition].
  3. Next you need to know the directory path to the Windows partition. In an open folder window (or go on the Files icon in the side), click on "Computer" on the side, go into "media", open the "ubuntu" folder, and there you will find all the mounted partitions. Find the Windows partition. Call the directory path to this partition icon [Windows partition] (e.g. /media/ubuntu/abcdef-123456).
  4. Open the terminal (ctrl+alt+T) and copy the following line into it. Where it says '[Windows partition]', just drag and drop the partition icon into the terminal and the whole path will be written for you.
    sudo dd if=[Ubuntu partition] of='[Windows partition]/ubuntu.bin' bs=512 count=1

Now you can restart, remove the DVD, and log into your newly installed Ubuntu operating system, update it, and do another restart.

Installing CUDA
Tutorial
This is the part that took the most tries since only one tutorial actually worked well and gave no errors. Follow the instructions in the tutorial. When downloading the CUDA driver be sure to download the .deb file.

Installing Theano with dependencies
Tutorial
Now comes the straightforward part. Unfortunately it's easier to just use Python 2 than to try to make things work with Python 3, and the examples in Keras are for Python 2 anyway. Just open the terminal and paste:
sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git

Also install an extra package which will let you test that Theano was installed correctly:
sudo pip install nose-parameterized

Now you can install Theano and Keras:
sudo pip install Theano
sudo pip install keras

Now whenever you want to run a Theano or Keras Python script and use the GPU, all you have to do is enter the following in the terminal:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python [python script]

You may also want to use Python IDLE in order to experiment with Theano, in which case install it like this:
sudo apt-get install idle
and then run it to use the GPU like this:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 idle
or do this to debug an error in your program:
THEANO_FLAGS=mode=FAST_COMPILE,device=cpu,floatX=float32 idle

I also recommend this good Theano tutorial to get you started.

Has everything worked? Leave a comment if you think I should add something else to this tutorial.

Monday, September 19, 2011

A lousy tutorial to C# drag and drop

Drag and drop is a neat way to allow the user to "transfer" data into a WinForms control. Here's how to enable drag and drop in a Windows Forms application in C#:

1. Create your source and target controls. In this case we're using 2 ListViews, where the one on the right (called listView1) is the source of the drag and the one on the left (called listView2) is the target of the drop. Note, however, that the source of the drag can even be Windows Explorer, by dragging and dropping a file into the control.

2. Set the AllowDrop property of the control which will receive the drop to true.

3. Set up an event on the source control that senses that a drag has been initiated, such as the ItemDrag event or the DragLeave event, and call the DoDragDrop method from it. The DoDragDrop method is passed the data to be transferred (the object you want the receiving control to get).

4. Set up the DragOver event on the target control to check whether the data being dragged over can be accepted, and change the cursor to an invalid cursor picture (by setting e.Effect to DragDropEffects.None) if not. You can check the type of the data being dragged using the GetDataPresent method.

5. Set up the DragDrop event in the control which will be dropped on, to actually do something with the data once it has been dropped. This event will only fire if the DragOver event did not set Effect to None.

And there you have it. A lousy tutorial. However, here's something worth mentioning:

If you are transferring an object which could be one of several types and you want to view the object as its base type, this will not work: the GetData method will just return null if you pass it a base class of the actual type. The only way I found to get around this was to create a proxy object which contains the object of interest cast to its base class; when you receive the proxy object in the receiving control, you just get the object of interest out of it. Like so: