Thursday, April 24, 2014

Upload backups to Amazon S3 bucket using boto, Python

Here is a script to upload the last day's backups (files modified within the past 24 hours) to an S3 bucket. We want to exclude transaction log backups, so I used "if 'LOG' not in name" - you may change it accordingly.

import os, time
from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection('YOURKEY','SECRETKEY')
bucket = conn.get_bucket('MyBeautifulBucket')
age = 86400 #1 day = 24*60*60
backupdir = 'O:\\Backups\\'

for root, dirs, files in os.walk(backupdir, topdown=False):
    for name in files:
        if 'LOG' not in name: #exclude copying transaction log backups
            f = os.path.join(root, name)
            now = time.time()
            modified = os.stat(f).st_mtime
            if modified > now - age: #modified within the last day
                k = Key(bucket)
                k.key = 'mySubBucket/' + str(name)
                k.set_contents_from_filename(f)
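The age test is easy to get backwards (the check above keeps files modified within the last day, not older ones), so here is the predicate pulled out into a tiny, testable helper. This is an illustrative sketch, not part of the original script:

```python
def modified_within(mtime, now, seconds):
    """True if a file's mtime falls within the last `seconds` seconds."""
    return now - seconds < mtime
```

You can then unit test the window logic without touching the filesystem or S3.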

Saturday, April 12, 2014

Run a query on multiple schemas postgres

If you have one huge database with multiple schemata in it, one per customer (single database, multi-tenant model), then from time to time you might have to run a query on all schemata, for example to get a build number or a user count. You could write a shell script that reads the tenant list and builds SQL queries to execute dynamically, or you could use a PL/pgSQL function in Postgres. Below is a function that gets a count of users from every schema by querying the table tUser:

-- create a record for the returned data 
DROP FUNCTION IF EXISTS count_em_all ();
DROP TYPE IF EXISTS table_count;
CREATE TYPE table_count AS (table_name TEXT, num_rows INTEGER); 

CREATE OR REPLACE FUNCTION count_em_all () RETURNS SETOF table_count AS 
$$
DECLARE 
    the_count RECORD; 
    t_name RECORD; 
    r table_count%ROWTYPE; 
BEGIN
    -- list the schemas
    FOR t_name IN 
      select distinct schemaname
      from pg_catalog.pg_tables
      where tablename = 'tUser' --and schemaname not ilike 'something|_you|_liketofilter%' escape '|'
    LOOP
        -- Run your query on each schema
        FOR the_count IN EXECUTE 'SELECT COUNT(*) AS "count" FROM "'||t_name.schemaname||'"."tUser"'
        LOOP
            r.table_name := t_name.schemaname; 
            r.num_rows := the_count.count; 
            RETURN NEXT r; 
        END LOOP; 
    END LOOP; 
    RETURN;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM count_em_all();
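If you would rather drive this from application code instead of a PL/pgSQL function, the per-schema statements can be built the same way. A small Python sketch (the schema list and table name are assumptions for illustration; you would fetch the schemas from pg_tables yourself):

```python
def build_count_queries(schemas, table='tUser'):
    # One COUNT(*) query per tenant schema; double-quoting preserves
    # the mixed-case table name, as in the PL/pgSQL version.
    return ['SELECT COUNT(*) AS "count" FROM "%s"."%s"' % (s, table)
            for s in schemas]
```

Each generated statement can then be executed against the database one schema at a time.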

SQL Server: Setup Mirroring Step by Step using TSQL

It is 2014 and people should be using AlwaysOn instead of mirroring. But some clients are still on SQL Server 2008 R2, and I am about to set up mirroring for one such customer in a couple of days, so posting these steps here won't hurt. There is a handy GUI wizard that lets you create mirroring pretty easily, but if you have to mirror around 30 databases, using scripts is a lot easier.

Our setup is simple. SQL01 and SQL02 are our two servers - SQL01 is the principal and SQL02 is the mirror. We are not using a witness for this setup.

Steps on Mirror Server (SQL02)

1. On SQL01, take a full backup and also a transaction log backup of the database you would like to mirror. (So your database would have to be in FULL recovery mode, otherwise no mirroring for you!)

2. Restore database on SQL02
USE master;
RESTORE DATABASE [MirrorMe] FROM DISK = N'K:\Backups\MirrorMe_041214.bak' WITH FILE = 1, NORECOVERY;
RESTORE LOG [MirrorMe] FROM DISK = N'K:\Backups\MirrorMe_log.trn' WITH FILE = 1, NORECOVERY;

3. Create an endpoint for the mirror instance, SQL02. An endpoint is necessary for database mirroring/AlwaysOn to receive connections from other servers.

CREATE ENDPOINT [Mirroring]
 STATE = STARTED
 AS TCP (LISTENER_PORT = 7023, LISTENER_IP = ALL)
 FOR DATABASE_MIRRORING (ROLE = PARTNER);

4. Create a login to use for the mirroring setup and grant it connect on the endpoint.

USE [master]
--DOMAIN\svc_mirror below is just a placeholder; use the service account of the partner instance
CREATE LOGIN [DOMAIN\svc_mirror] FROM WINDOWS;
GRANT CONNECT ON ENDPOINT::[Mirroring] TO [DOMAIN\svc_mirror];

5. You might want to create and maintain the same set of logins/users on both the Principal and the Mirror server. When you do a failover, you can simply fix orphaned users and your application can keep talking to the database. The script below generates CREATE LOGIN statements, preserving SIDs and password hashes, which you can run on the mirror:

SELECT 'create login [' + p.name + '] ' + 
  case when p.type in('U','G') then 'from windows ' else '' end +
  'with ' +
  case when p.type = 'S' then 'password = ' + master.sys.fn_varbintohexstr(l.password_hash) + ' hashed, ' +
  'sid = ' + master.sys.fn_varbintohexstr(l.sid) + 
  ', check_expiration = ' + case when l.is_expiration_checked > 0 then 'ON, ' else 'OFF, ' end + 
  'check_policy = ' + case when l.is_policy_checked > 0 then 'ON, ' else 'OFF, ' end +
  case when l.credential_id > 0 then 'credential = ' + c.name + ', ' else '' end 
  else '' end +
  'default_database = ' + p.default_database_name +
  case when len(p.default_language_name) > 0 then ', default_language = ' + p.default_language_name else '' end
FROM sys.server_principals p 
LEFT JOIN sys.sql_logins l ON p.principal_id = l.principal_id 
LEFT JOIN sys.credentials c ON  l.credential_id = c.credential_id
WHERE p.type in('S','U','G')
  AND p.name <> 'sa'
  --AND p.name in ('myLogin1','myLogin2' );

6. Add the databases to mirroring. You'll notice that the endpoint is set to 7023, which is yet to be created on our Primary server, SQL01.

USE master;
--assuming that you have already restored these databases and at least one transaction log backup with NORECOVERY 
ALTER DATABASE [MirrorMe] SET PARTNER = 'TCP://SQL01:7023'; --use the FQDN if your servers are in a domain

Steps on Principal Server (SQL01)

1. Now switch to the Primary server, in this case SQL01, and create an endpoint

--you can check existing endpoints, if any using below query
SELECT name, role_desc, state_desc FROM sys.database_mirroring_endpoints;
SELECT name, port FROM sys.tcp_endpoints;
--create an endpoint
CREATE ENDPOINT [Mirroring]
 STATE = STARTED
 AS TCP (LISTENER_PORT = 7023, LISTENER_IP = ALL)
 FOR DATABASE_MIRRORING (ROLE = PARTNER);

2. Create the mirroring login and grant it connect on the endpoint, just as in step 4 on the mirror server.

USE [master]

3. Add databases to Mirroring

USE master;
ALTER DATABASE [MirrorMe] SET PARTNER = 'TCP://SQL02:7023';
Done, you should now see the databases in mirrored state.

Saturday, April 5, 2014

Shell script: Analyze schema (update statistics) - PostgreSQL

ANALYZE in PostgreSQL is like sp_updatestats in SQL Server. ANALYZE gathers statistics about the contents of tables (such as row counts) and stores them in the pg_statistic system catalog, which the query planner uses to create optimal query plans. Below is a shell script that is useful for updating statistics on one schema. If you are on a single database, multi-tenant model, this is pretty useful for updating the target tenant schema without analyzing the entire database.

Before executing this, you might want to add postgres user, password and host to your pgpass file, so that you can execute this shell script without being prompted for password.

vi /root/.pgpass 

#Add details to file hostname:port:database:username:password

#set file permissions
chmod 600 /root/.pgpass

Now the shell script: You can download script from here

#!/bin/bash
# Title       :
# Description :
#   Shell script to analyze (update statistics on) a single schema
# Options:
#   Schema name that requires a statistics update
# Requirements:
#   Installed psql and added path to binary in PATH system variable. 
# Examples:
#   $HOME/ "MySchemaName"

SCHEMANAME="$1"
DBUSER=dbadmin
DBNAME=MyDatabase
DBPORT=7432
ANALYZESQL="analyze_${SCHEMANAME}.sql"   #generated ANALYZE statements go here
ANALYZELOG="analyze_${SCHEMANAME}.log"   #psql output/errors go here

#make sure the schema exists before doing anything
test=$(psql -A -1 -t -U $DBUSER -d $DBNAME -p $DBPORT -c "select * from pg_namespace where nspname='$SCHEMANAME'")
if [[ "$test" == "" ]]; then
  echo "Schema \"$SCHEMANAME\" in db \"$DBNAME\" for dbuser \"$DBUSER\" and port \"$DBPORT\" does not exist"
  exit 1
fi

#SQL that generates one ANALYZE statement per table in the schema
GENANALYZESQL="
select 'ANALYZE VERBOSE \"'||current_database()||'\".\"'||t.schemaname||'\".\"'||t.tablename||'\"; ' as analyze_statement
from pg_tables t
where schemaname = '$SCHEMANAME'
order by t.tablename;"

echo "Started generating analyze script for schema \"$SCHEMANAME\" at $(date +'%Y-%m-%d %T')"
psql -A -q -1 -t -p $DBPORT -U $DBUSER -d $DBNAME -c "$GENANALYZESQL" > "$ANALYZESQL" 2>"$ANALYZELOG"
if ! [[ "$(cat $ANALYZELOG)" == "" ]]; then
  echo "Generation of analyze script for schema \"$SCHEMANAME\" in db \"$DBNAME\" for dbuser \"$DBUSER\" and port \"$DBPORT\" had error(s):"
  cat "$ANALYZELOG"
  exit 1
fi

echo "Analyzing schema \"$SCHEMANAME\" started at $(date +'%Y-%m-%d %T')"
psql -A -q -1 -t -p $DBPORT -U $DBUSER -d $DBNAME -f "$ANALYZESQL" >> "$ANALYZELOG" 2>&1
echo "Analyzing schema \"$SCHEMANAME\" ended at $(date +'%Y-%m-%d %T')"
echo "Analyzing schema sql script is in file $ANALYZESQL. Analyze log is below:"
cat "$ANALYZELOG"
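The generation step boils down to formatting one ANALYZE per table. As a cross-check, here is the same pattern sketched in Python (names are illustrative; in the shell script the table list comes from pg_tables):

```python
def analyze_statements(database, schema, tables):
    # One ANALYZE VERBOSE per table, fully qualified and double-quoted,
    # matching what the shell script's generated SQL produces.
    return ['ANALYZE VERBOSE "%s"."%s"."%s";' % (database, schema, t)
            for t in sorted(tables)]
```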

Tuesday, April 1, 2014

Delete files older than x days - Python, bash etc

Our postgres logs directory had logs from almost 6 months, and we only needed 5 days worth of data. It's pretty straightforward to remove stuff using bash:

find /db/logs/archive_data/* -type f -mtime +5 | xargs rm

But this sometimes fails with an "Argument list too long" error when the shell glob expands to too many files.

Here is another way in Python:

Example usage: save this as a script and run it with a directory and an age in days, e.g. python <script.py> /db/data 5
from sys import argv
import os, time

script, dir, age = argv
print "Searching directory %s for file older than %s day(s)" % (str(argv[1]), str(argv[2]))

#convert age to sec. 1 day = 24*60*60
age = int(age)*86400

for file in os.listdir(dir):
    now = time.time()
    filepath = os.path.join(dir, file)
    modified = os.stat(filepath).st_mtime
    if modified < now - age:
        if os.path.isfile(filepath):
            os.remove(filepath)
            print 'Deleted: %s (%s)' % (file, modified)

Sunday, March 30, 2014

SQL Server Database Mirroring: automating mirror failovers


You have a bunch of databases on a production server and you decided to go with SQL Server mirroring as a DR solution. Your IT dept, in its infinite wisdom, thought that it is a waste of resources to have a witness server, and you were asked to somehow make do with a Principal and a Mirror server. Your databases are critical, and though the option of automated failovers is ruled out due to the lack of a witness, you would like to put a process in place such that you are always battle ready to deal with a possible disaster and recover quickly. The solution should be designed in such a way that even the most junior of your DBAs can handle it.


For the sake of this explanation, let us assume my production server is YIN and my mirror server is YANG. So what happens when a disaster occurs? Assuming that you already have mirroring set up between YIN and YANG and it is functional, you might have to do the following after a disaster.

  • When YIN goes down, databases on YANG server will be in restoring/disconnected state. You'll bring all those databases online  by performing a forced failover with allow data loss.
  • When the failover is done, you might want to fix orphaned users
  • You now make necessary changes to whichever mechanism your application uses to connect to the databases so it points to the new server. In our case, we store database names and connection strings in a SQL Server database that sits on a well maintained cluster (let's call it MOTHERSHIP), so we update those connection strings with the IP address of YANG. 
  • We have a lot of business specific SQL Agent jobs running on our production servers apart from regular maint jobs. These jobs exist on both YIN and YANG, and are kept disabled on the mirror server. Now that we have had a disaster and failed over to YANG, we have to enable those jobs. 
  • Send a notification that failover of some sort occurred. 
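For reference, the forced-failover step in the first bullet is one ALTER DATABASE per mirrored database, and generating those statements is mechanical. A Python sketch for illustration (database names hypothetical; the post's own implementation is the T-SQL procedure further down):

```python
def forced_failover_sql(dbnames):
    # Forced failover with allowed data loss, to be run on the mirror (YANG)
    return ['ALTER DATABASE [%s] SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;' % d
            for d in dbnames]
```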
What Next?

Ok, you did all of that and you are back in business with minimal downtime. Your former Principal server was resurrected and is back online. YANG is currently the Principal, and you might want to add YIN back to the mirroring setup as the new mirror. After a couple of days, you are sure that YIN is stable again, and it should be the primary because it probably has better resources than YANG, so you do a soft (planned) failover. The following scenarios are therefore possible in a setup like this:
  • Planned Failover from Yin to Yang. Stop jobs on Yin, enable jobs on Yang, fix users, update connection strings
  • Planned Failover from  Yang to Yin etc.,
  • Forced Failover on Yin
  • Forced Failover on Yang
  • Resume Mirroring on Yin
  • Resume Mirroring on Yang
Enough of theory. Let's get our hands dirty. Creating batch scripts for each of these scenarios seemed like the easiest (if not the most elegant) solution to me. The actual logic that performs a failover can live in stored procedures, and the batch scripts can be stored on a server which is highly available. 

Create these stored procedures on both the Principal and the Mirror server. An example procedure is below.

USE [master]
GO
CREATE PROCEDURE [dbo].[usp_ForcedFailover]
AS
/* stored procedure for doing a forced failover of all databases to YANG
example usage: EXEC master.dbo.usp_ForcedFailover
*/
BEGIN
DECLARE @sql NVARCHAR(max)
 ,@maxid INT
 ,@i INT
 ,@dbname SYSNAME

DECLARE @MirrorDBs TABLE (
 [dbid] INT identity(1, 1) NOT NULL PRIMARY KEY
 ,dbname SYSNAME
 )

INSERT INTO @MirrorDBs (dbname)
SELECT DB_NAME(database_id)
FROM sys.database_mirroring
WHERE mirroring_role = 2 -- Mirror
 AND mirroring_state IN (0,1) -- Disconnected or Suspended
 AND mirroring_safety_level = 2

--select * from @MirrorDBs
SELECT @maxid = max([dbid])
 ,@i = 1
FROM @MirrorDBs;

WHILE @i <= @maxid
BEGIN
 SELECT @dbname = dbname
 FROM @MirrorDBs
 WHERE [dbid] = @i

 SET @sql = 'ALTER DATABASE ' + QUOTENAME(@dbname) + ' SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS;'

 PRINT @sql

 EXEC (@sql);

 SET @i = @i + 1
END
END


Next, create scripts for enabling/disabling jobs:
declare @sql nvarchar(max) = '';
select @sql += N'exec msdb.dbo.sp_update_job @job_name = ''' + name + N''', @enabled = 1;
' from msdb.dbo.sysjobs
where enabled = 0
order by name;

print @sql;
exec (@sql);
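The same generate-and-execute pattern works outside T-SQL too. For example, building the sp_update_job calls in Python (job names here are assumptions for illustration; on the server they come from msdb.dbo.sysjobs):

```python
def update_jobs_sql(job_names, enabled=1):
    # One sp_update_job call per SQL Agent job; enabled=0 disables, 1 enables
    return ''.join("exec msdb.dbo.sp_update_job @job_name = '%s', @enabled = %d;\n"
                   % (name, enabled) for name in job_names)
```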

Create scripts for fixing users

DECLARE @sql NVARCHAR(max)
 ,@maxid INT
 ,@i INT
 ,@dbname SYSNAME

DECLARE @mirrors TABLE (
 [dbid] INT identity(1, 1) NOT NULL PRIMARY KEY
 ,dbname SYSNAME
 )

INSERT INTO @mirrors (dbname)
SELECT DB_NAME(database_id)
FROM sys.database_mirroring
WHERE mirroring_role = 1 -- Principal 
 AND mirroring_state = 4 -- Synchronized
 AND mirroring_safety_level = 2

--select * from @mirrors
SELECT @maxid = max([dbid])
 ,@i = 1
FROM @mirrors;

WHILE @i <= @maxid
BEGIN
 SELECT @dbname = dbname
 FROM @mirrors
 WHERE [dbid] = @i

 SET @sql = 'USE ' + QUOTENAME(@dbname) + 
 ' EXEC sp_change_users_login ''Auto_Fix'', ''myUser1'';
EXEC sp_change_users_login ''Auto_Fix'', ''myUser2'';'

 PRINT @sql

 EXEC (@sql);

 SET @i = @i + 1
END
Update connection strings on MOTHERSHIP server
USE ConnectionDB
UPDATE COMPANY SET ConnectionString='Data Source=;User ID=fancyuser;Password=fancypass;Max Pool Size=200' 
WHERE ConnectionString='Data Source=;User ID=fancyuser;Password=fancypass;Max Pool Size=200';
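Swapping the server portion of those connection strings can also be scripted. A hedged sketch, assuming the connection strings are simple semicolon-separated key=value pairs as in the UPDATE above:

```python
def repoint(conn_str, new_source):
    # Replace the value of the Data Source key, leave every other pair alone
    out = []
    for p in conn_str.split(';'):
        if p.strip().lower().startswith('data source='):
            out.append('Data Source=' + new_source)
        else:
            out.append(p)
    return ';'.join(out)
```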
Finally, the batch file that brings all these together:
@echo off
goto :startbatch

This batch file is used for 
1. Performing a soft failover of YIN to YANG. YANG will become the primary
2. Fix Database users
3. Update connection strings in connectiondb database post failover
4. Disable jobs on YIN
5. Enable jobs on YANG
6. Send a notification email 


:startbatch

set source_server=
set mothership_server=
set mirror_server=
set serverlogin=fancyuser
set serverpass=fancypass
set connectiondb=connectiondb

set home_dir=d:\mirrorme\Failover_YIN_TO_YANG
set failover_script=%home_dir%\failover.sql
set failover_out=%home_dir%\failover_out.txt
set fixusers_script=%home_dir%\fixusers.sql
set fixusers_out=%home_dir%\fixusers_out.txt
set update_connectiondb=%home_dir%\update_connectiondb.sql
set update_connectiondb_out=%home_dir%\update_connectiondb_out.txt
set disablejobs_script=%home_dir%\disablejobs.sql
set disablejobs_out=%home_dir%\disablejobs_out.txt
set enablejobs_script=%home_dir%\enablejobs.sql
set enablejobs_out=%home_dir%\enablejobs_out.txt

REM Failover from YIN to YANG
sqlcmd -S tcp:%source_server%,1433 -d master -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %failover_script% -o %failover_out%

REM Run fix users script on YANG to deal with orphan users
sqlcmd -S tcp:%mirror_server%,1433 -d master -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %fixusers_script% -o %fixusers_out%

REM update connectiondb database with new connection string
sqlcmd -S tcp:%connectiondb_server%,1433 -d %connectiondb% -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %update_connectiondb% -o %update_connectiondb_out%

REM disable jobs on YIN because we failedover to YANG
sqlcmd -S tcp:%source_server%,1433 -d master -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %disablejobs_script% -o %disablejobs_out%

REM Enable jobs on YANG since it is the new primary
sqlcmd -S tcp:%mirror_server%,1433 -d master -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %enablejobs_script% -o %enablejobs_out%

REM send emails on completion
C:\work\tool\sendemail -t -cc -u "Mirroring failover of YIN to YANG" -m "Database Mirroring Failover of YIN to YANG has been triggered and completed. SQL Agent jobs on YIN were disabled and jobs on YANG were enabled. Check logs attached. " -s -f "" -a %failover_out% -a %fixusers_out% -a %update_connectiondb_out% -a %disablejobs_out% -a %enablejobs_out%

To do:

Update this post with scripts for all other possible failover scenarios.

Thursday, March 20, 2014

Batch Script: SQL Server Database Backup, Refresh, sanitize, run script and upload csv to sftp

One of the customers requested the following:

Back up a production database and restore it on a dev server; sanitize the refreshed database, then run a script and upload the results as a CSV file to an SFTP site. I know this can be done with SSIS, but I thought a batch job would be simpler to manage and troubleshoot.

First, create a directory for this job and add the following files to it. I will call mine D:\NirvanaInc. We are going to use the following tools to get this task done.

  • SQLCMD - To run backup/restore commands, sanitize scripts etc
  • PSFTP - To upload files to an SFTP site
  • sendemail - To send notification emails

Now that we are ready, let's start building the scripts. The idea is to create one batch file that calls various scripts to do the backup, restore, cleanup etc. This approach makes the job much easier to maintain and change in the future, if needed.
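The batch file is essentially "run these scripts in order and stop if one fails". If you ever outgrow batch, the same control flow is easy to sketch in Python (illustrative only; the post itself uses a .bat file):

```python
import subprocess, sys

def run_steps(steps):
    # Run each command in order; stop at the first step that fails.
    for i, cmd in enumerate(steps):
        rc = subprocess.call(cmd)
        if rc != 0:
            return i  # index of the step that failed
    return None  # everything succeeded
```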


This is a simple backup script:

USE master;

DECLARE @startdate VARCHAR(100);

SET @startdate = 'backup start: ' + convert(VARCHAR(32), getdate(), 121);

PRINT @startdate;

BACKUP DATABASE [NirvanaProd] TO DISK = N'R:\Backups\NirvanaProd.bak'
WITH INIT
 ,NAME = N'NirvanaProd-Full Database Backup'
 ,STATS = 10;

DECLARE @backupSetId AS INT;

SELECT @backupSetId = position
FROM msdb.dbo.backupset
WHERE database_name = N'NirvanaProd'
 AND backup_set_id = (
  SELECT max(backup_set_id)
  FROM msdb..backupset
  WHERE database_name = N'NirvanaProd'
  );

IF @backupSetId IS NULL
 RAISERROR (N'Verify failed. Backup information for database ''NirvanaProd'' not found.', 16, 1);

RESTORE VERIFYONLY
FROM DISK = N'R:\Backups\NirvanaProd.bak'
WITH FILE = @backupSetId;

DECLARE @enddate VARCHAR(100);

SET @enddate = 'backup end: ' + convert(VARCHAR(32), getdate(), 121);

PRINT @enddate;


-- restore backup on target server, SQL:
USE master;

-- wait for start time
DECLARE @now DATETIME
 ,@targettime DATETIME;

-- targettime is set to 02:00:00 AM, If the backup is done sooner, it will wait till 2 AM to kick off restore
SET @targettime = substring(CONVERT(VARCHAR(39), GETDATE(), 121), 1, 10) + ' 02:00:00';
SET @now = GETDATE();

IF (@targettime > @now)
 WAITFOR TIME @targettime;

DECLARE @startdate VARCHAR(100);
SET @startdate = 'db restore start: ' + convert(VARCHAR(32), getdate(), 121);
PRINT @startdate;

-- drop connections if the dev copy already exists
IF EXISTS (
  SELECT name
  FROM sys.databases
  WHERE NAME = 'NirvanaDev'
  )
 ALTER DATABASE [NirvanaDev] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

RESTORE DATABASE [NirvanaDev]
FROM DISK = N'R:\Remotebackups\Nirvanaprod.bak'
WITH REPLACE
 ,MOVE N'FinancialSoftwareSystems_Data' TO N'D:\SQLData\NirvanaDev.mdf'
 ,MOVE N'FinancialSoftwareSystems_log' TO N'D:\SQLData\NirvanaDev.ldf'
 ,STATS = 10;


DECLARE @enddate VARCHAR(100);
SET @enddate = 'db restore end: ' + convert(VARCHAR(32), getdate(), 121);
PRINT @enddate;

I will leave out the sanitize and grant-permissions scripts because they are specific to your environment; build yours accordingly. Next up is running a report script on the target database and uploading the output as CSV to the SFTP site. I chose PSFTP because of its ease of use.


echo lcd D:\NirvanaInc > psftp.bat
echo lpwd >> psftp.bat
echo put D:\NirvanaInc\sql_results.csv >> psftp.bat
echo quit >> psftp.bat

c:\sysutil\psftp -pw myFancyPwd -b psftp.bat -v

By now, you should have the following files on your D:\NirvanaInc folder:

Finally, the main batch file:

@echo off
goto :startbatch

This batch file is used for 
1. Taking a backup of NirvanaProd database
2. Copying backup file to a Dev db server
3. Restore database on Dev Environment
4. Sanitizing Database
5. Creating user and granting permission
6. Run a script
7. Upload results to sftp


:startbatch

REM set variables

set source_server=
set target_server=
set source_db=NirvanaProd
set target_db=NirvanaDev

set home_dir=D:\NirvanaInc\NirvanaProd
set backup_script=%home_dir%\backup.sql
set backup_out=%home_dir%\backup_out.txt
set copy_out=%home_dir%\copy_out.txt
set restore_script=%home_dir%\restore.sql
set restore_out=%home_dir%\restore_out.txt
set sanitize_script=%home_dir%\sanitize.sql
set sanitize_out=%home_dir%\sanitize_out.txt
set grant_script=%home_dir%\grant_script.sql
set grant_out=%home_dir%\grant_out.txt
set run_script=%home_dir%\run.sql
set run_script_out=%home_dir%\sql_results.csv

set serverlogin=mylogin
set serverpass=fancypassword123

REM backup your production database
sqlcmd -S tcp:%source_server%,1433 -d master -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %backup_script% -o %backup_out%

REM copy backup from source server to target server, log is in %copy_out%
echo copy of backup file started at: %date% %time% > %copy_out%
copy /y \\%source_server%\R$\Backups\%source_db%.bak \\%target_server%\R$\RemoteBackups 1>>%copy_out% 2>&1
echo copying of the backup file completed at: %date% %time% >> %copy_out%

REM Restore backup on the target server
sqlcmd -S tcp:%target_server%,1433 -d master -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %restore_script% -o %restore_out%

REM Sanitize database and create user, grant permissions
sqlcmd -S tcp:%target_server%,1433 -d %target_db% -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %sanitize_script% -o %sanitize_out%
sqlcmd -S tcp:%target_server%,1433 -d %target_db% -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %grant_script% -o %grant_out%

REM Run sql script
sqlcmd -S tcp:%target_server%,1433 -d %target_db% -U %serverlogin% -P %serverpass% -s "," -W -h -1 -i %run_script% -o %run_script_out%

REM Upload run_sql.csv results to SFTP
call %home_dir%\run_psftp.bat 

REM send emails on completion
C:\sysutil\sendemail -t -cc -u "%target_db% refresh" -m "%source_db% backup from server %source_server% is restored on %target_server% as db %target_db% . The script ran successfully and the output was uploaded to SFTP" -s -f "" -a %backup_out% -a %restore_out% -a %sanitize_out% -a %grant_out% 

have fun! :)

Saturday, March 8, 2014

Postgres en_US - UTF 8 Collation on Windows

I was restoring a schema today on a stage server and ran into a collation error. The source of the schema backup was PostgreSQL 9.1 on CentOS, and the destination was PostgreSQL 9.1 on Windows Server 2008 R2.

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 2125939; 1259 50334466 INDEX ix_rd_vendorlist_facts_defaultsort_en_us
pg_restore: [archiver (db)] could not execute query: ERROR: collation "public.en_US" for encoding "UTF8" does not exist

After reading a related stackoverflow thread I realized the en_US locale is Linux specific and the Windows equivalent is 'English_United States.1252'. I created these collations on the template database and on the user database I was trying to restore this schema into, and the restore finally worked.

CREATE COLLATION pg_catalog."en_US" (
 LC_COLLATE = 'English_United States.1252'
 ,LC_CTYPE = 'English_United States.1252'
);

CREATE COLLATION pg_catalog."fr_CA" (
 LC_COLLATE = 'French_Canada.1252'
 ,LC_CTYPE = 'French_Canada.1252'
);

CREATE COLLATION pg_catalog."pt_BR" (
 LC_COLLATE = 'Portuguese_Brazil.1252'
 ,LC_CTYPE = 'Portuguese_Brazil.1252'
);

-- add public collations
CREATE COLLATION public."en_US" (
 LC_COLLATE = 'English_United States.1252'
 ,LC_CTYPE = 'English_United States.1252'
);

CREATE COLLATION public."fr_CA" (
 LC_COLLATE = 'French_Canada.1252'
 ,LC_CTYPE = 'French_Canada.1252'
);

CREATE COLLATION public."pt_BR" (
 LC_COLLATE = 'Portuguese_Brazil.1252'
 ,LC_CTYPE = 'Portuguese_Brazil.1252'
);
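Since the DDL repeats per locale, you could also generate it. A small sketch using the Linux-to-Windows locale pairs from the statements above (the mapping itself is the only data it knows about):

```python
WINDOWS_LOCALE = {
    'en_US': 'English_United States.1252',
    'fr_CA': 'French_Canada.1252',
    'pt_BR': 'Portuguese_Brazil.1252',
}

def create_collation_sql(name, schema='pg_catalog'):
    # Emit one CREATE COLLATION for a linux-style locale name
    loc = WINDOWS_LOCALE[name]
    return ('CREATE COLLATION %s."%s" (LC_COLLATE = \'%s\', LC_CTYPE = \'%s\');'
            % (schema, name, loc, loc))
```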

Tuesday, October 1, 2013

Setup: Python/Flask dev environment on windows using virtualenv and pip

Chocolatey is a sudo apt-get for the Windows crowd, and it's awesome! Chocolatey makes it incredibly easy to set up multiple dev environments for Python/Flask/Django, which is otherwise a multi-step process (installing setuptools, pip, virtualenv etc). Here are the steps.

Install Chocolatey

Launch a powershell prompt using "Run As Admin" and set execution policy to unrestricted.

Set-ExecutionPolicy Unrestricted

And run the following command in the same window to install Chocolatey

iex ((new-object net.webclient).DownloadString(''))

Once it's installed, you'll have cinst added to your path, and it is the equivalent of apt-get.

Install Python

If you don't already have Python installed, run the following in a powershell/cmd

cinst Python

This will install Python 2.7 at C:\Python27. We need a few more things to set up a proper dev environment. Instead of installing the flask package directly into the main Python installation, it can be kept separate using virtualenv. If you are developing one app on Django and another using Flask and would like to keep these environments separate, virtualenv is the way to go.

Install Pip

cinst pip

pip install virtualenvwrapper
pip install virtualenvwrapper-powershell

This will add virtualenv.exe to your system path, and you can create multiple Python dev environments with it. Let's call our flask app FlaskApp. Create a directory for this app on the D drive, then create a virtualenv in that folder.

virtualenv FlaskApp

This will create a Python instance in the FlaskApp folder. Now activate the virtual environment we just created: cd to FlaskApp\Scripts and run activate.

cd FlaskApp\Scripts\
#your command line should change to (FlaskApp) D:\FlaskApp\Scripts. 
pip install flask
# To come out of (FlaskApp), simply type 'deactivate'

Now you are all set. We created a virtual environment called FlaskApp and installed the flask package. Let's see if it works. Create a sample app: create a file with the code below.

# the first line of the file should be a shebang pointing at the virtualenv's
# python.exe, so the script runs with our virtualenv Python, not the system-wide one
from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello World!'

if __name__ == '__main__':
    app.run()


It should run the web server on http://127.0.0.1:5000/.

Thursday, July 25, 2013

Virtual Lab: Multi node Hadoop cluster on windows (HortonWorks, HDInsight)

Microsoft recently announced HDInsight, built on top of the HortonWorks Data Platform (a distribution of Hadoop similar to Cloudera's). There is a developer preview of HDInsight available as a free download from Microsoft, which is very easy to set up. However, the developer preview is a single box solution, and it is not possible to simulate a real world deployment with multi node clusters. You can create a 4 node hadoop cluster on Azure with a few clicks, but it is prohibitively costly (and the cluster will be shut down after free tier usage, if your account is a trial one). This is a step by step guide to setting up a multi node cluster for free on your laptop using the HortonWorks Data Platform. HDInsight is built on top of the Hortonworks solution, so there won't be many differences apart from some usability features and nice looking clients. So let's get started with the setup:

What will we build?

We will build a VirtualBox base VM with all the necessary software installed, sysprep it, and spin linked clones from the base for better use of available resources. We will then build a domain controller and add 4 VMs to the domain. You can achieve a similar setup using HyperV too, using differential disks. We should end up with something like this:


Name of the VM | Role | RAM (MB)
HDDC | Domain Controller and Database host for Metastore | 1024
HDMaster | Name node, Secondary Name node, Hive, Oozie, Templeton | 2048
HDSlave01 | Slave/Data Node | 2048
HDSlave02 | Slave/Data Node | 2048
HDSlave03 | Slave/Data Node | 1024

In prod environments, HIVE, OOZIE and TEMPLETON usually run on separate servers, but for the sake of simplicity I chose to run all of them on a single VM. If you don't have enough resources to run 3 data/slave nodes, you can go with 2. Usually 2 slave nodes and 1 master are enough to get a feel for a real Hadoop cluster.

System Requirements:

You need a decent laptop/desktop with more than 12 GB of RAM and a decent processor like a Core i5. If you have less RAM, you might want to run a bare-minimum 3 node cluster (1 master and 2 slaves) and adjust the available RAM accordingly. For this demo, check the way I assigned RAM in the table above. You can get away with assigning less memory to the data nodes.

Software Requirements:

Windows Server 2008 R2 or Windows Server 2012 Evaluation edition, 64 bit
Microsoft .NET Framework 4.0
Microsoft Visual C++  2010 redistributable, 64 bit
Java JDK 6u31 or higher
Python 2.7
Microsoft SQL Server 2008 R2 (if you would like to use it for Metastore)
Microsoft SQL Server JDBC Driver
VirtualBox for virtualization (Or HyperV, if you prefer)
HortonWorks DataPlatform for Windows

Building a Base VM:

This guide assumes that you are using VirtualBox for virtualization. The steps are similar even if you are using HyperV, with minor differences like creating differencing disks before creating VM clones. Hadoop cluster setup involves installing a bunch of software and setting environment variables on each server. Creating a base VM and spinning linked clones from it makes this much easier, saves a lot of time, and lets you fit more VMs within your available disk space than standalone VMs would.
1. Download and Install VirtualBox
2. Create a new VM to act as the base VM. I called mine MOTHERSHIP. Select 25 GB of initial space, allowing dynamic expansion. We will do a lot of installations on this machine, so to make them go quickly, assign the maximum amount of RAM temporarily. At the end of the setup, we can adjust the RAM back down.
3. Enable two network adapters on your VM. Adapter1 connected to NAT (if you care about internet access on VMs). Adapter2 for Internal only Domain Network.


4. From Setting > Storage > Controller IDE , mount Windows Server 2008 R2 installation media iso.
5. Start the VM and boot from the iso. Perform Windows Installation. This step is straight forward.
6. Once the installation is done, set password to P@$$w0rd! and login.
7. Install guest additions for better graphics, shared folders etc.
8. You will move a lot of file from your Host machine to the VM, so configuring a shared folder will help.


Install necessary software on BaseVM:

1. Login to the base VM you built above and install Microsoft .NET Framework 4.0
2. Install Microsoft 2010 VC ++ redistributable package.
3. Install Java JDK 6u31 or higher. Make sure that you select a destination directory that doesn’t have any spaces in the address during installation. For example, I installed mine on C:\Java\jdk1.6.0_31
4. Add an environment variable for Java. If you installed Java to C:\Java\jdk1.6.0_31, add that path as JAVA_HOME. Open a Powershell window as an admin and execute the below command:

[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Java\jdk1.6.0_31", "Machine")

5. Install Python. Add Python to Path. Use the command below in a powershell shell launched as Administrator

$env:Path += ";C:\Python27"

6. Confirm that the environment variables have been set properly by running %JAVA_HOME% from Run, it should open the directory where JDK was installed. Also, run Python from Run and it should open Python shell.

7. We are almost done with the Base VM. We can now generalize this image using sysprep. Go to Run, type sysprep and press Enter. It should open the \system32\sysprep directory. Run sysprep.exe as an Admin. Select 'System Out-of-Box Experience', check Generalize and select Shutdown.


Create Linked Clones:

1. Go to VirtualBox, Right click on the Base VM you created above and select clone.


2. Give the VM an appropriate name. In my case, it's HDDC for the domain controller. Make sure you select Re-Initialize MAC Addresses of all network cards.
3. In the Clone Type Dialog, select Linked Clone, click Clone.


4. Repeat steps 1-3 to create the other VMs that will be part of your cluster. You should end up with HDDC, HDMaster and slave nodes - HDSlave01, HDSlave02, HDSlave03 (depending on your available resources).
5. Since we created these clones from the same base, they inherit all settings, including memory. You might want to go to the properties of each VM and assign memory as you see fit.

Configuring Domain on HDDC

We will not go into the details of setting up Active Directory. Google is your friend. But below are some steps specific to our setup.

1. Start HDDC VM and login.
2. VirtualBox generates odd computer names, even though your VM name is HDDC. Change the computer name to HDDC and reboot.
3. Go to the network settings and you should see two network adapters enabled. Usually 'Network Adapter 2' is your domain-only internal network. To confirm, open the VM's settings in VirtualBox while it is running, go to the Network tab and temporarily simulate a cable disconnection by unchecking Cable Connected. The adapter attached to NAT is for internet access and the one created as Internal is for the domain network.


4. Go to the properties of the Domain Network adapter inside the VM and set the static IP configuration for this node (no default gateway; Preferred DNS points at the domain controller).


5. Enable the Active Directory Domain Services role and reboot. Use dcpromo.exe to create a new domain. I chose my domain name as TESLA.COM, named after the greatest scientist who ever lived. If you need help configuring an AD domain, check this page. After configuration and a reboot, log on to the Domain Controller as TESLA\Administrator and create an Active Directory account and make it a Domain Admin; this will make things easier for you for the rest of the setup.
6. Optionally, if you are planning on using this VM as the metastore for the Hadoop cluster, you can install SQL Server 2008 R2. We will not go into the details of that setup; it is just a straightforward installation. Make sure you create a SQL login which has sysadmin privileges. My login is hadoop. I have also created a couple of blank databases on it, namely hivedb and ooziedb.

Network Configuration and Joining other VMs to Domain

1. Log on to HDMaster as an Administrator
2. Find the Domain network Adapter (refer to step #3 in Configuring Domain section). Make sure your HDDC (Domain controller) VM is running.
3. Change the adapter settings: IP -, subnet mask -, Default Gateway - leave it blank, Preferred DNS -
4. Right click on My Computer > Settings > Computer name, domain, workgroup settings > Settings. Change the computer name to HDMaster and enter the domain name TESLA.COM. Enter the Domain Admin login and password when prompted and you'll be greeted with a welcome message. Reboot.
5. Logon to each of the remaining VMs and repeat steps 1 to 4 - the only difference is the IP address. Refer to the list of IPs under the 'What will we build' section of this article. Preferred DNS remains the same for all the VMs (the domain controller's address).
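To keep the per-VM addressing consistent, it can help to drive the adapter settings from a single host-to-IP table. The sketch below is illustrative only: the IP values and the adapter name "Domain Network" are placeholders of mine, standing in for whatever you chose in the 'What will we build' section. The generated netsh lines would be run on each node in an elevated prompt:

```python
# Hypothetical addressing plan for the internal domain network; the
# actual IPs are whatever you listed in the 'What will we build' section.
CLUSTER = {
    "HDDC":      "192.168.10.2",
    "HDMaster":  "192.168.10.3",
    "HDSlave01": "192.168.10.4",
    "HDSlave02": "192.168.10.5",
    "HDSlave03": "192.168.10.6",
}
DNS_SERVER = CLUSTER["HDDC"]  # every node points its DNS at the domain controller

def static_ip_cmds(host, iface="Domain Network"):
    """netsh commands to assign a static IP (no default gateway) plus DNS."""
    ip = CLUSTER[host]
    return [
        'netsh interface ip set address name="%s" static %s 255.255.255.0' % (iface, ip),
        'netsh interface ip set dns name="%s" static %s' % (iface, DNS_SERVER),
    ]

for cmd in static_ip_cmds("HDMaster"):
    print(cmd)
```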

Preparing The environment for Hadoop Cluster

We now have a domain and all the VMs added to it. We also installed all the necessary software on the base VM, so all the VMs created from it inherit it. It is now time to prepare the environment for the Hadoop installation. Before we begin, make sure ALL VMs have the same date and time settings.

Disable IPV6 on ALL VMs:

Log on to each VM, open each network adapter and uncheck IPV6 protocol from adapter settings.

Adding firewall exceptions:

Hadoop uses multiple ports for communication between clients and other services, so we need to create firewall exceptions on all the servers involved in the cluster. Execute the commands below on each VM of the cluster at a PowerShell prompt launched as Administrator. If you want to make it simpler, you can just turn off the firewall at the domain level altogether on all VMs, but we don't normally do that in production.

netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50470
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50070
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=8020-9000
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50075
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50475
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50010
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50020
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50030
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=8021
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50060
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=51111
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=10000
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=9083
netsh advfirewall firewall add rule name=AllowRPCCommunication dir=in action=allow protocol=TCP localport=50111
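Rather than pasting each rule by hand, the command list above can be generated from the port list. Here is a small Python sketch; the port list is copied from the commands above, and the rule name mirrors the original (netsh allows duplicate rule names, which is why AllowRPCCommunication is reused throughout):

```python
# HDP 1.x TCP ports used in the rules above (ranges are allowed).
HDP_TCP_PORTS = [
    "50470", "50070", "8020-9000", "50075", "50475", "50010", "50020",
    "50030", "8021", "50060", "51111", "10000", "9083", "50111",
]

def netsh_rule(port, name="AllowRPCCommunication"):
    """Build one 'netsh advfirewall' command line for a TCP port or range."""
    return ("netsh advfirewall firewall add rule name=%s "
            "dir=in action=allow protocol=TCP localport=%s" % (name, port))

for cmd in (netsh_rule(p) for p in HDP_TCP_PORTS):
    print(cmd)
```

The printed lines can be saved to a .bat file or fed to your remoting tool of choice for each node.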

Networking Configurations and Edit Group Policies:

1. Logon to your Domain Controller VM, HDDC with a Domain Administrator account (ex: TESLA\admin)
2. Go to Start and type Group Policy Management in the search bar. Open Group Policy Management.
3. If you don’t see your Forest TESLA.COM already, add a Forest and provide your domain name
4. Expand Forest –> Domain –> Name of your Domain –> right click on default domain policy –> edit


5. Group Policy Management Editor (let's call it GPM Editor) will open up. Note that ALL the steps below are done through this editor.

6. In GPM Editor, navigate to  Computer Configuration -> Policies -> Windows Settings -> Security Settings ->System Services -> Windows Remote Management (WS-Management). Change the startup mode to Automatic

7. Go to Computer Configuration -> Policies -> Windows Settings -> Security Settings -> Windows Firewall with Advanced Security. Right click on it and create a new inbound firewall rule. For Rule Type, choose Predefined and select Windows Remote Management from the dropdown. It will automatically create 2 rules; let them be. Click Next and check Allow The Connection.

8. This step enables the Remote Execution Policy for PowerShell. Go to Computer Configuration -> Policies -> Administrative Templates -> Windows Components -> Windows PowerShell. Enable 'Turn on Script Execution'. You'll see an Execution Policy dropdown under Options; select Allow All Scripts.

9. Go to Computer Configuration -> Policies -> Administrative Templates -> Windows Components -> Windows Remote Management (WinRM) -> WinRM Service. Enable 'Allow Automatic Configuration of Listeners'. Under Options, enter * (asterisk) for the IPv4 filter. Also change 'Allow CredSSP Authentication' to Enabled.

10. We configured the WinRM Service so far; now it's the WinRM Client's turn. Go to Computer Configuration -> Policies -> Administrative Templates -> Windows Components -> Windows Remote Management (WinRM) -> WinRM Client. Set Trusted Hosts to Enabled and under Options, set TrustedHostsList to * (asterisk). In the same dialog, look for Allow CredSSP Authentication and set it to Enabled.

11. Go to Computer Configuration -> Policies -> Administrative Templates -> System -> Credentials Delegation. Set Allow Delegating Fresh Credentials to Enabled. Under Options, click on Show next to Add Servers To The List and add a WSMAN entry set to *, like so: WSMAN/*.TESLA.COM (replace TESLA.COM with your domain).


12. Repeat the instructions in step 11 above, this time for the 'Allow Delegating Fresh Credentials with NTLM-only Server Authentication' property.

13. Already tired? We are almost done; just a couple more steps before we launch the installation. Go to Run, type ADSIEdit.msc and press Enter. Expand the OU=Domain Controllers menu item and select CN=HDDC (the controller hostname). Go to Properties -> Security -> Advanced -> Add. Enter NETWORK SERVICE, click Check Names, then OK. In the Permission Entry, select Validated write to service principal name. Click Allow and OK to save your changes.

14. On the Domain Controller VM, launch a PowerShell window as an administrator and run the following: Restart-Service WinRM

15. Force a Group Policy update on the other VMs in the environment by running this command: gpupdate /force

Define Cluster Configuration File

When you download HortonWorks Data Platform for Windows, the zip contains a sample clusterproperties.txt file which can be used as a template during installation. The installation MSI depends on the clusterproperties file for the cluster layout definition. Below is my cluster configuration file; make changes to yours accordingly.

#Log directory
HDP_LOG_DIR=c:\hadoop\logs
#Data directory
HDP_DATA_DIR=c:\hdp\data
#Hosts,
#Database host
DB_FLAVOR=mssql
#Hive properties
HIVE_DB_NAME=hivedb
HIVE_DB_USERNAME=sa
HIVE_DB_PASSWORD=P@$$word!
#Oozie properties
OOZIE_DB_NAME=ooziedb
OOZIE_DB_USERNAME=sa
OOZIE_DB_PASSWORD=P@$$word!
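Because a malformed clusterproperties.txt is a common source of failed installs, a quick sanity check before copying it to the nodes can save a reinstall. This is a sketch of mine, not part of the HDP tooling; the parser and the sample keys are illustrative:

```python
def parse_cluster_properties(text):
    """Parse a clusterproperties.txt body into a dict.

    Lines starting with '#' are comments; everything else is KEY=VALUE.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

sample = """\
#Log directory
HDP_LOG_DIR=c:\\hadoop\\logs
#Database host
DB_FLAVOR=mssql
#Hive properties
HIVE_DB_NAME=hivedb
"""
props = parse_cluster_properties(sample)
print(props["DB_FLAVOR"])  # mssql

# A simple check that the keys the installer needs are present:
required = ["HDP_LOG_DIR", "DB_FLAVOR", "HIVE_DB_NAME"]
missing = [k for k in required if k not in props]
assert not missing, "missing keys: %s" % missing
```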


Install HDP on the Cluster

1. Keep all of your VMs running and logon to HDMaster with a domain admin account.
2. Create a folder C:\setup and copy the contents of HortonWorks Data Platform installation media you downloaded at the beginning of this guide.
3. Replace the clusterproperties.txt with the one you created under Define Cluster Config File section. This file is same for ALL the participant servers in the cluster.
4. open a command prompt as administrator and cd to C:\setup.
5. Type the command below to kick off the installation. Beware of unnecessary spaces in the command; it took me a long time to figure out that a stray space was the issue. Edit the folder path according to your setup. There is no space between HDP_LAYOUT and '='.

msiexec /i "c:\setup\hdp-1.1.0-GA.winpkg.msi" /lv "hdp.log" HDP_LAYOUT="C:\setup\clusterProperties.txt" HDP_DIR="C:\hdp\Hadoop" DESTROY_DATA="no"
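If you plan to kick off the install on several nodes, building the msiexec invocation as an argument list sidesteps exactly the stray-space problem mentioned above. A sketch in Python (the paths are the ones from this walkthrough; the helper function is mine):

```python
def hdp_install_args(msi_path, layout_path, hdp_dir, destroy_data="no"):
    """Build the msiexec argument list for an unattended HDP install.

    Passing a list to subprocess (instead of one long command string)
    avoids the quoting and stray-space pitfalls described above.
    """
    return [
        "msiexec", "/i", msi_path, "/lv", "hdp.log",
        "HDP_LAYOUT=%s" % layout_path,
        "HDP_DIR=%s" % hdp_dir,
        "DESTROY_DATA=%s" % destroy_data,
    ]

args = hdp_install_args(r"c:\setup\hdp-1.1.0-GA.winpkg.msi",
                        r"C:\setup\clusterProperties.txt",
                        r"C:\hdp\Hadoop")
print(args)
# On a cluster node you would then run: subprocess.check_call(args)
```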

6. If everything goes well, you should see a success message.


7. Log on to each of the other VMs and repeat steps 2 to 6.

Congratulations, you have now successfully created your first Hadoop cluster. You'll see shortcuts like the ones below on your desktop on each VM. Before starting your Hadoop exploration, we need to start services on both local and remote machines. Log on to the HDMaster server, go to the Hadoop installation directory and look for the \hadoop folder. You'll see a couple of batch files to start/stop services.


In the next blog post, we will write a sample MapReduce job using Pig and one using Python. Have fun!