Importing MongoDB Data Using SSIS 2012

I have embarked on a little quest to learn other database platforms (especially NoSQL), as more and more of our clients at Pragmatic Works have them in their enterprise and want to be able to import data from them into their SQL Server data warehouses using SQL Server Integration Services (SSIS). While I found several articles that showed how to do this, they were outdated due to changes in the MongoDB C# driver. After quite a bit of effort figuring out how to get this working, I thought I’d pass along my hard-fought knowledge.

First, I assume you are familiar with MongoDB (http://www.mongodb.org/) and SQL Server (https://www.microsoft.com/en-us/sqlserver/default.aspx). In my examples I am using SSIS 2012 and MongoDB 2.4.8, along with version 1.7 of the MongoDB C# driver, available at http://docs.mongodb.org/ecosystem/drivers/csharp/.

To begin, download and install the C# driver. This next step is important: starting with version 1.5 of the driver, the DLLs are no longer installed in the GAC (Global Assembly Cache) automatically. They must be in the GAC, however, for SSIS to be able to use them.

By default, my drivers were installed to C:\Program Files (x86)\MongoDB\CSharpDriver 1.7. You’ll want to open a CMD window in Administrator mode and navigate to this folder. Next you’ll need GACUTIL; on my computer I found the most recent version at:

C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\x64\

A simple trick to find yours: since you are already in the CMD window, just move to the C:\Program Files (x86) folder and do a “dir /s gacutil.exe”. It will list all occurrences of the program; just use the one with the most recent date. Register the DLLs by entering these commands:

"C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\x64\gacutil" /i MongoDB.Bson.dll

"C:\Program Files (x86)\Microsoft SDKs\Windows\v8.1A\bin\NETFX 4.5.1 Tools\x64\gacutil" /i MongoDB.Driver.dll

Note the quote marks around the path are important; they let the CMD window correctly separate the gacutil program from its parameters.

Once that is done, create a new SQL Server Integration Services project in SQL Server Data Tools (SSDT), which is what used to be called BIDS in SQL Server 2008 R2 and earlier. Put a Data Flow Task on the Control Flow design surface, then open the Data Flow Task for editing.

Next, drag and drop a Script Component transformation onto the Data Flow design surface. When prompted, change the component type to Source.

Now edit the script transform by double-clicking it. Move to the Inputs and Outputs page. For my test, I am using the dbo.DimCurrency collection I created using the technique documented in my previous post, Exporting Data from SQL Server to CSV Files for Import to MongoDB Using PowerShell (https://arcanecode.com/2014/01/13/exporting-data-from-sql-server-to-csv-files-for-import-to-mongodb-using-powershell/).

I renamed the output from “output” to “MongoDB_DimCurrency”. I then added four columns, CurrencyName, CurrencyAlternateKey, CurrencyKey, and ID.

Make sure to set CurrencyName, CurrencyAlternateKey, and ID to the “Unicode string [DT_WSTR]” data type, then change CurrencyKey to “four-byte signed integer [DT_I4]”.

Now return to the Script page and click Edit Script. In the Solution Explorer pane, expand References, right-click, and pick Add Reference. Go to Browse and navigate to the folder where the MongoDB C# drivers are installed; on my system it was C:\Program Files (x86)\MongoDB\CSharpDriver 1.7\. Add both MongoDB.Driver.dll and MongoDB.Bson.dll.

Click OK when done; both MongoDB assemblies should now appear under References in your Solution Explorer.

Now in the script, expand the Namespaces region and add these lines:

using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Bson.Serialization;

Now scroll down to the CreateNewOutputRows() procedure. Here is a sample of the code I used:

public override void CreateNewOutputRows()
{
  // Connection info for the MongoDB server and the database holding the collection
  string connectionString = "mongodb://localhost";
  string databaseName = "AdventureWorksDW2014";

  // MongoClient is the connection entry point in driver 1.5 and later
  var client = new MongoClient(connectionString);
  var server = client.GetServer();
  var database = server.GetDatabase(databaseName);
  string CurrencyKey = "";

  // Walk every document in the collection, adding a row to the output buffer for each
  foreach (BsonDocument document in database.GetCollection<BsonDocument>("dbo.DimCurrency").FindAll())
  {
    MongoDBDimCurrencyBuffer.AddRow();
    MongoDBDimCurrencyBuffer.CurrencyName = document["CurrencyName"] == null ? "" : document["CurrencyName"].ToString();
    MongoDBDimCurrencyBuffer.CurrencyAlternateKey = document["CurrencyAlternateKey"] == null ? "" : document["CurrencyAlternateKey"].ToString();

    // CurrencyKey is an integer column in the output, so convert its string value
    CurrencyKey = document["CurrencyKey"] == null ? "" : document["CurrencyKey"].ToString();
    MongoDBDimCurrencyBuffer.CurrencyKey = Convert.ToInt32(CurrencyKey);
    MongoDBDimCurrencyBuffer.ID = document["_id"] == null ? "" : document["_id"].ToString();
  }
}

I start by defining a connection string to the MongoDB server, followed by the database name. I then create a MongoClient object. Note that MongoClient is the new way of connecting to the MongoDB server; in earlier versions of the C# driver, you used MongoServer objects directly.
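
To make the difference concrete, here is a minimal sketch of the two connection patterns, assuming the 1.x C# driver; only the MongoClient form appears in my package.

// Newer pattern (driver 1.5 and later), used in the script above
var client = new MongoClient("mongodb://localhost");
var server = client.GetServer();
var database = server.GetDatabase("AdventureWorksDW2014");

// Older pattern you may still see in pre-1.5 examples
// var oldServer = MongoServer.Create("mongodb://localhost");
// var oldDatabase = oldServer.GetDatabase("AdventureWorksDW2014");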

I then cycle through each document in the collection “dbo.DimCurrency” using the FindAll() method. For each item I use the AddRow() method to add a row to the buffer. To find the proper name for the buffer, I went to Solution Explorer and expanded the BufferWrapper.cs file; this is a class the script transform generates, named after the output buffer.
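
As an aside, if you only need a subset of the collection rather than every document, the 1.x driver also supports server-side filters through the Query builders in MongoDB.Driver.Builders. The snippet below is just a hedged sketch of that approach (the field and value in the filter are made up for illustration); it is not something the package above requires.

// Add to the Namespaces region: using MongoDB.Driver.Builders;
var collection = database.GetCollection<BsonDocument>("dbo.DimCurrency");

// Pull back only the documents that match a condition, instead of FindAll()
var query = Query.EQ("CurrencyAlternateKey", "USD");
foreach (BsonDocument document in collection.Find(query))
{
  // Map columns to the output buffer exactly as in CreateNewOutputRows() above
}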

For each column in my output, I map a column from the document. Note the use of the ternary operator (? :) to strip out nulls and replace them with empty strings. String columns can be mapped directly from the document object to the output buffer’s columns.

The CurrencyKey column, being an integer, had to be converted from a string to an integer. To keep it simple I created a string variable to hold the value returned from the document, then used the Convert class to convert it to an Int32.
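
If a field could be missing from a document entirely, or stored as an explicit BSON null, a slightly more defensive version of that conversion looks something like the sketch below. It assumes the 1.x BsonDocument API (Contains and IsBsonNull) and would slot into the loop above; the “0” default is just an illustration, not part of my original code.

// Guard against a missing field or a BSON null before converting to an integer
string rawCurrencyKey = "0";
if (document.Contains("CurrencyKey") && !document["CurrencyKey"].IsBsonNull)
{
  rawCurrencyKey = document["CurrencyKey"].ToString();
}
MongoDBDimCurrencyBuffer.CurrencyKey = Convert.ToInt32(rawCurrencyKey);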

Once you’ve done all the above, validate the code by building it. If it all checks out, save your work, close the code window, then close the Script Transformation Editor by clicking OK.

Now place a destination of some kind on the Data Flow. Since I have my company’s Task Factory tools I used a TF Terminator Destination, but you could also use a Row Count destination. On the data flow path between the two components, right-click and pick Enable Data Viewer. Execute the package; if all goes well, the data viewer should pop up showing the rows read from MongoDB.

A few final notes. This test was done using a MongoDB document schema that was flat, i.e. it didn’t have any documents embedded in the documents I was testing with. (Hopefully I’ll be able to test that scenario soon, but it will be the subject of a future post.) Second, the key was registering the DLLs in the GAC; until I did that, I couldn’t get the package to execute. Finally, by using the newer API for the MongoDB objects I’ve helped ensure compatibility going forward.

Exporting Data from SQL Server to CSV Files for Import to MongoDB Using PowerShell

I’ve been exploring other database systems in order to determine how to import data from them using SQL Server Integration Services (SSIS). My first step, though, was to create some test data. I wanted something familiar, so I decided to export the Adventure Works data warehouse sample database and import it into MongoDB. While I had many options, I decided the simplest way was to first export the data to CSV files, then use the MongoDB utility mongoimport. Naturally I turned to PowerShell to create an automated, reusable process.

First, if you need the Adventure Works DW database, you’ll find it at http://msftdbprodsamples.codeplex.com/. Second, I did my export from a special version of Adventure Works DW I created called AdventureWorksDW2014. This is optional, but if you want to have a version of Adventure Works DW updated with current dates, see my post at https://arcanecode.com/2013/12/08/updating-adventureworksdw2012-for-2014/. Third, I assume you are familiar with MongoDB, but if you want to learn more go to http://www.mongodb.org/.

Below is the PowerShell 3 script I created. The script is broken into four regions. The first, User Settings, contains the variables that you, the user, might need to change to get the script to run: things like the name of the SQL Server database, the path to MongoDB, and so on.

The second region, Common, establishes variables that are used by the remaining two regions; you shouldn’t need to alter these. The third region accesses SQL Server and exports each table in the selected database to a CSV file.

The final region, “Generate MongoDB import commands”, creates a batch (.BAT) file containing all the commands needed to run mongoimport for each CSV file. I decided not to have the PowerShell script execute the .BAT file so it can be reviewed before it is run; there might be some tables you don’t want to import, for example.

It is also quite easy to adapt this script to just create CSV files from SQL Server without the MongoDB piece. Simply remove the fourth and final region, then in the Common and User Settings regions remove any variables that begin with the prefix “mongo”.

As the comments do a good job of explaining what happens, I’ll let you review the included documentation for step-by-step instructions.

#==================================================================================================
# SQLtoCSVtoMongoDb.ps1
# Robert C. Cain | @ArcaneCode | http://arcanecode.com
#
# If you need a simple way to export data from SQL Server to MongoDb, here is one way to do it.
# The script starts by setting up some variables to the server environment (see the User Settings
# region)
#
# Next, it exports data from each table in the selected database to individual CSV files.
# Finally, it generates a batch file which executes mongoimport for each csv file to import
# into MongoDb.
#
# I broke this into four regions so if all that is desired is a simple export of data to CSVs,
# you can simply omit the final region along with any variables that begin with "mongo".
#
# While I could have gone ahead and run the batch file at the end, I chose not to in order to
# give you time to review the output prior to running the batch file.
#==================================================================================================

Clear-Host

#region User Settings

  # In this section, set the variables so they are appropriate for your project / environment
 
  # This is the spot where you want to store the generated CSVs.
  # Make sure it does NOT end in a \
  $csvPath = "C:\mongodb"

  # If you are running this on a computer other than the server, set the name of the server below
  $sqlServer = $env:COMPUTERNAME

  # If you have a named instance, be sure to replace "default" below with the name of the instance (keep the leading backslash)
  $sqlInstance = "\default"

  # Enter the name of the database to export below
  $sqlDatabaseName = "AdventureWorksDW2014"

  # The settings below only apply to the MongoDB code generation
  # Assemble path to mongodb. This assumes the utilities are stored in the default bin folder
  $mongoPath = "C:\mongodb"
  $mongoImport = "$mongoPath\bin\mongoimport"

  # Set the server name and port
  $mongoHost = "localhost"   # Leave blank to default to localhost
  $mongoPort = ""            # Leave blank to default to 27107
 
  # Set the user name and password, leave blank if it isn’t needed
  $mongoUser = ""
  $mongoPW = ""

  # Enter the name of the database to import to.
  $mongoDatabaseName = "AdventureWorksDW2014"

  # Upserts are REALLY slow, especially on large datasets. Setting this to $true will turn off
  # the upsert option. If set to $true, you are responsible for either deleting all documents
  # in the collection beforehand, or accepting the risk of duplicates.
  #
  # Setting to false will enable the upsert option for mongoimport, and attempt to determine the
  # keys and (if found) add them to the final mongoimport command.
  $mongoNoUpsert = $true

#endregion

#region Common ----------------------------------------------------------------------------
 
  # This section sets variables used by both regions below. There is no need to alter anything
  # in this region.

  # Import the SQLPS provider (if it’s not already loaded)
  if (-not (Get-PSProvider SqlServer -ErrorAction SilentlyContinue))
    { Import-Module SQLPS -DisableNameChecking }

  # Assemble the full servername \ instance ($sqlInstance already includes the leading backslash)
  $sqlServerInstance = "$sqlServer$sqlInstance"

  # Assemble the full path for the SQL Provider to get to the database
  $sqlDatabaseLocation = "SQLSERVER:\sql\$sqlServerInstance\databases\$sqlDatabaseName"

  # Now tack on the Tables 'folder' to the SQL Provider path, then move there
  $sqlTablesLocation = $sqlDatabaseLocation + "\Tables"
  Set-Location $sqlTablesLocation

  # Get a list of tables in this database
  $sqlTables = Get-ChildItem

#endregion

#region Export SQL Data -----------------------------------------------------------------
  # In this section we will export data from each table in the database to a CSV file.
  # WARNING: If the CSV file exists, it will be overwritten.

  # These are just used to display informational messages during processing
  $sqlTableIterator = 0
  $sqlTableCount = $sqlTables.Count

  # Iterate over each table in the database
  foreach($sqlTable in $sqlTables)
  {
    $sqlTableName = $sqlTable.Schema + "." + $sqlTable.Name   

    # I’ll grant you the next little bit of formatting for the progress messages is a bit
    # OCD on my part, but I like my output formatted and easy to read.
    $sqlTableIterator++
    $padCount = " " * (1 + $sqlTableCount.ToString().Length - $sqlTableIterator.ToString().Length)
    $sqlTableIteratorFormatted = $padCount + $sqlTableIterator

    if( $sqlTableName.Length -gt 50 )
      { $padTable = " " }
    else
      { $padTable = " " * (50 – $sqlTableName.Length) }

    Write-Host -ForegroundColor White -NoNewline "Processing Table $sqlTableIteratorFormatted of $sqlTableCount : $sqlTableName $padTable"
   
    # If the instance is "default", we have to exclude it when we use Invoke-SqlCmd
    if($sqlInstance.ToLower() -eq "\default")
      { $sqlSI = $sqlServer }
    else
      { $sqlSI = $sqlServerInstance }

    # Load an object with all the data in the table
    # Note if you have especially large tables you may need to modify this
    # section to break things into smaller chunks.
    $sqlCmd = "SELECT * FROM " + $sqlTableName
    $sqlData = Invoke-Sqlcmd -Query $sqlCmd `
                             -ServerInstance $sqlSI `
                             -SuppressProviderContextWarning

    # Now write the data out.
    # Note utf8 encoding is important, as it is all mongoimport understands
    # Also need to omit the Type Info header PowerShell wants to write out
    Write-Host -ForegroundColor Yellow "    Writing to table $sqlTableName.csv"
    $sqlData | Export-Csv -NoTypeInformation -Encoding "utf8" -Path "$csvPath\$sqlTableName.csv"

  }

  # Just add a blank line after the processing ends
  Write-Host

#endregion

#region Generate MongoDB import commands ------------------------------------------------

  # In this region we will generate the commands to import our newly exported data
  # into an existing database in MongoDB. This is an example of our desired output (wrapped
  # onto multiple lines for readability, in the output it will be a single line):

  #  C:\mongodb>bin\mongoimport --host localhost --port 27017
  #                             --db AdventureWorksDW2014 --collection DimSalesReason
  #                             --username Me --password mySuperSecureP@ssW0rd!
  #                             --type csv --headerline --file DimSalesReason.csv
  #                             --upsert --upsertFields SalesReasonKey

  # Note several of these parameters are optional, and could use defaults, or be potentially
  # omitted from the final output, based on the choices at the very beginning of this script

  # Feel free to alter the $mongoCommand as needed for other circumstances

  # Final warning, the database must already exist in MongoDb in order to import the data. This
  # script will not generate the database for you.

  # Create the name for the batch file we will generate
  $mongoBat = $csvPath + "\Import_SQL_" + $sqlDatabaseName + "_to_MongoDb_" + $mongoDatabaseName + ".bat"

  # See if file exists, if so delete it
  if (Test-Path $mongoBat)
    { Remove-Item $mongoBat }

  # These are just used to display informational messages during processing
  $sqlTableIterator = 0
  $sqlTableCount = $sqlTables.Count

  # mongoimport allows us to do upserts, helping to eliminate duplicate rows on import.
  #
  # To make an upsert work there has to be a key column to match up on. Fortunately,
  # most tables in the SQL Server world have Primary Keys, so we can find out what
  # columns those are and add it to the command. Note if there is no PK in SQL Server,
  # no upsert will be attempted.
  #
  # Note though that upserts are REALLY slow, so the option to skip them is
  # built into the script and set at the top (mongoNoUpsert). The generated batch file
  # assumes that either a) you have deleted all data from the collection ahead of time,
  # or b) you are OK with the risk of duplicate data.

  # Iterate over each table in the database to build the mongoimport command
  foreach($sqlTable in $sqlTables)
  {
    $sqlTableName = $sqlTable.Schema + "." + $sqlTable.Name

    # A bit more OCD progress messages
    $sqlTableIterator++
    $padCount = " " * (1 + $sqlTableCount.ToString().Length - $sqlTableIterator.ToString().Length)
    $sqlTableIteratorFormatted = $padCount + $sqlTableIterator
    Write-Host -ForegroundColor Green "Building mongoimport command for table $sqlTableIteratorFormatted of $sqlTableCount : $sqlTableName"

    # Begin building the command
    $mongoCommand = "$mongoImport "
   
    if ($mongoHost.Length -ne 0)
      { $mongoCommand += "--host $mongoHost " }

    if ($mongoPort.Length -ne 0)
      { $mongoCommand += "--port $mongoPort " }

    $mongoCommand += "--db $mongoDatabaseName --collection $sqlTableName "

    if ($mongoUser.Length -ne 0)
      { $mongoCommand += " --username $mongoUser --password $mongoPW " }

    $mongoCommand += " --type csv --headerline --file $csvPath\$sqlTableName.csv "
       
    # Build the upsert clause, if the user has elected to use it.
    if ($mongoNoUpsert -eq $false)
    {
      $mongoPKs = ""
      foreach($sqlIndex in $sqlTable.Indexes)
      {
        if($sqlIndex.IndexKeyType -eq 'DriPrimaryKey')
        {
          foreach($sqlCol in $sqlIndex.IndexedColumns)
          {
            if ($mongoPKs.Length -ne 0)
              { $mongoPKs += "," }
            # Note column names are returned with [ ] around them, and must be removed
            # Have to use -replace instead of .Replace() because $sqlCol is a column object, not a string
            $mongoPKs += ($sqlCol -replace "\[", "") -replace "\]", ""
          }
               
          $mongoCommand += " --upsert --upsertFields $mongoPKs"
        }           
      }
    }

    # Append the command to the batch file
    $mongoCommand | Out-File -FilePath $mongoBat -Encoding utf8 -Append

  }

  # Just add a blank line after the processing ends
  Write-Host

#endregion