Archive for the ‘Uncategorized’ Category
BCP (bulk copy program) is a great way to transfer huge data (millions of rows) to an Azure database. I found that it is very efficient in terms of ease of use and speedy with which the data could be transferred. The bcp utility is a command line utility that ships with Microsoft SQL Server. It bulk copies data between SQL Azure (or SQL Server) and a data file in a user-specified format. The bcp utility that ships with SQL Server 2008 R2 is fully supported by SQL Azure.
Following are the steps to use this utility to export the data to a flat file and then exporting the data to the Azure database:
- Copy data from customer table into a temp table
- select * into [temp_employee] from employee
- Copy data to local file from a table in a local database.
- BCP [server_name].[database_name].[schema_name].[temp_employee] OUT c:\temp\employee.txt -T –c
- Create the temp table in the destination database in the cloud
- Copy data into the azure database.
- BCP [azure_database_name].[schema_name].[employee] IN c:\temp\employee.txt -S servername.database.windows.net; -U <username>@<servername> -P <password>
Following is a sample of the Bulk copy output form the command window for reference. The most interesting aspect to observer is that data transfer rate stood at 5206 rows per second for this particular attempt. I believe that higher transfers could be a possibility depending on the available network bandwidth.
Exporting Data out of SQL Azure
Data can be retrieved in a a huge set from the in SQL Azure by running the following statement at the Windows command prompt:
bcp [database_name].[schema_name].[table_name] out C:\temp\[file_name].txt -c -U username@servername -S tcp:servername.database.windows.net -P password
In this blog I will briefly write about the files and the filegroups of a SQL Server database, followed by describing the most popular data storage technology called RAID – Redundant Array of Independent Disks, along with covering some internals of this technology and its applicability. Having a understanding of the various database files, RAID, RAID levels will be crucial in building a SQL Server that can scale bigger, restore faster and query quicker.
SQL Server Files and Filegroups
SQL Server database stores data and transactions in separate files. Data associated with tables, indexes, and large object data is stored in data files. The transactions are sequentially stored in a log file. The Data and transactions of a databases can be stored in three types of files.
- Primary data files – Every database has one primary data file. Recommended file name extension is .mdf. This is merely a best practice and your company can have a policy for a unique extension.
- Secondary data files – Secondary data files make up all the data files, other than the primary data file. This file will exist only when it has been defined in the database setup. Recommended file name extension is .ndf.
- Log files – Log files contain the information that is required to recover all transactions in the database. Log files are sequential. There must be at least one log file for each database, although there can be more than one. Recommended file name extension is “.ldf”. The log files are truncated automatically when using the SIMPLE recovery model, but not when using BULK LOGGED or FULL recovery
The most likely part to fail in a computer or a storage device, because it probably is the only moving part. A hard drive typically is spinning anywhere from 5400 to 15000 rotations per minute (RPM) and stands of chance for failure. RAID is the technology that is the solution to the above problem. RAID stands for Redundant Array of Independent Disks, is a redundant system with multiple hard drive that is used to store your data. Basically it allow you to turn multiple physical hard drives into one logical hard drive. Depending on the type of configuration employed in designing the RAID, failure of one or more drives will not impact your data. This will act as a critical system that can save your data from loss and should be considered for data storage. There are many RAID levels , however three basic types of RAID levels are widely used by companies and they are RAID 0, RAID 1 and RAID 5. RAID 0 is called as disk striping, RAID 1 is called disk mirroring and RAID 5 is called disk striping with parity.
RAID 0 – Disk Striping
Disk striping in RAID 0 combines physical drives into one large logical drive with the maximum capacity being the sum of individual hard disk sizes. For example if you have 2 disks with individual capacity of 1TB in your RAID, you would effectively have 2 TB of usable disk space to store data. Data is distributed across all the drives and where data stripes are distributes evenly across all the physical disks. The size of the data stripe can vary by the design determined by your DBA or the one demanded by your application. Let us assume that you have a stripe size of 16 bits, when a data block of size 64 bits is written on a RAID 0 with 4 disks, bits 1 – 16 are written on disk 1, bits 16 – 32 are written on disk 2, bits 17 – 48 are written on disk 3 and bits 49 – 64 are written on disk 4. Basically this raid level is striping your data across all the disks evenly, thus making writing and reading you data faster. However one big concern with this design is that if a hard disk were to fail, you will not be able to recover the data because you do not have a redundancy to reconstruct all the data.
RAID 1 – Disk Mirroring
Disk mirroring creates identical copies of your data. There would only be two drives involved in this type of RAID. Disk mirroring would make an identical image of the hard drive. In other words the data on one drive is automatically written on to the other hard drive. The primary advantage of this type configuration is in a scenario when one of the hard drive fails, your server is still up and running because you have a complete backup all the data on another hard disk in the RAID. At this point while the sever is still running, you can swap the failed disk with a another hard disk and let the RAID rebuild the mirrored disk. RAID 1 is mainly for liability by introducing redundancy of data. Keep in mind that only 50 % of the total disk space can be used for data storage. For example if you have 2 disks with individual capacity of 1TB in your RAID, you would effectively have 1 TB of usable disk space to store data.
RAID 5 – Disk Striping with Parity
Disk striping with parity can be achieved with atleast 3 hard drives for its simplest configuration. In this case, the stripe 1 will be written to dive 1, stripe 2 will be written to drive 2 and the parity for the data written on drive 1 would be written to the drive 3, followed by the parity for the data written on drive 2 written on drive 1 followed by stripe 4 on drive 2 and stripe 5 on drive 3. It is hard to visualize the last sentence, but you have to understand that this allows for a failure of one hard drive. The parity introduced in this type of RAID will be used to rebuild the failed drive, because the other two drive contain all the data. Another advantage it offers is that you have more disk space when compared with RAID 1 and not as much as RAID 0 though. For example if you have 3 disks with individual capacity of 1TB in your RAID, you would effectively have 2 TB of usable disk space to store data and if you were to have 4 disks on your RAID with 1 TB capacity on each of then, the total usable disk space would be 3 TB . Basically you will be using the the size equivalent of one hard disk for redundancy.
RAID Level 2, 3, 4, 6, 7, 8, 9, 10, ……
All the other RAID levels are a combination of a basic levels 0, 1, 5. For example RAID 10 is a combination of RAID 1 and RAID 0, where your data is mirrored in addition to being striped across the disks. RAID 6 is a combination of RAID 5 and RAID 1, where your data is striped with parity along with it being mirrored for additional redundancy.
Redundant array of independent disks is a system that allows for configuring hard disks to store data efficiently and also provide fault tolerance. Raid level 0 can ideally be used for fast read and write capability with the ability to combine several smaller disks into one large logical drive. Caution should be exercised in maintaining a backup scheme for your data as there is no redundancy provided by this level. RAID level 1 is purely a play on providing redundancy for your data. RAID level 5 provides improved performance with writing and reading data in addition to providing some redundancy. RAID 1 and 5 allows for hot swappable disks, meaning to say, the failed drives can be replaced and rebuild while the server is still running. RAID is the right tool for business because it delivers flexibility and scalability so that you can create and manage enterprise data storage systems in the shortest possible time.
How can you design an archiving solution on a large table without deleting any data and also achieve improved performance on CRUD operations (on the same table)? The assumption in this scenario is that the most recent data would be accessed more often that the older data.
Archiving solutions can be of two types; 1) passive and 2) active. A passive solution is one where the historic data is archived in another table on another database, making the data unavailable. An active solution is one where the historic data is archived and will still be available for access without much of an impact on the application performance. A large table typically contains millions of rows and probable has a size that runs into several gigabytes. Just the size of the table makes it very expensive to perform CRUD operations and difficult to maintain indexes.
Table Partitioning in SQL Server 2005 and up lets a table data and indexes to be stored in several smaller partitions. This feature is very useful in providing a way to easily maintain and perform database operations on that table. Each partition will be stored in a different file which can be part of a filegroup. Data is distributed evenly between the files in a filegroup. Allowed columns can be used as a partition key which is the criteria for partitioning. You will be able to define the boundaries of a partition by defining the limits of the partition key. This division is based on how you access the rows of the table. In addition to identifying the partition key, we also will want to include that key as part of the index and also partition the index. Partitioning the index is called as index alignment. This was when the index portion will be stored along with the data rows stored in the partition.
Dividing a table into several files gives us the flexibility of storing those files on separate drives. We can store the files that contain data from the recent years on faster drives as opposed to storing the older data on slower drives. Going with assumption in the problem statement that the most recent data is accessed more often that the older data, we will have improved the performance on this table because we will have faster response times thanks to the faster drives.
Note: All data types that are valid for use as index columns can be used as a partitioning column, except timestamp. The ntext, text, image, xml, varchar(max), nvarchar(max), or varbinary(max) data types cannot be specified. 
Restricting the size of a file upload is an important validation that needs to be performed by an online application so avoid the risk of filling up the server disk space by a malicious intent. ASP.Net provides an upload control that only provides a server side validation of file size. By the time the control validates the size of the uploaded file, the physical file would have been already been copied to the server, which is tool late in avoiding the issue.
Client side technologies comes to the rescue in this scenario, where the validation of an attachment size can be implemented using a browser run-time like Flash, Silverlight, ActiveX, HTML5 etc. This way, if attempts were made to upload files with unsupported sizes, the run-time plug-in can thwart the attempt without any impact on your web server. Following are two free tools that can be employed for this purpose:
- SWFUpload is a flash based tool.
- PLUpload is a versatile plugin that can support multiple run-times. This plugin slices a large file in small chunks and will send them out one by one to the server. You can then safely collect them on the server and combine into the original file. The size of the chunks and the acceptable file formats can be defined in the plugin UI definition.
We have implemented PLUpload with good success. This plugin also support multiple file uploads. Visit the plugin homepage to see the other rich features that are supported. The online forum is a treasure trove, where you can find the various implementations, code snippets and will be able to participate in contributing to the community.
Use DBCC DROPCLEANBUFFERS to test queries with a cold buffer cache without shutting down and restarting the server. To drop clean buffers from the buffer pool, first use CHECKPOINT to produce a cold buffer cache. This forces all dirty pages for the current database to be written to disk and cleans the buffers. After you do this, you can issue DBCC DROPCLEANBUFFERS command to remove all buffers from the buffer pool. 
CHECKPOINT; GO DBCC DROPCLEANBUFFERS; GO
It is recommended these commands should not be executed in a production environment where the SQL Server caching helps in gaining performance. Running these commands could adversely impact the server performance.
1. DBCC DROPCLEANBUFFERS (Transact-SQL)
Here is a useful tidbit. We can use one of the following two methods to perform a case sensitive string/field comparison when a SQL Server database does not support case sensitivity.
declare @a as nvarchar(20) = 'Test' declare @b as nvarchar(20) = 'TEST' --Method1: if(convert(varbinary,@a) = convert(varbinary,@b)) select 'identical' else select 'non-identical' --Method 2: if(@a = @b COLLATE Latin1_General_CS_AS) select 'identical' else select 'non-identical'