
Data Stage

DataStage Interview Questions and Answers, Solutions and Explanations - Part 3

Does selecting 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the database, or does it use some kind of DELETE logic?
Ans:
There is no TRUNCATE on ODBC stages; the 'Clear the table' option issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions: TRUNCATE requires ALTER TABLE privileges, whereas DELETE does not.

Describe a situation from your last project where you faced a problem, and how you solved it.
Ans:
A. Jobs that read data directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the Transformer to make the jobs run faster.
B. A job aborted in the middle of loading some 500,000 rows. The options were either to clean/delete the already-loaded data and rerun the fixed job, or to restart the job from the row at which it aborted. To make sure the load was proper, we opted for the former.

Why do we have to load the dimension tables first, then the fact tables?
Ans:
As we load the dimension tables, their primary keys are generated, and these keys become the foreign keys in the fact tables.

How will you determine the sequence of jobs to load into data warehouse?
Ans:
First we execute the jobs that load the data into Dimension tables, then Fact tables, then load the Aggregator tables (if any).

What are the command line functions that import and export the DS jobs?
Ans:
A. dsimport.exe - imports DataStage components.
B. dsexport.exe - exports DataStage components.

What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Ans:
Use the crontab utility together with a script that invokes the dsjob command-line utility, passing the proper parameters.
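A hedged sketch of such a crontab entry; the DataStage install path, project name, and job name are illustrative, and dsjob options such as -param name=value can be added to pass job parameters:

    # Run the nightly load sequence at 01:30 every day
    30 1 * * * /opt/Ascential/DataStage/DSEngine/bin/dsjob -run -jobstatus MyProject NightlyLoadSeq >> /tmp/NightlyLoadSeq.cron.log 2>&1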

How would you call an external Java function that is not supported by DataStage?
Ans:
Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. We can also use the command line to invoke the Java program, write its return values (if any) to a file, and use that file as a source in a DataStage job.

What will you do in a situation where somebody wants to send you a file, and that file has to be used as an input or reference before running the job?
Ans:
A. Under Windows: use the 'Wait For File' activity in a job sequence and then run the job. You can also schedule the sequence around the time the file is expected to arrive.
B. Under UNIX: poll for the file; once the file has arrived, start the job or sequence that depends on it.

Describe the string functions in DS.
Ans:
Functions like [] (the substring operator) and ':' (the concatenation operator).
Substring syntax: string [ [ start, ] length ]
Field extraction syntax: string [ delimiter, instance, repeats ]
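A short DataStage BASIC illustration of both operators (the values are made up for the example):

    * Substring: first 4 characters of the string
    Code = "WH01,EMEA,2024"
    Site = Code[1, 4]            ; * -> "WH01"
    * Field extraction: 2nd comma-delimited field, 1 field wide
    Region = Code[",", 2, 1]     ; * -> "EMEA"
    * Concatenation with the ':' operator
    Label = Site : "-" : Region  ; * -> "WH01-EMEA"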

How did you connect with DB2 in your last project?
Ans:
Most of the time the data was sent to us in the form of flat files; the data was dumped and delivered to us. In the cases where we needed to connect to DB2 for lookups, we used ODBC drivers to connect to DB2 (or DB2 UDB), depending on the situation and driver availability. Certainly the DB2 UDB native drivers are better in terms of performance, since native drivers are always faster than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' is the ODBC driver used to connect to AS/400 DB2.

What are Sequencers?
Ans:
Sequencers are job control programs that execute other jobs with preset Job parameters.
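For illustration, equivalent job control written by hand in DataStage BASIC might look like the sketch below (the job name and parameter are hypothetical):

    * Attach the job, set a parameter, run it, and wait for it to finish
    hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
    ErrCode = DSSetParam(hJob, "ProcessDate", "2005-01-31")
    ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
    ErrCode = DSWaitForJob(hJob)
    Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
    ErrCode = DSDetachJob(hJob)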

Differentiate between a Primary Key and a Partition Key.
Ans:
A primary key is unique and not null; it can be a collection of columns, called a composite primary key. A partition key determines how rows are distributed across partitions and is typically just a part of the primary key.

How did you handle an 'Aborted' sequencer?
Ans:
In almost all cases we had to manually delete the data it had inserted into the database, fix the job, and then run the job again.
 

DataStage Interview Questions and Answers, Solutions and Explanations - Part 2

What are conformed dimensions?
Ans:
A conformed dimension is a single, coherent view of the same piece of data throughout the organization. The same dimension is used in all subsequent star schemas defined. This enables reporting across the complete data warehouse in a simple format.

Why is the fact table in normal form?
Ans:
Basically the fact table consists of the index keys of the dimension/lookup tables and the measures. Whenever a table holds only such keys and measures, that itself implies that the table is in normal form.

What is a linked cube?
Ans:
A cube can be stored on a single analysis server and then defined as a linked cube on other analysis servers. End users connected to any of these analysis servers can then access the cube. This arrangement avoids the more costly alternative of storing and maintaining copies of a cube on multiple analysis servers. Linked cubes can be connected using TCP/IP or HTTP. To end users, a linked cube looks like a regular cube.

What is a degenerate dimension table?
Ans:
Dimension values that are stored in the fact table itself are called degenerate dimensions. These dimensions do not have dimension tables of their own.

What is ODS?
Ans:
ODS stands for Operational Data Store: a database that holds current, integrated operational data with little history, typically used as an interim store between the source systems and the data warehouse.

What is a general purpose scheduling tool?
Ans:
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target, either at a specific time or based on some condition.

What is the need for a surrogate key? Why is the primary key not used as the surrogate key?
Ans:
A surrogate key is an artificial identifier for an entity. Surrogate key values are generated by the system sequentially (like the Identity property in SQL Server or a Sequence in Oracle) and do not describe anything. A primary key is a natural identifier for an entity; its values come from the data entered by the user and must uniquely identify each row, with no repetition.

What is Hash file stage and what is it used for?
Ans:
Used for lookups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.

What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Ans:
Use the crontab utility together with a script that invokes the dsjob command-line utility, passing the proper parameters.

What is OCI, and how is it used with the ETL tool?
Ans:
OCI stands for Oracle Call Interface, the native Oracle API that the DataStage Oracle (OCI) stages use to connect to Oracle; being a native interface, it is generally faster than ODBC. For bulk-loading large volumes, the Orabulk stage can be used instead; it prepares data for loading with Oracle's bulk loader (SQL*Loader).

What are OConv () and Iconv () functions and where are they used?
Ans:
IConv() - converts a string to an internal storage format.
OConv() - converts an expression to an output format.
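For example (the date value is illustrative; these are standard date conversion codes, where the internal format is a day number):

    * Convert an external date string into the internal day number
    IntDate = Iconv("2005-01-31", "D-YMD[4,2,2]")
    * Convert the internal value back into a display format (MM/DD/YYYY)
    ExtDate = Oconv(IntDate, "D4/")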

What is Fact table?
Ans:
The fact table contains the measurements, metrics, or facts of a business process. If your business process is "Sales", then a measurement of this business process, such as "monthly sales number", is captured in the fact table. The fact table also contains the foreign keys to the dimension tables.

What are the steps in building the data model?
Ans:
While the ER model lists and defines the constructs required to build a data model, there is no standard process for doing so. Some methodologies, such as IDEF1X, specify a bottom-up approach.

What is Dimensional Modelling?
Ans:
Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

What type of Indexing mechanism do we need to use for a typical datawarehouse?
Ans:
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes.

What is Normalization, First Normal Form, Second Normal Form , Third Normal Form?
Ans:
Normalization can be defined as splitting a table into multiple tables so as to avoid duplication of values. First Normal Form removes repeating groups so that every column holds atomic values. Second Normal Form additionally removes columns that depend on only part of a composite key. Third Normal Form additionally removes columns that depend on other non-key columns (transitive dependencies).

Is it correct/feasible to develop a Data Mart using an ODS?
Ans:
Yes, it is feasible to develop a data mart from an ODS, because an ODS stores transactional data with only a few days of history (little historical data), which is what a data mart typically requires.

What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?
Ans:
1. Staged the data coming from ODBC/OCI/DB2UDB stages, or any database on the server, using Hash/Sequential files, both for optimum performance and for data recovery in case a job aborts.
2. Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in Administrator for better performance.
4. Used sorted data for Aggregator.
5. Sorted the data as much as possible in the database and reduced the use of DS Sort, for better job performance.
6. Removed unused data from the source as early as possible in the job.
7. Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries
8. Converted some of the complex joins/business logic in DS to stored procedures in the database for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run the jobs in parallel.
10. Before writing a routine or a transform, make sure that the functionality required is not already available in one of the standard routines supplied in the sdk or ds utilities categories.
11. Constraints are generally CPU-intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code the overhead will be minimal. Try to have the constraints in the 'Selection' criteria of the job itself; this eliminates unnecessary records before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of DBMS.
14. Try not to use a sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

DataStage Interview Questions and Answers, Solutions and Explanations - Part 1

How did you handle reject data?
Ans:
Typically a reject link is defined and the rejected data is loaded back into the data warehouse. A reject link has to be defined for every output link from which you wish to collect rejected data. Rejected data is typically bad data, such as duplicate primary keys or null rows where data is expected.

If you have worked with DS 6.0 and later versions, what are Link Partitioner and Link Collector used for?
Ans: Link Partitioner - Used for partitioning the data.
Link Collector - Used for collecting the partitioned data.

What are Routines and where/how are they written and have you written any routines before?
Ans:
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit. The following are different types of routines:
1) Transform functions
2) Before-after job subroutines
3) Job Control routines

What are OConv () and Iconv () functions and where are they used?
Ans:
IConv() - Converts a string to an internal storage format
OConv() - Converts an expression to an output format.

How did you connect to DB2 in your last project?
Ans:
Using DB2 ODBC drivers.

Explain METASTAGE?
Ans:
MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling; these data definitions are stored in the repository and can be accessed with MetaStage.

Do you know about INTEGRITY/QUALITY stage?
Ans:
QualityStage can be integrated with DataStage. QualityStage provides stages such as Investigate, Match, and Survive (survivorship) for data-quality work; to call them from DataStage you need the QualityStage plug-in.

Explain the differences between Oracle8i/9i?
Ans:
Oracle 8i does not support the pseudo-column SYSDATE, but 9i does.
In Oracle 8i we can create 256 columns in a table, but in 9i we can have up to 1000 columns (fields).

How do you merge two files in DS?
Ans:
Either use the Copy command as a Before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
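For the first case, one hedged sketch: set the job's Before-job subroutine to ExecSH (or ExecDOS on Windows) with an input value along these lines, where the file paths are purely illustrative:

    cat /data/in/file1.txt /data/in/file2.txt > /data/in/merged.txt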

What is DS Designer used for?
Ans:
You use the Designer to build jobs by creating a visual design that models the flow and transformation of data from the data source through to the target warehouse. The Designer graphical interface lets you select stage icons, drop them onto the Designer work area, and add links.

What is DS Administrator used for?
Ans:
The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if National Language Support (NLS) is enabled, install and manage maps and locales.

What is DS Director used for?
Ans:
DataStage Director is used to validate, run, schedule, and monitor jobs. We can open DataStage Director from DataStage Designer itself.

What is DS Manager used for?
Ans:
The Manager is a graphical tool that enables you to view and manage the contents of the DataStage Repository.

What are Static Hash files and Dynamic Hash files?
Ans:
As the names suggest, a static hashed file has a fixed modulus (number of groups) defined when it is created, whereas a dynamic hashed file resizes itself as data is added. In general we use Type 30 dynamic hashed files. The data file has a default size limit of 2 GB, and the overflow file is used when groups overflow.

What is Hash file stage and what is it used for?
Ans:
Used for lookups. It is like a reference table. It is also used in place of ODBC or OCI tables for better performance.

How are the Dimension tables designed?
Ans:
Find where data for this dimension are located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension.
Change fact table and DW population routines.

Capturing Unmatched Records from a Join in Data Stage

The Join stage does not provide reject handling for unmatched records (such as in an InnerJoin scenario). If un-matched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (eg. Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of match on key values. 

During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the unmatched columns. Care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE.

 A Filter stage can be used to test for NULL values in unmatched columns.

 In some cases, it is simpler to use a Column Generator to add an ‘indicator’ column, with a constant value, to each of the outer links and test that column for the constant after you have performed the join. This is also handy with Lookups that have multiple reference links.

Data Stage - Lookup vs. Join Stages

The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. If the datasets are larger than available resources, the JOIN or MERGE stage should be used.

If the reference to a Lookup is directly from an Oracle table, and the number of input rows is significantly smaller (eg. 1:100 or more) than the number of reference rows, a Sparse Lookup may be appropriate.

Data Stage Transformer Usage Guidelines

Choosing Appropriate Stages

The parallel Transformer stage always generates “C” code which is then compiled to a parallel component. For this reason, it is important to minimize the number of transformers, and to use other stages (Copy, Filter, Switch, etc) when derivations are not needed. 

The Copy stage should be used instead of a Transformer for simple operations including:
  • Job design placeholder between stages (unless the Force option = true, EE will optimize this out at runtime)
  • Renaming columns
  • Dropping columns
  • Default type conversions

 

Note that rename, drop (if runtime column propagation is disabled), and default type conversion can also be performed by the output mapping tab of any stage.

NEVER use the “BASIC Transformer” stage in large-volume job flows. Instead, user-defined functions and routines can expand parallel Transformer capabilities.

Consider, if possible, implementing complex derivation expressions that follow regular patterns with a Lookup table instead of using a Transformer with nested derivations.

For example, the derivation expression:

 

If A=0,1,2,3 Then B="X"
If A=4,5,6,7 Then B="C"

Could be implemented with a lookup table containing values for column A and corresponding values of column B.
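For instance, the rule above could be held in a small two-column lookup table (contents illustrative); the Lookup stage then matches on column A and returns column B:

    A   B
    0   X
    1   X
    2   X
    3   X
    4   C
    5   C
    6   C
    7   C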

Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.
 

In v7 and later, the Filter and/or Switch stages can be used to separate rows into multiple output links based on SQL-like link constraint expressions. 

In v7 and later, the Modify stage can be used for non-default type conversions, null handling, and character string trimming. See section 7.5 for more information.

 Buildops should be used instead of Transformers in the handful of scenarios where complex reusable logic is required, or where existing Transformer-based job flows do not meet performance requirements.

Transformer NULL Handling and Reject Link

When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in DataStage Designer, right-click on an output link and choose “Convert to Reject”.

The Transformer rejects NULL derivation results because the rules for arithmetic and string handling of NULL values are by definition undefined. For this reason, always test for null values before using a column in an expression, for example:
 

If ISNULL(link.col) Then… Else…
 

Note that if an incoming column is only used in a pass-through derivation, the Transformer will allow this row to be output. DataStage release 7 enhances this behavior by placing warnings in the log file when discards occur.

Transformer Derivation Evaluation

Output derivations are evaluated BEFORE any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer.

 

For example, TrimLeadingTrailing(string) works only if string is a VarChar field. Thus, the incoming column must be type VarChar before it is evaluated in the Transformer.

Conditionally Aborting Jobs

The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows that match the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set the “Abort After Rows” count to the number of rows allowed before the job should be aborted.

Since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables. It is important to set the Sequential File buffer flush  or database commit parameters.

Data Stage Sequential File Stages (Import and Export) Performance Tuning

Improving Sequential File Performance

If the source file is fixed-width or delimited, the Readers Per Node option can be used to read a single input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is not maintained.

 If the input sequential file cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. To accomplish this, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage. 

On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in kilobytes, with a default of 128 (128 KB). Increasing this may improve performance.

Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.
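By way of illustration only (the values below are placeholders that should be tuned and tested against your own hardware), these variables could be set in the job or project environment as follows:

    APT_IMPORT_BUFFER_SIZE=256            # read (import) buffer in KB; default is 128
    APT_EXPORT_BUFFER_SIZE=256            # write (export) buffer in KB; default is 128
    APT_CONSISTENT_BUFFERIO_SIZE=1048576  # match a 1 MB array read/write size, in bytes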

Partitioning Sequential File Reads

Care must be taken to choose the appropriate partitioning method from a Sequential File read:

Don’t read from Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).

When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

Sequential File (Export) Buffering

By default, the Sequential File (export operator) stage buffers its writes to optimize performance. When a job completes successfully, the buffers are always flushed to disk. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with increased I/O.

 Reading from and Writing to Fixed-Length Files

Particular attention must be taken when processing fixed-length fields using the Sequential File stage:

If the incoming columns are variable-length data types (eg. Integer, Decimal, Varchar), the field width column property must be set to match the fixed-width of the input column. Double-click on the column number in the grid dialog to set this column property.
 

If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.
 

When writing fixed-length files from variable-length fields (eg. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed-width of the output column. Double-click on the column number in the grid dialog to set this column property.

 To display each field value, use the print_field import property. All import and export properties are listed in chapter 25, Import/Export Properties of the Orchestrate 7.0 Operators Reference.

 Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, these extra characters will be silently truncated.

Starting with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS directs DataStage to reject records with strings longer than their declared maximum column length.

Example that Reduces Contention in a Data Stage Job - Configuration File

The alternative to the first configuration method is more careful planning of the I/O behavior to reduce contention. You can imagine this could be hard given our hypothetical 6-way SMP with 4 disks because setting up the obvious one-to-one correspondence doesn't work. Doubling up some nodes on the same disk is unlikely to be good for overall performance since we create a hotspot. 

We could give every CPU two disks and rotate them around, but that would be little different than the previous strategy. So, let’s imagine a less constrained environment with two additional disks:

 computer host name “fastone”

6 CPUs

6 separate file systems on 6 drives named /fs0, /fs1, /fs2, /fs3, /fs4, /fs5

 

Now a configuration file for this environment might look like this:

 
                    {
                    node "n0" {
                                         pools ""
                                         fastname "fastone"
                                         resource disk "/fs0/ds/data" {pools ""}
                                         resource scratchdisk "/fs0/ds/scratch" {pools ""}
                    }
                    node "node2" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs1/ds/data" {pools ""}
                                         resource scratchdisk "/fs1/ds/scratch" {pools ""}
                    }
                    node "node3" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs2/ds/data" {pools ""}
                                         resource scratchdisk "/fs2/ds/scratch" {pools ""}
                    }
                    node "node4" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs3/ds/data" {pools ""}
                                         resource scratchdisk "/fs3/ds/scratch" {pools ""}
                    }
                    node "node5" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs4/ds/data" {pools ""}
                                         resource scratchdisk "/fs4/ds/scratch" {pools ""}
                    }
                    node "node6" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs5/ds/data" {pools ""}
                                         resource scratchdisk "/fs5/ds/scratch" {pools ""}
                    }
                    } /* end of entire config */

 

While this is the simplest scenario, it is important to realize that no single player, stage, or operator instance on any one partition can go faster than the single disk it has access to.

 

You could combine strategies by adding in a node pool where disks have a one-to-one association with nodes. These nodes would then not be in the default node pool, but a special one that you would specifically assign to stage / operator instances.

Smaller Configuration Example

Because disk and scratchdisk resources are assigned per node, depending on the total disk space required to process large jobs, it may be necessary to distribute file systems across nodes in smaller environments (fewer available CPUs/memory).

 

Using the above server example, this time with 4 nodes:

computer host name “fastone”

4 CPUs

8 separate file systems on 8 drives named /fs0 through /fs7

 

                    {
                    node "node1" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs0/ds/data" {pools ""} /* start with fs0 */
                                         resource disk "/fs4/ds/data" {pools ""}
                                         resource scratchdisk "/fs4/ds/scratch" {pools ""} /* start with fs4 */
                                         resource scratchdisk "/fs0/ds/scratch" {pools ""}
                    }
                    node "node2" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs1/ds/data" {pools ""}
                                         resource disk "/fs5/ds/data" {pools ""}
                                         resource scratchdisk "/fs5/ds/scratch" {pools ""}
                                         resource scratchdisk "/fs1/ds/scratch" {pools ""}
                    }
                    node "node3" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs2/ds/data" {pools ""}
                                         resource disk "/fs6/ds/data" {pools ""}
                                         resource scratchdisk "/fs6/ds/scratch" {pools ""}
                                         resource scratchdisk "/fs2/ds/scratch" {pools ""}
                    }
                    node "node4" {
                                         fastname "fastone"
                                         pools ""
                                         resource disk "/fs3/ds/data" {pools ""}
                                         resource disk "/fs7/ds/data" {pools ""}
                                         resource scratchdisk "/fs7/ds/scratch" {pools ""}
                                         resource scratchdisk "/fs3/ds/scratch" {pools ""}
                    }
                    } /* end of entire config */

 

The 4-node example above illustrates another concept in configuration file setup – you can assign multiple disk and scratch disk resources for each node. 

 

Unfortunately, physical limitations of available hardware and disk configuration don’t always lend themselves to “clean” configurations illustrated above.

 

Other configuration file tips:

Consider avoiding the disk(s) that your input files reside on. Often those disks will be hotspots until the input phase is over. If the job is large and complex this is less of an issue since the input part is proportionally less of the total work.
 

Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles even if they're located on a RAID system. Do not trust high-level RAID/SAN monitoring tools, as their “cache hit ratios” are often misleading.

Never use NFS file systems for scratchdisk resources. Know what's real and what's NFS: Real disks are directly attached, or are reachable over a SAN (storage-area network - dedicated, just for storage, low-level protocols).

 

Proper configuration of scratch and resource disk (and the underlying filesystem and physical hardware architecture) can significantly affect overall job performance. Beware if you use NFS (and, often SAN) filesystem space for disk resources. For example, your final result files may need to be written out onto the NFS disk area, but that doesn't mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to setup a "final" disk pool, and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.

Using Configuration Files in Data Stage Best Practices & Performance Tuning

The configuration file tells DataStage Enterprise Edition how to exploit underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources.

 

When you modify the system, by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without having to alter the job design.

 

There is not necessarily one ideal configuration file for a given system because of the high variability between the way different jobs work. For this reason, multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources. At runtime, the configuration file is specified through the environment variable $APT_CONFIG_FILE.
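For example, a job can be pointed at a different configuration simply by changing this variable (the path below is illustrative); in practice $APT_CONFIG_FILE is often exposed as a job parameter so that development and production runs can use different files:

    export APT_CONFIG_FILE=/opt/datastage/configs/prod_4node.apt   # illustrative path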

Logical Processing Nodes

The configuration file defines one or more EE processing nodes on which parallel jobs will run. EE processing nodes are a logical rather than a physical construct. For this reason, it is important to note that the number of processing nodes does not necessarily correspond to the actual number of CPUs in your system.

Within a configuration file, the number of processing nodes defines the degree of parallelism and resources that a particular job will use to run. It is up to the UNIX operating system to actually schedule and run the processes that make up a DataStage job across physical processors. A configuration file with a larger number of nodes generates a larger number of processes that use more memory (and perhaps more disk activity) than a configuration file with a smaller number of nodes.

While the DataStage documentation suggests creating half the number of nodes as physical CPUs, this is a conservative starting point that is highly dependent on system configuration, resource availability, job design, and other applications sharing the server hardware. For example, if a job is highly I/O dependent or dependent on external (eg. database) sources or targets, it may be appropriate to have more nodes than physical CPUs.

For typical production environments, a good starting point is to set the number of nodes equal to the number of CPUs. For development environments, which are typically smaller and more resource-constrained, create smaller configuration files (eg. 2-4 nodes). Note that even in the smallest development environments, a 2-node configuration file should be used to verify that job logic and partitioning will work in parallel (as long as the test data can sufficiently identify data discrepancies).
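For reference, a minimal 2-node development configuration might look like the sketch below; the host name and file-system paths are illustrative and follow the same layout as the other configuration files in this document:

                    {
                    node "node1" {
                                         fastname "devserver"
                                         pools ""
                                         resource disk "/fs0/ds/data" {pools ""}
                                         resource scratchdisk "/fs0/ds/scratch" {pools ""}
                    }
                    node "node2" {
                                         fastname "devserver"
                                         pools ""
                                         resource disk "/fs1/ds/data" {pools ""}
                                         resource scratchdisk "/fs1/ds/scratch" {pools ""}
                    }
                    } /* end of entire config */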

Optimizing Parallelism

The degree of parallelism of a DataStage EE application is determined by the number of nodes you define in the configuration file. Parallelism should be optimized rather than maximized. Increasing parallelism may better distribute your work load, but it also adds to your overhead because the number of processes increases. Therefore, you must weigh the gains of added parallelism against the potential losses in processing efficiency. The CPUs, memory, disk controllers and disk configuration that make up your system influence the degree of parallelism you can sustain.

 

Keep in mind that the closest equal partitioning of data contributes to the best overall performance of an application running in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew.

 

When business requirements dictate a partitioning strategy that is excessively skewed, remember to change the partition strategy to a more balanced one as soon as possible in the job flow. This will minimize the effect of data skew and significantly improve overall job performance.

Configuration File Examples

Given the large number of considerations for building a configuration file, where do you begin? For starters, the default configuration file (default.apt) created when DataStage is installed is appropriate for only the most basic environments.

 

The default configuration file has the following characteristics:

  • number of nodes = ½ number of physical CPUs
  • disk and scratchdisk storage use subdirectories within the DataStage install filesystem

 

You should create and use a new configuration file that is optimized to your hardware and file systems. Because different job flows have different needs (CPU-intensive? Memory-intensive? Disk-Intensive? Database-Intensive? Sorts? need to share resources with other jobs/databases/ applications? etc), it is often appropriate to have multiple configuration files optimized for particular types of processing.

 

With the synergistic relationship between hardware (number of CPUs, speed, cache, available system memory, number and speed of I/O controllers, local vs. shared disk, RAID configurations, disk size and speed, network configuration and availability), software topology (local vs. remote database access, SMP vs. Clustered processing), and job design, there is no definitive science for formulating a configuration file. This section attempts to provide some guidelines based on experience with actual production applications.

 

IMPORTANT: It is important to follow the order of all sub-items within individual node specifications in the example configuration files given in this section.

Example for Any Number of CPUs and Any Number of Disks

Assume you are running on a shared-memory multi-processor system, an SMP server, which is the most common platform today. Let’s assume these properties:

 

computer host name “fastone”

6 CPUs

4 separate file systems on 4 drives named /fs0, /fs1, /fs2, /fs3

 

You can adjust the sample to match your precise environment.

 

The configuration file you would use as a starting point would look like the one below. Assuming that the system load from processing outside of DataStage is minimal, it may be appropriate to create one node per CPU as a starting point.

 

In the following example, the way disk and scratchdisk resources are handled is the important point.

 

                    { /* config files allow C-style comments. */
 
                    /* Configuration files do not have flexible syntax. Keep all the sub-items of
                    the individual node specifications in the order shown here. */
 
 
                    node "n0" {
                                         pools "" /* on an SMP node pools aren’t used often. */
                                         fastname "fastone" 
                                         resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 */
                                         resource scratchdisk "/fs1/ds/scratch" {}
                                         resource scratchdisk "/fs2/ds/scratch" {}
                                         resource scratchdisk "/fs3/ds/scratch" {}
                                         resource disk "/fs0/ds/disk" {} /* start with fs0 */
                                         resource disk "/fs1/ds/disk" {}
                                         resource disk "/fs2/ds/disk" {}
                                         resource disk "/fs3/ds/disk" {}
                    }
 
                    node "n1" {
                                         pools "" 
                                         fastname "fastone"
                                         resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
                                         resource scratchdisk "/fs2/ds/scratch" {}
                                         resource scratchdisk "/fs3/ds/scratch" {}
                                         resource scratchdisk "/fs0/ds/scratch" {} 
                                         resource disk "/fs1/ds/disk" {} /* start with fs1 */
                                         resource disk "/fs2/ds/disk" {}
                                         resource disk "/fs3/ds/disk" {}
                                         resource disk "/fs0/ds/disk" {} 
                    }
 
                    node "n2" {
                                         pools "" 
                                         fastname "fastone"
                                         resource scratchdisk "/fs2/ds/scratch" {} /* start with fs2 */
                                         resource scratchdisk "/fs3/ds/scratch" {}
                                         resource scratchdisk "/fs0/ds/scratch" {} 
                                         resource scratchdisk "/fs1/ds/scratch" {} 
                                         resource disk "/fs2/ds/disk" {} /* start with fs2 */
                                         resource disk "/fs3/ds/disk" {}
                                         resource disk "/fs0/ds/disk" {} 
                                         resource disk "/fs1/ds/disk" {} 
                    }
 
                    node "n3" {
                                         pools "" 
                                         fastname "fastone"
                                         resource scratchdisk "/fs3/ds/scratch" {} /* start with fs3 */
                                         resource scratchdisk "/fs0/ds/scratch" {} 
                                         resource scratchdisk "/fs1/ds/scratch" {} 
                                         resource scratchdisk "/fs2/ds/scratch" {} 
                                         resource disk "/fs3/ds/disk" {} /* start with fs3 */
                                         resource disk "/fs0/ds/disk" {} 
                                         resource disk "/fs1/ds/disk" {} 
                                         resource disk "/fs2/ds/disk" {} 
                    }
 
                    node "n4" {
                                         pools "" 
                                         fastname "fastone"
                    /* Now we have rotated through starting with a different disk, but the fundamental problem 
                   * in this scenario is that there are more nodes than disks. So what do we do now? 
                   * The answer: something that is not perfect. We’re going to repeat the sequence. You could 
                   * shuffle differently i.e., use /fs0 /fs2 /fs1 /fs3 as an order, but that most likely won’t
                   * matter. */
                                         resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 again */
                                         resource scratchdisk "/fs1/ds/scratch" {}
                                         resource scratchdisk "/fs2/ds/scratch" {}
                                         resource scratchdisk "/fs3/ds/scratch" {}
                                         resource disk "/fs0/ds/disk" {} /* start with fs0 again */
                                         resource disk "/fs1/ds/disk" {}
                                         resource disk "/fs2/ds/disk" {}
                                         resource disk "/fs3/ds/disk" {}
                    }
 
                    node "n5" {
                                         pools ""
                                         fastname "fastone"
                                         resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
                                         resource scratchdisk "/fs2/ds/scratch" {}
                                         resource scratchdisk "/fs3/ds/scratch" {}
                                         resource scratchdisk "/fs0/ds/scratch" {}
                                         resource disk "/fs1/ds/disk" {} /* start with fs1 */
                                         resource disk "/fs2/ds/disk" {}
                                         resource disk "/fs3/ds/disk" {}
                                         resource disk "/fs0/ds/disk" {}
                    }
 
                    } /* end of entire config */
 
 

The pattern of the configuration file above is a “give every node all the disks” example, albeit in different orders to minimize I/O contention. This configuration method works well when the job flow is complex enough that it is difficult to determine and precisely plan for good I/O utilization.

 

Within each node, EE does not “stripe” the data across multiple filesystems. Rather, it fills the disk and scratchdisk filesystems in the order specified in the configuration file. In the example above, the order of the disks is purposely shifted for each node, in an attempt to minimize I/O contention.

 

Even in this example, giving every partition (node) access to all the I/O resources can cause contention, but EE attempts to minimize this by using fairly large I/O blocks.

 

This configuration style works for any number of CPUs and any number of disks since it doesn't require any particular correspondence between them. The heuristic here is: “When it’s too difficult to figure out precisely, at least go for achieving balance.”

Environment Variable Settings for Data Stage Best Practices and Performance Tuning

DataStage EE provides a number of environment variables to control how jobs operate on a UNIX system.  In addition to providing required information, environment variables can be used to enable or disable various DataStage features, and to tune performance settings. 

Data Stage Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise Edition jobs. These settings can be made at the project level, or may be set on an individual basis within the properties for each job.

 Environment Variable Settings For All Jobs

 

$APT_CONFIG_FILE = filepath
    Specifies the full pathname to the EE configuration file.

$APT_DUMP_SCORE = 1
    Outputs the EE score dump to the DataStage job log, providing detailed information about the actual job flow, including operators, processes, and datasets. Extremely useful for understanding how a job actually ran in the environment. (See section 10.1, Reading a Score Dump.)

$OSH_ECHO = 1
    Includes a copy of the generated osh in the job's DataStage log. Starting with v7, this option is enabled when the "Generated OSH visible for Parallel jobs in ALL projects" option is enabled in DataStage Administrator.

$APT_RECORD_COUNTS = 1
    Outputs record counts to the DataStage job log as each operator completes processing. The count is per operator per partition.

$APT_PM_SHOW_PIDS = 1
    Places entries in the DataStage job log showing the UNIX process ID (PID) for each process started by a job. Does not report PIDs of DataStage "phantom" processes started by Server shared containers.

$APT_BUFFER_MAXIMUM_TIMEOUT = 1
    Maximum buffer delay in seconds.

$APT_THIN_SCORE = 1 (DataStage 7.0 and earlier only)
    Setting this environment variable significantly reduces memory usage for very large (>100 operator) jobs.

 

Additional Environment Variable Settings

Ascential recommends setting the following environment variables on an as-needed basis. These variables can be used to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific EE stages. 

NOTE: The environment variable settings in this section are only examples. Set values that are optimal to your environment.

 Sequential File Stage Environment Variables

$APT_EXPORT_FLUSH_COUNT = [nrows]
    Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.

$APT_IMPORT_BUFFER_SIZE = [Kbytes]
$APT_EXPORT_BUFFER_SIZE = [Kbytes]
    Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports) respectively. The default is 128 (128 KB), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.

$APT_CONSISTENT_BUFFERIO_SIZE = [bytes]
    In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.

$APT_DELIMITED_READ_SIZE = [bytes]
    Specifies the number of bytes the Sequential File (import) stage reads ahead to find the next delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. This setting should be set to a lower value when reading from streaming inputs (eg. socket, FIFO) to avoid blocking.

$APT_MAX_DELIMITED_READ_SIZE = [bytes]
    By default, Sequential File (import) reads ahead 500 bytes to find the next delimiter. If it is not found, the importer looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable controls the upper bound, which is 100,000 bytes by default. When more than 500 bytes of read-ahead is desired, use this variable instead of APT_DELIMITED_READ_SIZE.

 

 Oracle Environment Variables

 

$ORACLE_HOME = [path]
    Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by Oracle scripts.

$ORACLE_SID = [sid]
    Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]
    These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method. Commits are made whenever the time interval has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.

$APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]
    Specifies Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1
    When set, a target Oracle stage with the Load method will limit the number of players to the number of datafiles in the table's tablespace.

$APT_ORA_WRITE_FILES = [filepath]
    Useful in debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with the Load method is written to files instead of invoking Oracle SQL*Loader. The filepath specified by this environment variable names the file containing the SQL*Loader commands.

$DS_ENABLE_RESERVED_CHAR_CONVERT = 1
    Allows DataStage to handle Oracle databases which use the special characters # and $ in column names.

 

 Job Monitoring Environment Variables

 

$APT_MONITOR_TIME = [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.

$APT_MONITOR_SIZE = [rows]
    Determines the minimum number of records the job monitor reports. The default of 5000 records is usually too small. To minimize the number of messages during large job runs, set this to a higher value (eg. 1000000).

$APT_NO_JOBMON = 1
    Disables job monitoring completely. In rare instances, this may improve performance. In general, this should only be set on a per-job basis when attempting to resolve performance bottlenecks.

$APT_RECORD_COUNTS = 1
    Prints record counts in the job log as each operator completes processing. The count is per operator per partition.

 

 
