myabinito: August 2011

Def

AbInitio is one of the popular ETL tools on the market.
The ETL process in AbInitio is represented by AbInitio graphs. Graphs are formed by components (from the standard components library or custom), flows (data streams) and parameters.
Co>Operating System is a program provided by AbInitio which runs on top of the operating system and is the base for all AbInitio processes.
It provides additional features known as air commands and can be installed on a variety of system environments such as Unix, HP-UX, Linux, IBM AIX and Windows systems. The Co>Operating System provides the following features:

- Manage and run AbInitio graphs and control the ETL processes
- Provides AbInitio extensions to the operating system
- ETL processes monitoring and debugging
- Metadata management and interaction with the EME

GDE is a graphical application for developers which is used for designing and running AbInitio graphs. It also provides:

- A user-friendly frontend for designing Ab Initio ETL graphs
- The ability to run and debug Ab Initio jobs and trace execution logs
- Graph compilation, which generates a UNIX shell script that may be executed on a machine without the GDE installed

Enterprise Meta>Environment (EME) is an AbInitio repository and environment for storing and managing metadata.
It provides the capability to store both business and technical metadata. EME metadata can be accessed from the Ab Initio GDE, a web browser or the AbInitio Co>Operating System command line (air commands).
Conduct>It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to create AbInitio Plans, a special type of graph constructed from other graphs and scripts. AbInitio provides both a graphical and a command-line interface to Conduct>It.
The Data Profiler is a graphical data analysis tool which runs on top of the Co>Operating system. It can be used to characterize data range, scope, distribution, variance, and quality.
Ab Initio implements parallelism in three main ways:
Data parallelism – data is divided among many partitions known as multi-files. During processing, each partition is processed in parallel.
Component parallelism – multiple components are run in parallel. Components execute simultaneously on different branches of a graph.
Pipeline parallelism – one component processes a record while another component is still processing a previous record. Operations like sorting and aggregation break pipeline parallelism.
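
Why sorting breaks pipeline parallelism can be illustrated outside Ab Initio. The following is a minimal Python sketch, not Ab Initio code, and the stage names (read_records, reformat, sort_all) are hypothetical. Generators only interleave work rather than run in parallel, but they show the key point: a streaming stage hands each record downstream as soon as it is ready, while a sort has to read its whole input before it can emit anything.

```python
def read_records(values, log):
    """Hypothetical input stage: emits records one at a time."""
    for v in values:
        log.append(f"read {v}")
        yield v


def reformat(records, log):
    """Streaming stage: handles each record as soon as it arrives."""
    for r in records:
        log.append(f"reformat {r}")
        yield r * 10


def sort_all(records, log):
    """Blocking stage: consumes every record before emitting the first."""
    buffered = sorted(records)
    log.append("sort done")
    yield from buffered


def run_pipeline(values, with_sort):
    """Run read -> (optional sort) -> reformat and return (output, event log)."""
    log = []
    stream = read_records(values, log)
    if with_sort:
        stream = sort_all(stream, log)
    result = list(reformat(stream, log))
    return result, log


if __name__ == "__main__":
    # Without the sort, "read" and "reformat" events alternate.
    print(run_pipeline([3, 1, 2], with_sort=False)[1])
    # With the sort, every "read" happens before the first "reformat".
    print(run_pipeline([3, 1, 2], with_sort=True)[1])
```

In the first run the log alternates between reading and reformatting; in the second, all reads complete before any downstream work starts.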

Components

The different components of the Ab Initio architecture are:

1. Dataset Components
2. Database Components
3. Partition Components
4. DePartition Components
5. Sort Components
6. Transform Components

Link: http://ab-initio-tutorials.blogspot.com/2010/01/components.html

Advanced Concepts: http://ab-initio-tutorials.blogspot.com/2010/01/advanced-concepts.html

Sample Interview Questions

What is the relation between the EME, the GDE and the Co>Operating System?

ans. EME stands for Enterprise Meta>Environment, GDE for Graphical Development Environment, and the Co>Operating System can be thought of as the Ab Initio server.
The relation between the Co>Operating System, the EME and the GDE is as follows.
The Co>Operating System is the Ab Initio server. It is installed on a particular OS platform, which is called the native OS. The EME is like the repository in Informatica: it holds the metadata, transformations, DB config files, and source and target information. The GDE is the end-user environment where we develop graphs (like mappings in Informatica).
The designer uses the GDE to design graphs and saves them to the EME or to a sandbox. The sandbox is on the user side, whereas the EME is on the server side.

What is the use of aggregation when we have rollup?

As we know, the rollup component in Ab Initio is used to summarise groups of data records. Then where would we use aggregation?
ans: Aggregate and Rollup can both summarise data, but Rollup is much more convenient to use. When you need to understand how a particular summarisation is performed, Rollup is much more explanatory than Aggregate. Rollup can also do other things, such as input and output filtering of records.
Aggregate and Rollup perform the same action, but Rollup makes intermediate results available in main memory, whereas Aggregate does not support intermediate results.

What kinds of layouts does Ab Initio support?

Basically there are serial and parallel layouts supported by AbInitio. A graph can have both at the same time. The parallel one depends on the degree of data parallelism. If the multi-file system is 4-way parallel, then a component in a graph can run 4-way parallel if its layout is defined to match that degree of parallelism.

How can you run a graph infinitely?

To run a graph infinitely, the end script in the graph should call the .ksh file of the graph. Thus, if the name of the graph is abc.mp, then the end script of the graph should contain a call to abc.ksh.
This way the graph will keep running indefinitely.

How do you add default rules in transformer?

Double-click the transform parameter on the Parameters tab page of the component properties; this opens the transform editor. In the transform editor, click the Edit menu and then select Add Default Rules from the dropdown. It will show two options: 1) Match Names 2) Wildcard.

Do you know what a local lookup is?

If your lookup file is a multifile and is partitioned/sorted on a particular key, then the local lookup function can be used instead of the plain lookup function call. The lookup is then local to a particular partition, depending on the key.

A Lookup File consists of data records which can be held in main memory. This lets the transform function retrieve the records much faster than retrieving them from disk. It allows the transform component to process the data records of multiple files quickly.
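
The idea behind a local lookup can be sketched in Python. This is an illustration only, not Ab Initio code: the function names, the sample fields ("code", "name") and the use of zlib.crc32 as the hash are all assumptions. Both the data and the lookup rows are partitioned on the same key, so each partition only needs its own slice of the lookup in memory.

```python
import zlib


def partition_index(key, n_partitions):
    """Deterministic stand-in for a hash partitioner (assumption, not Ab Initio's hash)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def partition_by_key(records, key_field, n_partitions):
    """Send every record to the partition chosen by the hash of its key."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[partition_index(record[key_field], n_partitions)].append(record)
    return partitions


def local_lookup_join(data, lookup_rows, key_field, n_partitions):
    """Enrich each data record using only the lookup slice of its own partition."""
    data_parts = partition_by_key(data, key_field, n_partitions)
    lookup_parts = partition_by_key(lookup_rows, key_field, n_partitions)
    output = []
    for data_part, lookup_part in zip(data_parts, lookup_parts):
        # In-memory index holding only this partition's share of the lookup.
        local_index = {row[key_field]: row for row in lookup_part}
        for record in data_part:
            match = local_index.get(record[key_field])
            output.append({**record, "name": match["name"] if match else None})
    return output


if __name__ == "__main__":
    data = [{"code": "IN", "amount": 1}, {"code": "US", "amount": 2}]
    lookup_rows = [{"code": "IN", "name": "India"}, {"code": "US", "name": "United States"}]
    print(local_lookup_join(data, lookup_rows, "code", n_partitions=3))
```

Because a given key always hashes to the same partition, a record never needs to consult another partition's lookup data.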

What is the difference between look-up file and look-up, with a relevant example?

Generally, a Lookup File represents one or more serial files (flat files). The amount of data is small enough to be held in memory. This allows transform functions to retrieve records much more quickly than they could from disk.
A lookup is a component of an Ab Initio graph where we can store data and retrieve it by using a key parameter.
A lookup file is the physical file where the data for the lookup is stored.

How many components are in your most complicated graph?

It depends on the type of components you use.
Usually, avoid using very complicated transform functions in a graph.

Explain what is lookup?

A lookup is basically a specific dataset which is keyed. It can be used to map values according to the data present in a particular file (serial or multi-file). The dataset can be static as well as dynamic (for example, when the lookup file is generated in a previous phase and used as a lookup file in the current phase). Sometimes hash joins can be replaced by a reformat and a lookup, if one of the inputs to the join contains a small number of records with a slim record length.
AbInitio has built-in functions to retrieve values using the key for the lookup

What is a ramp limit?

The limit parameter contains an integer that represents a number of reject events.
The ramp parameter contains a real number that represents a rate of reject events in the number of records processed.
Number of bad records allowed = limit + (number of records * ramp).
The ramp is basically a rate expressed as a fraction (from 0 to 1).
Together, these two parameters give the threshold number of bad records.
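
The threshold formula above can be worked through in a few lines of Python. This is only an arithmetic sketch: the function names are hypothetical, and treating the threshold as exceeded only when the reject count is strictly greater is an assumption.

```python
def reject_threshold(limit, ramp, records_processed):
    """Bad records allowed = limit + records_processed * ramp."""
    return limit + records_processed * ramp


def exceeds_threshold(rejects, limit, ramp, records_processed):
    """Assumption: the threshold is exceeded only when rejects are strictly greater."""
    return rejects > reject_threshold(limit, ramp, records_processed)


if __name__ == "__main__":
    # limit = 5, ramp = 0.25, 100 records processed -> 5 + 100 * 0.25 bad records allowed
    print(reject_threshold(5, 0.25, 100))
    print(exceeds_threshold(31, 5, 0.25, 100))
```

With ramp set to 0 the threshold is just the fixed limit; with limit set to 0 it is purely proportional to the number of records processed.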

Have you worked with packages?

Multistage transform components use packages by default. However, users can create their own set of functions in a transfer function and include it in other transfer functions.

Have you used rollup component? Describe how.

If the user wants to group records on particular field values, then rollup is the best way to do that. Rollup is a multi-stage transform function and it contains the following mandatory functions.
1. initialise
2. rollup
3. finalise
You also need to declare a temporary variable if you want to get counts for a particular group.

For each group, it first calls the initialise function once, then calls the rollup function for each record in the group, and finally calls the finalise function once after the last rollup call.
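
The calling sequence described above can be mimicked in Python. This is a sketch of the pattern only, not Ab Initio DML: the function signatures, the temporary dictionary and the sample fields ("customer", "amount") are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def initialise(first_record):
    """Called once per group: set up the temporary variable."""
    return {"count": 0, "total": 0}


def rollup(temp, record):
    """Called for every record in the group: update the temporary variable."""
    temp["count"] += 1
    temp["total"] += record["amount"]
    return temp


def finalise(temp, last_record):
    """Called once after the last rollup call: build the output record."""
    return {"customer": last_record["customer"], "count": temp["count"], "total": temp["total"]}


def run_rollup(records, key_field):
    """Group records on key_field and apply initialise / rollup / finalise per group."""
    output = []
    ordered = sorted(records, key=itemgetter(key_field))
    for _, group in groupby(ordered, key=itemgetter(key_field)):
        temp = None
        last = None
        for record in group:
            if temp is None:
                temp = initialise(record)
            temp = rollup(temp, record)
            last = record
        output.append(finalise(temp, last))
    return output


if __name__ == "__main__":
    rows = [
        {"customer": "a", "amount": 10},
        {"customer": "b", "amount": 5},
        {"customer": "a", "amount": 7},
    ]
    print(run_rollup(rows, "customer"))
```

The temporary dictionary plays the role of the temporary variable mentioned above: it carries the running count and total from one rollup call to the next.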

How do you add default rules in transformer?

Add Default Rules opens the Add Default Rules dialog. Select one of the following:
Match Names: generates a set of rules that copies input fields to output fields with the same name.
Use Wildcard (.*) Rule: generates one rule that copies input fields to output fields with the same name.

To open the dialog:
1) If it is not already displayed, display the Transform Editor Grid.
2) Click the Business Rules tab if it is not already displayed.
3) Select Edit > Add Default Rules.

In the case of Reformat, if the destination field names are the same as, or a subset of, the source field names, there is no need to write anything in the reformat xfr, provided you do not need any real transformation beyond reducing the set of fields or splitting the flow into a number of flows.

What is the difference between partitioning with key and round robin?

Partition by Key or hash partition -> This is a partitioning technique which is used to partition data when the keys are diverse. If a single key value occurs in large volume, there can be large data skew. But this method is used more often for parallel data processing.

Round robin partition is another partitioning technique, which distributes the data uniformly across the destination data partitions. The skew is zero in this case when the number of records is divisible by the number of partitions. A real-life example is how a pack of 52 cards is distributed among 4 players in a round-robin manner.
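
The two techniques can be compared with a small Python sketch. It is an illustration rather than Ab Initio code: the function names are hypothetical and zlib.crc32 merely stands in for whatever hash a real partitioner uses.

```python
import zlib


def partition_round_robin(records, n_partitions):
    """Deal records out in turn, like cards to players."""
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        partitions[i % n_partitions].append(record)
    return partitions


def partition_by_key(records, key_of, n_partitions):
    """Send records with the same key to the same partition (hash is an assumption)."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        index = zlib.crc32(str(key_of(record)).encode("utf-8")) % n_partitions
        partitions[index].append(record)
    return partitions


if __name__ == "__main__":
    cards = list(range(52))
    print([len(p) for p in partition_round_robin(cards, 4)])

    # One "hot" key dominating the input leads to skew under key partitioning.
    skewed = ["hot"] * 90 + [f"k{i}" for i in range(10)]
    print([len(p) for p in partition_by_key(skewed, lambda r: r, 4)])
```

Round robin gives every player 13 cards, while the key-based split puts all 90 "hot" records into a single partition. The trade-off is that only key partitioning guarantees that records with the same key end up together.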

How do you improve the performance of a graph?

There are many ways the performance of the graph can be improved.
1) Use a limited number of components in a particular phase
2) Use optimum value of max core values for sort and join components
3) Minimise the number of sort components
4) Minimise sorted join component and if possible replace them by in-memory join/hash join
5) Use only required fields in the sort, reformat, join components
6) Use phasing/flow buffers in case of merge, sorted joins
7) If the two inputs are huge then use sorted join, otherwise use hash join with proper driving port
8) For large dataset don't use broadcast as partitioner
9) Minimise the use of regular expression functions like re_index in the transfer functions
10) Avoid repartitioning of data unnecessarily

Try to run the graph in MFS for as long as possible. For this, the input files should be partitioned and, if possible, the output file should also be partitioned.

How do you truncate a table?

From Ab Initio, use the Run SQL component with the DDL "truncate table".
Alternatively, use the Truncate Table component in Ab Initio.

Have you ever encountered an error called "depth not equal"?

When two components are linked together and their layouts do not match, this problem can occur during the compilation of the graph. A solution is to put a partitioning component in between wherever the layout changes.
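
The check behind this error can be pictured with a simplified Python model. Everything here is an assumption made for illustration: the component names, the depth numbers and the two flow kinds ("straight" and "fan_out") are hypothetical, and the real compiler's checks may differ.

```python
def find_depth_mismatches(depths, flows):
    """Return the straight flows whose two ends run at different depths.

    depths: component name -> number of partitions in its layout
    flows:  list of (source, target, kind) tuples
    """
    mismatches = []
    for source, target, kind in flows:
        # Only straight flows require both ends to have the same depth.
        if kind == "straight" and depths[source] != depths[target]:
            mismatches.append((source, target))
    return mismatches


if __name__ == "__main__":
    depths = {"read_serial": 1, "partition_rr": 1, "reformat": 4}

    # Serial input wired directly to a 4-way component: depth mismatch.
    print(find_depth_mismatches(depths, [("read_serial", "reformat", "straight")]))

    # Same graph with a partitioning component in between: no mismatch.
    fixed = [
        ("read_serial", "partition_rr", "straight"),
        ("partition_rr", "reformat", "fan_out"),
    ]
    print(find_depth_mismatches(depths, fixed))
```

In this model the partitioning component runs at the depth of its input and is the only place where the depth is allowed to change.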

What is the function you would use to transfer a string into a decimal?

In this case no specific function is required if the size of the string and the decimal is the same. A decimal cast with the size in the transform function will suffice. For example, if the source field is defined as string(8) and the destination as decimal(8) (say the field name is field1):

out.field1 :: (decimal(8)) in.field1;

If the destination field size is smaller than the input, the string_substring function can be used as follows.
Say the destination field is decimal(5).

out.field1 :: (decimal(5)) string_lrtrim(string_substring(in.field1, 1, 5)); /* string_lrtrim used to trim leading and trailing spaces */
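
For readers who want to see the same steps outside DML, here is a rough Python equivalent. It is a sketch under assumptions: the function name is hypothetical, and Python's Decimal does not reproduce the exact behaviour of an Ab Initio decimal cast (for example on invalid input).

```python
from decimal import Decimal


def string_to_decimal(value, width):
    """Take the first `width` characters, trim spaces, then convert to a decimal."""
    truncated = value[:width]        # like string_substring(value, 1, width), which is 1-based
    trimmed = truncated.strip(" ")   # like string_lrtrim: drop leading and trailing spaces
    return Decimal(trimmed)


if __name__ == "__main__":
    print(string_to_decimal("12345678", 5))
    print(string_to_decimal("  123   ", 5))
```

Truncating before trimming follows the order of the nested calls in the DML expression above.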
What are primary keys and foreign keys?

Intro

Ab Initio means “from the beginning”. Ab-Initio software works with the client-server model.

The client is called the “Graphical Development Environment” (you can call it GDE). It resides on the user's desktop. The server or back-end is called the “Co-Operating System”. The Co-Operating System can reside on a mainframe or a remote UNIX machine.

The Ab-Initio code is called a graph, which has the .mp extension. The graph from the GDE must be deployed as a corresponding .ksh version. On the Co-Operating System, the corresponding .ksh is run to do the required job.

How Ab-Initio Job Is Run

What happens when you push the “Run” button?
• Your graph is translated into a script that can be executed in the Shell Development
• This script and any metadata files stored on the GDE client machine are shipped (via FTP) to the server.
• The script is invoked (via REXEC or TELNET) on the server.
• The script creates and runs a job that may run across many hosts.
• Monitoring information is sent back to the GDE client.

Ab-Initio Environment

The advantage of Ab-Initio code is that it can run in both serial and multi-file system environments.
Serial Environment: The normal UNIX file system.
Multi-File System: The Multi-File System (mfs) is meant for parallelism. In an mfs, a particular file is physically stored across different partitions of the machine, or even on different machines, but is pointed to by a logical file, which is stored in the Co-Operating System. The logical file is the control file which holds the pointers to the physical locations.

About Ab-Initio Graphs

An Ab-Initio graph comprises a number of components that serve different purposes. Data is read or written by a component according to the dml (do not confuse this with the database “data manipulation language”). The most commonly used components are described in the following sections.

Co>Operating System

Co>Operating System is a program provided by AbInitio which runs on top of the operating system and is the base for all AbInitio processes. It provides additional features known as air commands and can be installed on a variety of system environments such as Unix, HP-UX, Linux, IBM AIX and Windows systems. The AbInitio Co>Operating System provides the following features:
- Manage and run AbInitio graphs and control the ETL processes
- Provides AbInitio extensions to the operating system
- ETL processes monitoring and debugging
- Metadata management and interaction with the EME

AbInitio GDE (Graphical Development Environment)

GDE is a graphical application for developers which is used for designing and running AbInitio graphs. The ETL process in AbInitio is represented by AbInitio graphs. Graphs are formed by components (from the standard components library or custom), flows (data streams) and parameters. GDE also provides:
- A user-friendly frontend for designing Ab Initio ETL graphs
- The ability to run and debug Ab Initio jobs and trace execution logs
- Graph compilation, which generates a UNIX shell script that may be executed on a machine without the GDE installed

AbInitio EME

Enterprise Meta>Environment (EME) is an AbInitio repository and environment for storing and managing metadata. It provides the capability to store both business and technical metadata. EME metadata can be accessed from the Ab Initio GDE, a web browser or the AbInitio Co>Operating System command line (air commands).

Conduct>It

Conduct>It is an environment for creating enterprise Ab Initio data integration systems. Its main role is to create AbInitio Plans, a special type of graph constructed from other graphs and scripts. AbInitio provides both a graphical and a command-line interface to Conduct>It.

Data Profiler

The Data Profiler is an analytical application that can specify data range, scope, distribution, variance, and quality. It runs in a graphic environment on top of the Co>Operating system.

Component Library

The Ab Initio Component Library is a reusable software module for sorting, data transformation, and high-speed database loading and unloading. This is a flexible and extensible tool which adapts at runtime to the formats of records entered and allows creation and incorporation of new components obtained from any program that permits integration and reuse of external legacy codes and storage engines.

Wednesday, 31 August 2011
