g

Friday, 2 December 2011

What is Staging area why we need it in DWH?


If target and source databases are different and target table volume is high it contains some millions of records in this scenario without staging table we need to design your informatica using look up to find out whether the record exists or not in the target table since target has huge volumes so its costly to create cache it will hit the performance.

If we create staging tables in the target database we can simply do outer join in the source qualifier to determine insert/update this approach will give you good performance.

It will avoid full table scan to determine insert/updates on target. And also we can create index on staging tables since these tables were designed for specific application it will not impact to any other schemas/users.
While processing flat files to data warehousing we can perform cleansing.

Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data is correct and accurate. During data cleansing, records are checked for accuracy and consistency.
  • Since it is one-to-one mapping from ODS to staging we do truncate and reload.
  • We can create indexes in the staging state, to perform our source qualifier best.
  • If we have the staging area no need to relay on the informatics transformation to known whether the record exists or not.

Data cleansing
Weeding out unnecessary or unwanted things (characters and spaces etc) from incoming data to make it more meaningful and informative.

Data merging
Data can be gathered from heterogeneous systems and put together.

Data scrubbing
Data scrubbing is the process of fixing or eliminating individual pieces of data that are incorrect, incomplete or duplicated before the data is passed to end user. 

Data scrubbing is aimed at more than eliminating errors and redundancy. The goal is also to bring consistency to various data sets that may have been created with different, incompatible business rules.

posted by Jeyakumar @ Friday, December 02, 2011, ,


Informatica Metadata


I am sure every PowerCenter developer either has an intention or necessity to know about the Informatica metadata tables and where information is stored etc. For the starters, all the objects that we create in Informatica PowerCenter - let them be sources, targets, mappings, workflows, sessions, expressions, be it anything related to PowerCenter, will get stored in a set of database tables (call them as metadata tables or OPB tables or repository tables).
  • I want to know all the sessions in my folder that are calling some shell script/command in the Post-Session command task.
  • I want to know how many mappings have transformations that contain "STOCK_CODE" defined as a port.
  • I want to know all unused ports in my repository of 100 folders.
In repositories where you have many number of sessions or workflows or mappings, it gets difficult to achieve this with the help of Informatica PowerCenter client tools. After all, whole of this data is stored in some form in the metadata tables. So if you know the data model of these repository tables, you will be in a better position to answer these questions.

Before we proceed further, let me clearly urge for something very important. Data in the repository/metadata/OPB tables is very sensitive and that the modifications like insert or updates are to be made using the PowerCenter tools ONLY. DO NOT DIRECTLY USE UPDATE OR INSERT COMMANDS AGAINST THESE TABLES.

Please also note that there is no official documentation from Informatica Corporation on how these tables act. It is purely based on my assumption, research and experience that I am providing these details. I will not be responsible to any of the damages caused if you use any statement other than the SELECT, knowing the details from this blog article. This is my disclaimer. Let us move on to the contents now.

There around a couple of hundred OPB tables in 7.x version of PowerCenter, but in 8.x, this number crosses 400. In this regard, I am going to talk about few important tables in this articles. As such, this is not a small topic to cover in one article. I shall write few more to cover other important tables like OPB_TDS, OPB_SESSLOG etc.

We shall start with OPB_SUBJECT now.

OPB_SUBJECT - PowerCenter folders table

This table stores the name of each PowerCenter repository folder.

Usage: Join any of the repository tables that have SUBJECT_ID as column with that of SUBJ_ID in this table to know the folder name.

OPB_MAPPING - Mappings table

This table stores the name and ID of each mapping and its corresponding folder.

Usage: Join any of the repository tables that have MAPPING_ID as column with that of MAPPING_ID in this table to know the mapping name.

OPB_TASK - Tasks table like sessions, workflow etc

This table stores the name and ID of each task like session, workflow and its corresponding folder.

Usage: Join any of the repository tables that have TASK_ID as column with that of TASK_ID/SESSION_ID in this table to know the task name. Observe that the session and also workflow are stored as tasks in the repository. TASK_TYPE for session is 68 and that of the workflow is 71.

OPB_SESSION - Session & Mapping linkage table

This table stores the linkage between the session and the corresponding mapping. As informed in the earlier paragraph, you can use the SESSION_ID in this table to join with TASK_ID of OPB_TASK table.

OPB_TASK_ATTR - Task attributes tables

This is the table that stores the attribute values (like Session log name etc) for tasks.

Usage: Use the ATTR_ID of this table to that of the ATTR_ID of OPB_ATTR table to find what each attribute in this table means. You can know more about OPB_ATTR table in the next paragraphs.

OPB_WIDGET - Transformations table

This table stores the names and IDs of all the transformations with their folder details.

Usage: Use WIDGET_ID from this table to that of the WIDGET_ID of any of the tables to know the transformation name and the folder details. Use this table in conjunction with OPB_WIDGET_ATTR or OPB_WIDGET_EXPR to know more about each transformation etc.

OPB_WIDGET_FIELD - Transformation ports table

This table stores the names and IDs of all the transformation fields for each of the transformations.

Usage: Take the FIELD_ID from this table and match it against the FIELD_ID of any of the tables like OPB_WIDGET_DEP and you can get the corresponding information.

OPB_WIDGET_ATTR - Transformation properties table

This table stores all the properties details about each of the transformations.

Usage: Use the ATTR_ID of this table to that of the ATTR_ID of OPB_ATTR table to find what each attribute in this transformation means.

OPB_EXPRESSION - Expressions table

This table stores the details of the expressions used anywhere in PowerCenter.

Usage: Use this table in conjunction with OPB_WIDGET/OPB_WIDGET_INST and OPB_WIDGET_EXPR to get the expressions in the Expression transformation for a particular, mapping or a set.

OPB_ATTR - Attributes

This table has a list of attributes and their default values if any. You can get the ATTR_ID from this table and look it up against any of the tables where you can get the attribute value. You should also make a note of the ATTR_TYPE, OBJECT_TYPE_ID before you pick up the ATTR_ID. You can find the same ATTR_ID in the table, but with different ATTR_TYPE or OBJECT_TYPE_ID.

OPB_COMPONENT - Session Component

This table stores the component details like Post-Session-Success-Email, commands in Post-Session/pre-Session etc.

Usage: Match the TASK_ID with that of the SESSION_ID in OPB_SESSION table to get the SESSION_NAME and to get the shell command or batch command that is there for the session, join this table with OPB_TASK_VAL_LIST table on TASK_ID.

OPB_CFG_ATTR - Session Configuration Attributes

This table stores the attribute values for Session Object configuration like "Save Session log by", Session log path

posted by Jeyakumar @ Friday, December 02, 2011, ,


Unix interview questions

How do send the session report (.txt) to manager after session is completed?
Email variable - %a (attach the file) %g – attach session log file

How do identify the empty line in a flat file in Unix? How to remove it?
grep –v “^$” filename

List the files in ascending order in Unix?
ls  -lt (sort by last date modified)  ls –ltr (reverse)
ls –lS (sort by size of the file)

How to open a Database using Unix Command
mysql -u uname -h hostname –p pwd

What is command to check space in Unix?
Ans: df -k 

What is command to kill last background Job? 
Ans: kill $!

How you will list all Hidden files? 
Ans: ls -la|grep ^[.] 

How to kill a process forcibly?
Ans: kill -9 PID (Process Identification Number) 

How to print/display the first line of a file?
There are many ways to do this. However the easiest way to display the first line of a file is using the [head] command.
$> head -1 file.txt
No prize in guessing that if you specify [head -2] then it would print first 2 records of the file.
Another way can be by using [sed] command. [Sed] is a very powerful text editor which can be used for various text manipulation purposes like this.
$> sed '2,$ d' file.txt
How does the above command work? The 'd' parameter basically tells [sed] to delete all the records from display from line 2 to last line of the file (last line is represented by $ symbol). Of course it does not actually delete those lines from the file, it just does not display those lines in standard output screen. So you only see the remaining line which is the 1st line.

How to print/display the last line of a file?
The easiest way is to use the [tail] command.
$> tail -1 file.txt
If you want to do it using [sed] command, here is what you should write:
$> sed -n '$ p' test
From our previous answer, we already know that '$' stands for the last line of the file. So '$ p' basically prints (p for print) the last line in standard output screen. '-n' switch takes [sed] to silent mode so that [sed] does not print anything else in the output.

How to display n-th line of a file?
The easiest way to do it will be by using [sed] I guess. Based on what we already know about [sed] from our previous examples, we can quickly deduce this command:
$> sed –n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> sed –n '4 p' test
Of course you can do it by using [head] and [tail] command as well like below:
$> head -<n> file.txt | tail -1
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be
$> head -4 file.txt | tail -1

How to remove the first line / header from a file?
We already know how [sed] can be used to delete a certain line from the output – by using the'd' switch. So if we want to delete the first line the command should be:
$> sed '1 d' file.txt
But the issue with the above command is, it just prints out all the lines except the first line of the file on the standard output. It does not really change the file in-place. So if you want to delete the first line from the file itself, you have two options.
Either you can redirect the output of the file to some other file and then rename it back to original file like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use an inbuilt [sed] switch '–i' which changes the file in-place. See below:
$> sed –i '1 d' file.txt

How to remove the last line/ trailer from a file in Unix script?
Always remember that [sed] switch '$' refers to the last line. So using this knowledge we can deduce the below command:
$> sed –i '$ d' file.txt

How to remove certain lines from a file in Unix?
If you want to remove line <m> to line <n> from a given file, you can accomplish the task in the similar method shown above. Here is an example:
$> sed –i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt

How to remove the last n-th line from a file?
This is bit tricky. Suppose your file contains 100 lines and you want to remove the last 5 lines. Now if you know how many lines are there in the file, then you can simply use the above shown method and can remove all the lines from 96 to 100 like below:
$> sed –i '96,100 d' file.txt   # alternative to command [head -95 file.txt]
But not always you will know the number of lines present in the file (the file may be generated dynamically, etc.) In that case there are many different ways to solve the problem. There are some ways which are quite complex and fancy. But let's first do it in a way that we can understand easily and remember easily. Here is how it goes:
$> tt=`wc -l file.txt | cut -f1 -d' '`;sed –i "`expr $tt - 4`,$tt d" test
As you can see there are two commands. The first one (before the semi-colon) calculates the total number of lines present in the file and stores it in a variable called “tt”. The second command (after the semi-colon), uses the variable and works in the exact way as shows in the previous example.

How to check the length of any line in a file?
We already know how to print one line from a file which is this:
$> sed –n '<n> p' file.txt
Where <n> is to be replaced by the actual line number that you want to print. Now once you know it, it is easy to print out the length of this line by using [wc] command with '-c' switch.
$> sed –n '35 p' file.txt | wc –c
The above command will print the length of 35th line in the file.txt.

How to get the nth word of a line in Unix?
Assuming the words in the line are separated by space, we can use the [cut] command. [cut] is a very powerful and useful command and it's real easy. All you have to do to get the n-th word from the line is issue the following command:
cut –f<n> -d' '
'-d' switch tells [cut] about what is the delimiter (or separator) in the file, which is space ' ' in this case. If the separator was comma, we could have written -d',' then. So, suppose I want find the 4th word from the below string: “A quick brown fox jumped over the lazy cat”, we will do something like this:
$> echo “A quick brown fox jumped over the lazy cat” | cut –f4 –d' '
And it will print “fox”

How to reverse a string in unix?
Pretty easy. Use the [rev] command.
$> echo "unix" | rev
xinu

How to get the last word from a line in Unix file?
We will make use of two commands that we learnt above to solve this. The commands are [rev] and [cut]. Here we go.
Let's imagine the line is: “C for Cat”. We need “Cat”. First we reverse the line. We get “taC rof C”. Then we cut the first word, we get 'taC'. And then we reverse it again.
$>echo "C for Cat" | rev | cut -f1 -d' ' | rev
Cat

How to get the n-th field from a Unix command output?
We know we can do it by [cut]. Like below command extracts the first field from the output of [wc –c] command
$>wc -c file.txt | cut -d' ' -f1
109
But I want to introduce one more command to do this here. That is by using [awk] command. [awk] is a very powerful command for text pattern scanning and processing. Here we will see how may we use of [awk] to extract the first field (or first column) from the output of another command. Like above suppose I want to print the first column of the [wc –c] output. Here is how it goes like this:
$>wc -c file.txt | awk ' ''{print $1}'
109
The basic syntax of [awk] is like this:
awk 'pattern space''{action space}'
The pattern space can be left blank or omitted, like below:
$>wc -c file.txt | awk '{print $1}'
109
In the action space, we have asked [awk] to take the action of printing the first column ($1). More on [awk] later.

How to replace the n-th line in a file with a new line in Unix?
This can be done in two steps. The first step is to remove the n-th line. And the second step is to insert a new line in n-th line position. Here we go.
Step 1: remove the n-th line
$>sed -i'' '10 d' file.txt       # d stands for delete
Step 2: insert a new line at n-th line position
$>sed -i'' '10 i This is the new line' file.txt     # i stands for insert

How to show the non-printable characters in a file?
Open the file in VI editor. Go to VI command mode by pressing [Escape] and then [:]. Then type [set list]. This will show you all the non-printable characters, e.g. Ctrl-M characters (^M) etc., in the file.

How to zip a file in Linux?
Use inbuilt [zip] command in Linux

How to unzip a file in Linux?
Use inbuilt [unzip] command in Linux.
$> unzip –j file.zip

How to test if a zip file is corrupted in Linux?
Use “-t” switch with the inbuilt [unzip] command
$> unzip –t file.zip

How to check if a file is zipped in Unix?
In order to know the file type of a particular file use the [file] command like below:
$> file file.txt
file.txt: ASCII text
If you want to know the technical MIME type of the file, use “-i” switch.
$>file -i file.txt
file.txt: text/plain; charset=us-ascii
If the file is zipped, following will be the result
$> file –i file.zip
file.zip: application/x-zip

How to connect to Oracle database from within shell script?
You will be using the same [sqlplus] command to connect to database that you use normally even outside the shell script. To understand this, let's take an example. In this example, we will connect to database, fire a query and get the output printed from the unix shell. Ok? Here we go –
$>res=`sqlplus -s username/password@database_name <<EOF
SET HEAD OFF;
select count(*) from dual;
EXIT;
EOF`
$> echo $res
1
If you connect to database in this method, the advantage is – you will be able to pass Unix side shell variables value to the database. See below:
$>res=`sqlplus -s username/password@database_name <<EOF
SET HEAD OFF;
select count(*) from customer where last_name='$1';
EXIT;
EOF`
$> echo $res
12

How to execute a database stored procedure from Shell script?
$> SqlReturnMsg=`sqlplus -s username/password@database <<EOF
BEGIN
Proc_Your_Procedure(… your-input-parameters …);
END;
/
EXIT;
EOF`
$> echo $SqlReturnMsg

How to check the command line arguments in a UNIX command in Shell Script?
In a bash shell, you can access the command line arguments using $0, $1, $2, … variables, where $0 prints the command name, $1 prints the first input parameter of the command, $2 the second input parameter of the command and so on.

How to fail a shell script programmatically?
Just put an [exit] command in the shell script with return value other than 0. this is because the exit codes of successful Unix programs is zero. So, suppose if you write
exit -1
inside your program, then your program will thrown an error and exit immediately.

How to list down file/folder lists alphabetically?
Normally [ls –lt] command lists down file/folder list sorted by modified time. If you want to list then alphabetically, then you should simply specify: [ls –l]

How to check if the last command was successful in Unix?
To check the status of last executed command in UNIX, you can check the value of an inbuilt bash variable [$?]. See the below example:
$> echo $?

How to check if a file is present in a particular directory in Unix?
Using command, we can do it in many ways. Based on what we have learnt so far, we can make use of [ls] and [$?] command to do this. See below:
$> ls –l file.txt; echo $?
If the file exists, the [ls] command will be successful. Hence [echo $?] will print 0. If the file does not exist, then [ls] command will fail and hence [echo $?] will print 1.

How to check all the running processes in Unix?
The standard command to see this is [ps]. But [ps] only shows you the snapshot of the processes at that instance. If you need to monitor the processes for a certain period of time and need to refresh the results in each interval, consider using the [top] command.
$> ps –ef
If you wish to see the % of memory usage and CPU usage, then consider the below switches
$> ps aux
If you wish to use this command inside some shell script, or if you want to customize the output of [ps] command, you may use “-o” switch like below. By using “-o” switch, you can specify the columns that you want [ps] to print out.
$>ps -e -o stime,user,pid,args,%mem,%cpu

How to tell if my process is running in Unix?
You can list down all the running processes using [ps] command. Then you can “grep” your user name or process name to see if the process is running. See below:
$>ps -e -o stime,user,pid,args,%mem,%cpu | grep "opera"
14:53 opera 29904 sleep 60                     0.0  0.0
14:54 opera 31536 ps -e -o stime,user,pid,arg  0.0  0.0
14:54 opera 31538 grep opera                0.0  0.0

How to get the CPU and Memory details in Linux server?
In Linux based systems, you can easily access the CPU and memory details from the /proc/cpuinfo and /proc/meminfo, like this:
$>cat /proc/meminfo
$>cat /proc/cpuinfo
Just try the above commands in your system to see how it works.



How to display top 10 users Who | head -10 | wc –w?
who | head -10
ls -lrt |tail -10
 
What is ls -ltd?
ls - ltd  
current directory information.
Option l: Long listing
       t: Time Stamp
       d: Information about current directory 
          (Used with -l Option)

posted by Jeyakumar @ Friday, December 02, 2011, ,


Difference between Mapping and Mapplet


Mapping
Mapplet

Mapping is collection of source, target and transformation

Mapplet is only collection of transformation

Mapping is developed with different transformation but not reusable.
Mapplet can be reused other mapping and also mapplet.
Mapping is developed for what data move to target, what modification done upon that.
Mapplet is developed for complex calculation used in multiple mappings.

posted by Jeyakumar @ Friday, December 02, 2011, ,


Difference between Source Qualifier and Lookup Transformation


Source Qualifier
Lookup

Active Transformation

Passive Transformation

In source qualifier it will push all the matching records.
Where as in lookup we can restrict whether to display first value, last value or any value
In source qualifier there is no concept of cache.
Where as in lookup we concentrate on cache concept.
When both source and lookup are in same database we can use source qualifier.
When the source and lookup table exists in different database then we need to use lookup.

posted by Jeyakumar @ Friday, December 02, 2011, ,


Difference between Router Transformation and Filter Transformation

Router
Filter
Single Condition
Multiple Condition
Send to multiple target
Send to single target
It captured the rejected records in default o/p group.
It does not capture the rejected records.
Two groups
User defined group
Default Group
Does not have any group in filter

posted by Jeyakumar @ Friday, December 02, 2011, ,


Logical Data Modeling vs Physical Data Modeling


Logical Data Model
Physical Data Model
Represents business information and defines business rules
Represents the physical implementation of the model in a database.
Entity
Table
Attribute
Column
Primary Key
Primary Key Constraint
Alternate Key
Unique Constraint or Unique Index
Inversion Key Entry
Non Unique Index
Rule
Check Constraint, Default Value
Relationship
Foreign Key
Definition
Comment

posted by Jeyakumar @ Friday, December 02, 2011, ,


Different Between OLTP and OLAP


OLTP
OLAP
Application Oriented (e.g., purchase order it is functionality of an application)
Subject Oriented (subject in the sense customer, product, item, time)
Used to run business
Used to analyze business
Detailed data   
Summarized data
Repetitive access
Ad-hoc access
Few Records accessed at a time (tens), simple query
Large volumes accessed at a time(millions), complex query
Small database
Large Database
Current data
Historical data
Clerical User
Knowledge User
Row by Row Loading
Bulk Loading
Time invariant
Time variant
Normalized data
De-normalized data
E – R schema
Star schema

posted by Jeyakumar @ Friday, December 02, 2011, ,


Difference between Joiner Transformation and Lookup Transformation


Joiner

Lookup

Active Transformation

Passive Transformation

In joiner we cannot configure to use persistence cache, shared cache, uncached and dynamic cache
Where as in lookup we can configure to use persistence cache, shared cache, uncached and dynamic cache.
We cannot override the query in joiner
We can override the query in lookup to fetch the data from multiple tables.
We can perform outer join in joiner transformation.
We cannot perform outer join in lookup transformation.

Support Equi Join Only

Support Equi Join and Non Equi Join

Joiner used only as source

Lkp used as source and target

In Joiner on multiple matches it will return all matching records

In Lkp it will return either first record or last record or any value or error value

posted by Jeyakumar @ Friday, December 02, 2011, ,


Difference between Data Mart and Data Warehouse


Data Mart
Data Warehouse
Data mart is usually sponsored at the department level and developed with a specific issue or subject in mind, a data mart is a data warehouse with a focused objective.
Data warehouse is a “Subject-Oriented, Integrated, Time-Variant, Nonvolatile collection of data in support of decision making”.
A data mart is used on a business division/ department level.
A data warehouse is used on an enterprise level
A Data Mart is a subset of data from a Data Warehouse. Data Marts are built for specific user groups.
A Data Warehouse is simply an integrated consolidation of data from a variety of sources that is specially designed to support strategic and tactical decision making.
By providing decision makers with only a subset of data from the Data Warehouse, Privacy, Performance and Clarity Objectives can be attained.
The main objective of Data Warehouse is to provide an integrated environment and coherent picture of the business at a point in time.

posted by Jeyakumar @ Friday, December 02, 2011, ,


Difference between Star Schema and Snow Flake Schema


Star Schema
Snow Flake Schema
The star schema is the simplest data warehouse scheme.
Snowflake schema is a more complex data warehouse model than a star schema.
In star schema each of the dimensions is represented in a single table .It should not have any hierarchies between dims.
In snow flake schema at least one hierarchy should exists between dimension tables.
It contains a fact table surrounded by dimension tables. If the dimensions are de-normalized, we say it is a star schema design.
It contains a fact table surrounded by dimension tables. If a dimension is normalized, we say it is a snow flaked design.
In star schema only one join establishes the relationship between the fact table and any one of the dimension tables.
In snow flake schema since there is relationship between the dimensions tables it has to do many joins to fetch the data.
A star schema optimizes the performance by keeping queries simple and providing fast response time. All the information about the each level is stored in one row.
Snowflake schemas normalize dimensions to eliminated redundancy. The result is more complex queries and reduced query performance.
It is called a star schema because the diagram resembles a star.
It is called a snowflake schema because the diagram resembles a snowflake.

posted by Jeyakumar @ Friday, December 02, 2011, ,