Category Archives: database

RDD vs Dataframe

DataFrame, is like a table in database. It has schema to describe the data. We can easily manipulate 2 dataframes just like sql in relational database. You can do groupBy, count, sort, join, where. RDD, resilient distributed dataset. collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.… Read More »

Data Warehouse Type 2 slowly change dimension in kettle

In kettle, we can implement type 2 slowly change dimension easily by using “Dimension lookup/update”. We have below table in a CSV file: COUNTRY_ID,COUNTRY_NAME,REGION,LAST_UPDATE_DATE 11,Baharian,A,2015-02-15 12,China,Beijing,2013-02-15 13,USA,AZ,2014-02-20 14,CUBA,CIUDAD,2011-02-19 15,USA,Los Angeles,2011-02-18 16,Japan,TOKYO,2011-02-15 We want to put into table with below structure: First, we have 2 elements: Text file input, Dimension lookup/update. Text file input configuration: Dimension… Read More »


In Oracle, we can easily get the DDL clause. In mysql, we can use SHOW CREATE TABLE. In SqlServer, there is no similar command. However, we can get data type by below command: SELECT name, system_type_name, is_nullable FROM sys.dm_exec_describe_first_result_set(‘SELECT * FROM table‘, NULL, 0)

Intensive write: Mysql vs Redis

This is another test on mysql and redis write speed. The test is easy. Keep updating a counter in mysql and redis for 10s. And see how many times it can update. Mysql uses InnoDB engine. Each update commits. Result shows that redis has better update performance. A redis server can handle up to 40K update… Read More »

Difference between Varray and Nested Table in PL/SQL

In PL/SQL, there are 3 types of collections: Nested Table, Varray and Associative Array. Associative Array is like HashMap in Java which is easy to understand. Nested Table and Varray are like array in java. They are confusing for java developers. Below are their differences: Varray It has max size which is defined in type. It can’t extend… Read More »

Select distinct max in sql

Solution: SELECT t3.`id`, max_table3.max_type FROM table3 t3 JOIN (SELECT id, MAX(`type`) max_type FROM table3 GROUP BY id) max_table3 ON max_table3.max_type = t3.`type` AND = t3.`id` reference:

Mysql Million rows test: insertion

These days, I started to use mysql to query table with million rows. In order to get to know mysql performance better. I did this test. Basically, I created different tables with different type/number of indices. I tried to insert as many rows as possible within a certain time. And see the result. Environment: Windows… Read More »

Mysql Million rows test: select, join

These days, I started to use mysql to query table with million rows. In order to get to know mysql performance better. I did this test. Basically, I have 2 big tables and 1 small table. Each big table has 100M items. The small table has 1M items. I tested select, join operations among them… Read More »

Solve the problem that hive is pending for a long time for MapReduce select

I have partition_table in Hive. It run the select command well: Hive>select * from partition_table; When I ran Hive>select count(*) from partition_table; Starting Job = job_1414213419655_0001, Tracking URL = http://centmaster:8088/proxy/application_1414213419655_0001/ Kill Command = /home/hadoop/hadoop-2.3.0/bin/hadoop job  -kill job_1414213419655_0001 After that, Hive is pending there for 40 minutes, until I pressed Ctrl+c to stop it. I checked… Read More »


Before this, the Hive should be correctly installed. This not only means that you can enter hive, but also your hive can interact with the mysql you configured. 1. Jars needed. It’s better you import all .jar files in /Hive/lib. Besides, you need hadoop-common-2.3.0.jar and slf4j-api-1.6.6.jar.Include them: activation-1.1.jar ant-1.9.1.jar ant-launcher-1.9.1.jar antlr-2.7.7.jar antlr-runtime-3.4.jar asm-commons-3.1.jar asm-tree-3.1.jar avro-1.7.5.jar… Read More »