Distribution of data
For calculating the right execution-plan Oracle often needs the distribution of values for a column. For an outlier value a index can easily be used. For a value that is present in half the rows a index might be less effective.
For a similar exercise I needed the distribution of values myself. I needed to know how the values in a column are distributed. For testing purposes I created a a table with a column populated with a normal distribution, and a column with a “true” random distributed column:
SQL> create table distribution (n1 number, n2 number); Table created. SQL> insert into distribution with engine as (select rownum l from dual connect by level <=500000) select round(dbms_random.normal, 1)+20, round(dbms_random.random*100,1) from engine; 500000 rows created. SQL> commit; Commit complete. SQL>
Then I created a script distri.sh
-- Name : distri.sql -- Author : Martijn Bos -- Input : - -- Description : Shows the distribution of a (numeric) data -- Somewhat like a histogram -- --------------------------------------------------------------------- -- 2024-03-23 : MB : Initial Version -- Usage : -- distri <table_name> <column_name> -- @distri engines weight -- Shows the distribution of the values in the weight colum of the table engine -- @distri mechanic.engines length -- Shows the distribution of the values in the length column of the table engines in schema mechanic set lines 200 col graph for a25 define _TABLE=&1 define _COLUMN=&2 with engine as ( select floor(distribution.&_COLUMN/buckets.bucket_width) bucket, -- number of the bucket floor(distribution.&_COLUMN/buckets.bucket_width)*buckets.bucket_width min_value, -- lower boundary for the values in this bucket floor((distribution.&_COLUMN/buckets.bucket_width)+1)*buckets.bucket_width max_value, -- upper boundary for the values in this bucket count(*) freq, -- Number of items in the bucket buckets.nov nov -- total number of values from ( select max(&_COLUMN)/20 bucket_width, -- width of a bucket (for now we user 20 buckets) count(&_COLUMN) nov -- number of values from distribution ) buckets, &_TABLE distribution group by floor(distribution.&_COLUMN/buckets.bucket_width), floor(distribution.&_COLUMN/buckets.bucket_width)*buckets.bucket_width, floor((distribution.&_COLUMN/buckets.bucket_width)+1)*buckets.bucket_width, nov ) select engine.bucket, engine.min_value, engine.max_value, engine.freq, round((100*engine.freq/engine.nov),0) perc, rpad('|',round((100*engine.freq/max_freq.max_freq)/5,0),'-') graph from engine, ( select max(freq) max_freq from engine ) max_freq order by engine.bucket /
Let’s see how this works out:
SQL> @distri distribution n1
BUCKET MIN_VALUE MAX_VALUE FREQ PERC GRAPH
---------- ---------- ---------- ---------- ---------- -------------------------
12 14.88 16.12 28 0
13 16.12 17.36 1980 0
14 17.36 18.6 34611 7 |--
15 18.6 19.84 183758 37 |-----------------
16 19.84 21.08 206451 41 |-------------------
17 21.08 22.32 68438 14 |------
18 22.32 23.56 4637 1
19 23.56 24.8 96 0
20 24.8 26.04 1 0
9 rows selected.
SQL> @distri distribution n2
BUCKET MIN_VALUE MAX_VALUE FREQ PERC GRAPH
---------- ---------- ---------- ---------- ---------- -------------------------
-21 -2.255E+11 -2.147E+11 5 0
-20 -2.147E+11 -2.040E+11 12611 3 |-------------------
-19 -2.040E+11 -1.933E+11 12665 3 |-------------------
-18 -1.933E+11 -1.825E+11 12233 2 |------------------
-17 -1.825E+11 -1.718E+11 12547 3 |-------------------
-16 -1.718E+11 -1.611E+11 12258 2 |------------------
-15 -1.611E+11 -1.503E+11 12375 2 |------------------
-14 -1.503E+11 -1.396E+11 12470 2 |-------------------
-13 -1.396E+11 -1.288E+11 12339 2 |------------------
-12 -1.288E+11 -1.181E+11 12595 3 |-------------------
-11 -1.181E+11 -1.074E+11 12596 3 |-------------------
-10 -1.074E+11 -9.664E+10 12522 3 |-------------------
-9 -9.664E+10 -8.590E+10 12210 2 |------------------
-8 -8.590E+10 -7.516E+10 12646 3 |-------------------
-7 -7.516E+10 -6.442E+10 12413 2 |------------------
-6 -6.442E+10 -5.369E+10 12384 2 |------------------
-5 -5.369E+10 -4.295E+10 12425 2 |-------------------
-4 -4.295E+10 -3.221E+10 12512 3 |-------------------
-3 -3.221E+10 -2.147E+10 12522 3 |-------------------
-2 -2.147E+10 -1.074E+10 12621 3 |-------------------
-1 -1.074E+10 0 12582 3 |-------------------
0 0 1.0737E+10 12538 3 |-------------------
1 1.0737E+10 2.1474E+10 12536 3 |-------------------
2 2.1474E+10 3.2212E+10 12491 2 |-------------------
3 3.2212E+10 4.2949E+10 12552 3 |-------------------
4 4.2949E+10 5.3686E+10 12569 3 |-------------------
5 5.3686E+10 6.4423E+10 12445 2 |-------------------
6 6.4423E+10 7.5161E+10 12734 3 |-------------------
7 7.5161E+10 8.5898E+10 12355 2 |------------------
8 8.5898E+10 9.6635E+10 12431 2 |-------------------
9 9.6635E+10 1.0737E+11 12573 3 |-------------------
10 1.0737E+11 1.1811E+11 12521 3 |-------------------
11 1.1811E+11 1.2885E+11 12701 3 |-------------------
12 1.2885E+11 1.3958E+11 12374 2 |------------------
13 1.3958E+11 1.5032E+11 12539 3 |-------------------
14 1.5032E+11 1.6106E+11 12485 2 |-------------------
15 1.6106E+11 1.7180E+11 12477 2 |-------------------
16 1.7180E+11 1.8253E+11 12643 3 |-------------------
17 1.8253E+11 1.9327E+11 12640 3 |-------------------
18 1.9327E+11 2.0401E+11 12355 2 |------------------
19 2.0401E+11 2.1474E+11 12509 3 |-------------------
20 2.1474E+11 2.2548E+11 1 0
42 rows selected.
SQL>
As you can see, we now have a somewhat visual idea of how data is distributed. Take care, this script works only on numeric values