{"id":579,"date":"2024-03-23T15:58:58","date_gmt":"2024-03-23T15:58:58","guid":{"rendered":"https:\/\/maboc.nl\/?p=579"},"modified":"2024-03-23T15:58:58","modified_gmt":"2024-03-23T15:58:58","slug":"distribution-of-values-in-a-column","status":"publish","type":"post","link":"https:\/\/maboc.nl\/?p=579","title":{"rendered":"Distribution of values in a column"},"content":{"rendered":"<p>Distribution of data<\/p>\n<p>For calculating the right execution-plan Oracle often needs the distribution of values for a column. For an outlier value a index can easily be used. For a value that is present in half the rows a index might be less effective.<\/p>\n<p>For a similar exercise I needed the distribution of values myself. I needed to know how the values in a column are distributed. For testing purposes I created a a table with a column populated with a normal distribution, and a column with a &#8220;true&#8221; random distributed column:<\/p>\n<pre><span style=\"font-size: 8pt;\">SQL&gt; create table distribution (n1 number, n2 number);<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">Table created.<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">SQL&gt; insert into distribution <\/span>\r\n<span style=\"font-size: 8pt;\">       with engine as (select rownum l from dual connect by level &lt;=500000)<\/span>\r\n<span style=\"font-size: 8pt;\">       select round(dbms_random.normal, 1)+20,\r\n              round(dbms_random.random*100,1)\r\n       from   engine;<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">500000 rows created.<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">SQL&gt; commit;<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">Commit complete.<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">SQL&gt;<\/span><\/pre>\n<p>Then I created a script <a href=\"https:\/\/github.com\/maboc\/sql\/blob\/master\/oracle\/distri.sql\">distri.sh<\/a><\/p>\n<pre><span style=\"font-size: 8pt;\">-- Name        : distri.sql<\/span>\r\n<span style=\"font-size: 8pt;\">-- Author      : Martijn Bos<\/span>\r\n<span style=\"font-size: 8pt;\">-- Input       : -<\/span>\r\n<span style=\"font-size: 8pt;\">-- Description : Shows the distribution of a (numeric) data<\/span>\r\n<span style=\"font-size: 8pt;\">--               Somewhat like a histogram<\/span>\r\n<span style=\"font-size: 8pt;\">-- ---------------------------------------------------------------------<\/span>\r\n<span style=\"font-size: 8pt;\">-- 2024-03-23 : MB : Initial Version<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">-- Usage :<\/span>\r\n<span style=\"font-size: 8pt;\">-- distri &lt;table_name&gt; &lt;column_name&gt;<\/span>\r\n<span style=\"font-size: 8pt;\">--   @distri engines weight<\/span>\r\n<span style=\"font-size: 8pt;\">--     Shows the distribution of the values in the weight colum of the table engine<\/span>\r\n<span style=\"font-size: 8pt;\">--   @distri mechanic.engines length<\/span>\r\n<span style=\"font-size: 8pt;\">--     Shows the distribution of the values in the length column of the table engines in schema mechanic<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">set lines 200<\/span>\r\n<span style=\"font-size: 8pt;\">col graph for a25<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">define _TABLE=&amp;1<\/span>\r\n<span style=\"font-size: 8pt;\">define _COLUMN=&amp;2<\/span>\r\n\r\n<span style=\"font-size: 8pt;\">with engine as<\/span>\r\n<span style=\"font-size: 8pt;\">  (<\/span>\r\n<span style=\"font-size: 8pt;\">    select   floor(distribution.&amp;_COLUMN\/buckets.bucket_width) bucket,                             -- number of the bucket<\/span>\r\n<span style=\"font-size: 8pt;\">             floor(distribution.&amp;_COLUMN\/buckets.bucket_width)*buckets.bucket_width min_value,     -- lower boundary for the values in this bucket<\/span>\r\n<span style=\"font-size: 8pt;\">             floor((distribution.&amp;_COLUMN\/buckets.bucket_width)+1)*buckets.bucket_width max_value, -- upper boundary for the values in this bucket<\/span>\r\n<span style=\"font-size: 8pt;\">             count(*) freq, -- Number of items in the bucket<\/span>\r\n<span style=\"font-size: 8pt;\">             buckets.nov nov -- total number of values<\/span>\r\n<span style=\"font-size: 8pt;\">    from     (<\/span>\r\n<span style=\"font-size: 8pt;\">               select max(&amp;_COLUMN)\/20 bucket_width, -- width of a bucket (for now we user 20 buckets)<\/span>\r\n<span style=\"font-size: 8pt;\">                      count(&amp;_COLUMN) nov -- number of values<\/span>\r\n<span style=\"font-size: 8pt;\">               from   distribution<\/span>\r\n<span style=\"font-size: 8pt;\">             ) buckets,<\/span>\r\n<span style=\"font-size: 8pt;\">             &amp;_TABLE distribution<\/span>\r\n<span style=\"font-size: 8pt;\">    group by floor(distribution.&amp;_COLUMN\/buckets.bucket_width),<\/span>\r\n<span style=\"font-size: 8pt;\">             floor(distribution.&amp;_COLUMN\/buckets.bucket_width)*buckets.bucket_width,<\/span>\r\n<span style=\"font-size: 8pt;\">             floor((distribution.&amp;_COLUMN\/buckets.bucket_width)+1)*buckets.bucket_width,<\/span>\r\n<span style=\"font-size: 8pt;\">             nov<\/span>\r\n<span style=\"font-size: 8pt;\">  )<\/span>\r\n<span style=\"font-size: 8pt;\">select   engine.bucket,<\/span>\r\n<span style=\"font-size: 8pt;\">         engine.min_value,<\/span>\r\n<span style=\"font-size: 8pt;\">         engine.max_value,<\/span>\r\n<span style=\"font-size: 8pt;\">         engine.freq,<\/span>\r\n<span style=\"font-size: 8pt;\">         round((100*engine.freq\/engine.nov),0) perc,<\/span>\r\n<span style=\"font-size: 8pt;\">         rpad('|',round((100*engine.freq\/max_freq.max_freq)\/5,0),'-') graph<\/span>\r\n<span style=\"font-size: 8pt;\">from     engine,<\/span>\r\n<span style=\"font-size: 8pt;\">         (<\/span>\r\n<span style=\"font-size: 8pt;\">           select max(freq) max_freq<\/span>\r\n<span style=\"font-size: 8pt;\">           from   engine<\/span>\r\n<span style=\"font-size: 8pt;\">         ) max_freq<\/span>\r\n<span style=\"font-size: 8pt;\">order by engine.bucket<\/span>\r\n<span style=\"font-size: 8pt;\">\/<\/span><\/pre>\n<p>Let&#8217;s see how this works out:<\/p>\n<pre><span style=\"font-size: 8pt;\">SQL&gt; @distri distribution n1\r\n\r\n    BUCKET  MIN_VALUE  MAX_VALUE       FREQ       PERC GRAPH\r\n---------- ---------- ---------- ---------- ---------- -------------------------\r\n        12      14.88      16.12         28          0\r\n        13      16.12      17.36       1980          0\r\n        14      17.36       18.6      34611          7 |--\r\n        15       18.6      19.84     183758         37 |-----------------\r\n        16      19.84      21.08     206451         41 |-------------------\r\n        17      21.08      22.32      68438         14 |------\r\n        18      22.32      23.56       4637          1\r\n        19      23.56       24.8         96          0\r\n        20       24.8      26.04          1          0\r\n\r\n9 rows selected.\r\n\r\nSQL&gt; @distri distribution n2\r\n\r\n    BUCKET  MIN_VALUE MAX_VALUE       FREQ       PERC GRAPH\r\n---------- ---------- ---------- ---------- ---------- -------------------------\r\n       -21 -2.255E+11 -2.147E+11          5          0\r\n       -20 -2.147E+11 -2.040E+11 12611    3 |-------------------\r\n       -19 -2.040E+11 -1.933E+11 12665    3 |-------------------\r\n       -18 -1.933E+11 -1.825E+11 12233    2 |------------------\r\n       -17 -1.825E+11 -1.718E+11 12547    3 |-------------------\r\n       -16 -1.718E+11 -1.611E+11 12258    2 |------------------\r\n       -15 -1.611E+11 -1.503E+11 12375    2 |------------------\r\n       -14 -1.503E+11 -1.396E+11 12470    2 |-------------------\r\n       -13 -1.396E+11 -1.288E+11 12339    2 |------------------\r\n       -12 -1.288E+11 -1.181E+11 12595    3 |-------------------\r\n       -11 -1.181E+11 -1.074E+11 12596    3 |-------------------\r\n       -10 -1.074E+11 -9.664E+10 12522    3 |-------------------\r\n        -9 -9.664E+10 -8.590E+10 12210    2 |------------------\r\n        -8 -8.590E+10 -7.516E+10 12646    3 |-------------------\r\n        -7 -7.516E+10 -6.442E+10 12413    2 |------------------\r\n        -6 -6.442E+10 -5.369E+10 12384    2 |------------------\r\n        -5 -5.369E+10 -4.295E+10 12425    2 |-------------------\r\n        -4 -4.295E+10 -3.221E+10 12512    3 |-------------------\r\n        -3 -3.221E+10 -2.147E+10 12522    3 |-------------------\r\n        -2 -2.147E+10 -1.074E+10 12621    3 |-------------------\r\n        -1 -1.074E+10          0 12582    3 |-------------------\r\n         0          0 1.0737E+10 12538    3 |-------------------\r\n         1 1.0737E+10 2.1474E+10 12536    3 |-------------------\r\n         2 2.1474E+10 3.2212E+10 12491    2 |-------------------\r\n         3 3.2212E+10 4.2949E+10 12552    3 |-------------------\r\n         4 4.2949E+10 5.3686E+10 12569    3 |-------------------\r\n         5 5.3686E+10 6.4423E+10 12445    2 |-------------------\r\n         6 6.4423E+10 7.5161E+10 12734    3 |-------------------\r\n         7 7.5161E+10 8.5898E+10 12355    2 |------------------\r\n         8 8.5898E+10 9.6635E+10 12431    2 |-------------------\r\n         9 9.6635E+10 1.0737E+11 12573    3 |-------------------\r\n        10 1.0737E+11 1.1811E+11 12521    3 |-------------------\r\n        11 1.1811E+11 1.2885E+11 12701    3 |-------------------\r\n        12 1.2885E+11 1.3958E+11 12374    2 |------------------\r\n        13 1.3958E+11 1.5032E+11 12539    3 |-------------------\r\n        14 1.5032E+11 1.6106E+11 12485    2 |-------------------\r\n        15 1.6106E+11 1.7180E+11 12477    2 |-------------------\r\n        16 1.7180E+11 1.8253E+11 12643    3 |-------------------\r\n        17 1.8253E+11 1.9327E+11 12640    3 |-------------------\r\n        18 1.9327E+11 2.0401E+11 12355    2 |------------------\r\n        19 2.0401E+11 2.1474E+11 12509    3 |-------------------\r\n        20 2.1474E+11 2.2548E+11     1    0\r\n\r\n42 rows selected.\r\n\r\nSQL&gt;<\/span><\/pre>\n<p>As you can see, we now have a somewhat visual idea of how data is distributed. Take care, this script works only on numeric values<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Distribution of data For calculating the right execution-plan Oracle often needs the distribution of values for a column. For an outlier value a index can easily be used. For a value that is present in half the rows a index might be less effective. For a similar exercise I needed the distribution of values myself. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,49],"tags":[96,24,14,76],"class_list":["post-579","post","type-post","status-publish","format-standard","hentry","category-oracle","category-sql","tag-distribution","tag-oracle","tag-sql","tag-sqlplus"],"_links":{"self":[{"href":"https:\/\/maboc.nl\/index.php?rest_route=\/wp\/v2\/posts\/579","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/maboc.nl\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/maboc.nl\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/maboc.nl\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/maboc.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=579"}],"version-history":[{"count":11,"href":"https:\/\/maboc.nl\/index.php?rest_route=\/wp\/v2\/posts\/579\/revisions"}],"predecessor-version":[{"id":590,"href":"https:\/\/maboc.nl\/index.php?rest_route=\/wp\/v2\/posts\/579\/revisions\/590"}],"wp:attachment":[{"href":"https:\/\/maboc.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=579"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/maboc.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=579"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/maboc.nl\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=579"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}