<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.exploringbinary.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>Exploring Binary</title>
	
	<link>http://www.exploringbinary.com</link>
	<description>Binary Numbers, Binary Code, and Binary Logic</description>
	<lastBuildDate>Tue, 31 Jan 2012 16:38:56 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.exploringbinary.com/ExploringBinary" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="exploringbinary" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">ExploringBinary</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Binary Addition</title>
		<link>http://www.exploringbinary.com/binary-addition/</link>
		<comments>http://www.exploringbinary.com/binary-addition/#comments</comments>
		<pubDate>Mon, 30 Jan 2012 18:15:19 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Binary numbers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Binary integers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Convert to decimal]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=358</guid>
		<description><![CDATA[This is the first of a four part series on binary arithmetic, which I&#8217;m writing as a supplement to my binary calculator. This article introduces binary arithmetic, and then discusses binary addition. Binary Arithmetic Binary arithmetic is of interest because that&#8217;s how computers do math. When implemented in computers, many things must be taken into [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/binary-addition/">Binary Addition</a></p>
]]></description>
			<content:encoded><![CDATA[<p><em>This is the first of a four part series on binary arithmetic, which I&#8217;m writing as a supplement to my <a title="Rick Regan's &ldquo;Binary Calculator&rdquo;" href="http://www.exploringbinary.com/binary-calculator/">binary calculator</a>. This article introduces binary arithmetic, and then discusses binary addition.</em></p>
<div class="wp-caption aligncenter" style="width: 203px"><img src="http://www.exploringbinary.com/wp-content/uploads/BinaryAdditionExample.png" alt="An Example of Binary Addition" width="193" height="159"/><p class="wp-caption-text">Example of Binary Addition</p></div>
<p><span id="more-358"></span></p>
<h2>Binary Arithmetic</h2>
<p>Binary arithmetic is of interest because that&#8217;s how computers do math. When implemented in computers, many things must be taken into account: format (fixed-point, floating-point, etc.), word size (8-bit, 16-bit, 32-bit, etc.), sign representation (sign-magnitude, ones&#8217; complement, two&#8217;s complement, etc.), overflow (when numbers are too big), and underflow (when numbers are too small). Furthermore, there is the design of the circuits to perform the algorithms (full adders, half adders, ripple carry adders, carry-lookahead adders, restoring dividers, non-restoring dividers, etc.). These are not inherent properties of binary arithmetic &#8212; they&#8217;re just implementation issues.</p>
<p>The truth is, in its purest form, binary arithmetic is very simple. The same &ldquo;pencil and paper&rdquo; algorithms you learned in grade school work for binary numbers. All you do differently is apply the &ldquo;facts&rdquo; for binary numerals, the (small) set of rules for manipulating 0s and 1s. And binary numbers on paper are written as you&#8217;d expect: without leading zeros, and with a minus sign (&lsquo;-&rsquo;) if negative.</p>
<p>In my series of articles, I will explain the pure forms of the four basic operations of binary arithmetic: addition, subtraction, multiplication, and division. In this first article, I will discuss binary addition.</p>
<h2>Decimal Addition</h2>
<p>To add two multiple-digit decimal numbers, you first need to know how to add two single-digit decimal numbers. This requires the memorization of 100 facts, or 55 facts if you exclude the commutative or &ldquo;turnaround&rdquo; facts. Also, because of carries, you need to know ten additional facts: 10 + 0 = 10, 10 + 1 = 11, &#8230; , 10 + 9 = 19. The latter apply when there&#8217;s a carry (always 1) and the &ldquo;top&rdquo; digit is 9.</p>
<table class="center" border="1" summary="109 Decimal Addition Facts">
<caption><strong>Decimal Addition Facts</strong></caption>
<tbody>
<tr>
<th class="center">+</th>
<th class="center internal_border_bottom">0</th>
<th class="center internal_border_bottom">1</th>
<th class="center internal_border_bottom">2</th>
<th class="center internal_border_bottom">3</th>
<th class="center internal_border_bottom">4</th>
<th class="center internal_border_bottom">5</th>
<th class="center internal_border_bottom">6</th>
<th class="center internal_border_bottom">7</th>
<th class="center internal_border_bottom">8</th>
<th class="center internal_border_bottom">9</th>
<th class="center internal_border_bottom">10</th>
</tr>
<tr>
<td class="center internal_border_right"><strong>0</strong></td>
<td class="center">0</td>
<td class="center grayed">1</td>
<td class="center grayed">2</td>
<td class="center grayed">3</td>
<td class="center grayed">4</td>
<td class="center grayed">5</td>
<td class="center grayed">6</td>
<td class="center grayed">7</td>
<td class="center grayed">8</td>
<td class="center grayed">9</td>
<td class="center grayed">10</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>1</strong></td>
<td class="center">1</td>
<td class="center">2</td>
<td class="center grayed">3</td>
<td class="center grayed">4</td>
<td class="center grayed">5</td>
<td class="center grayed">6</td>
<td class="center grayed">7</td>
<td class="center grayed">8</td>
<td class="center grayed">9</td>
<td class="center grayed">10</td>
<td class="center grayed">11</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>2</strong></td>
<td class="center">2</td>
<td class="center">3</td>
<td class="center">4</td>
<td class="center grayed">5</td>
<td class="center grayed">6</td>
<td class="center grayed">7</td>
<td class="center grayed">8</td>
<td class="center grayed">9</td>
<td class="center grayed">10</td>
<td class="center grayed">11</td>
<td class="center grayed">12</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>3</strong></td>
<td class="center">3</td>
<td class="center">4</td>
<td class="center">5</td>
<td class="center">6</td>
<td class="center grayed">7</td>
<td class="center grayed">8</td>
<td class="center grayed">9</td>
<td class="center grayed">10</td>
<td class="center grayed">11</td>
<td class="center grayed">12</td>
<td class="center grayed">13</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>4</strong></td>
<td class="center">4</td>
<td class="center">5</td>
<td class="center">6</td>
<td class="center">7</td>
<td class="center">8</td>
<td class="center grayed">9</td>
<td class="center grayed">10</td>
<td class="center grayed">11</td>
<td class="center grayed">12</td>
<td class="center grayed">13</td>
<td class="center grayed">14</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>5</strong></td>
<td class="center">5</td>
<td class="center">6</td>
<td class="center">7</td>
<td class="center">8</td>
<td class="center">9</td>
<td class="center">10</td>
<td class="center grayed">11</td>
<td class="center grayed">12</td>
<td class="center grayed">13</td>
<td class="center grayed">14</td>
<td class="center grayed">15</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>6</strong></td>
<td class="center">6</td>
<td class="center">7</td>
<td class="center">8</td>
<td class="center">9</td>
<td class="center">10</td>
<td class="center">11</td>
<td class="center">12</td>
<td class="center grayed">13</td>
<td class="center grayed">14</td>
<td class="center grayed">15</td>
<td class="center grayed">16</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>7</strong></td>
<td class="center">7</td>
<td class="center">8</td>
<td class="center">9</td>
<td class="center">10</td>
<td class="center">11</td>
<td class="center">12</td>
<td class="center">13</td>
<td class="center">14</td>
<td class="center grayed">15</td>
<td class="center grayed">16</td>
<td class="center grayed">17</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>8</strong></td>
<td class="center">8</td>
<td class="center">9</td>
<td class="center">10</td>
<td class="center">11</td>
<td class="center">12</td>
<td class="center">13</td>
<td class="center">14</td>
<td class="center">15</td>
<td class="center">16</td>
<td class="center grayed">17</td>
<td class="center grayed">18</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>9</strong></td>
<td class="center">9</td>
<td class="center">10</td>
<td class="center">11</td>
<td class="center">12</td>
<td class="center">13</td>
<td class="center">14</td>
<td class="center">15</td>
<td class="center">16</td>
<td class="center">17</td>
<td class="center">18</td>
<td class="center grayed">19</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>10</strong></td>
<td class="center">10</td>
<td class="center">11</td>
<td class="center">12</td>
<td class="center">13</td>
<td class="center">14</td>
<td class="center">15</td>
<td class="center">16</td>
<td class="center">17</td>
<td class="center">18</td>
<td class="center">19</td>
<td class="center">-</td>
</tr>
</tbody>
</table>
<p>For example, let&#8217;s add 19.7 and 12.8:</p>
<div class="wp-caption aligncenter" style="width: 127px"><img src="http://www.exploringbinary.com/wp-content/uploads/DecimalAdditionExample.png" alt="An Example of Decimal Addition" width="117" height="137"/><p class="wp-caption-text">Example of Decimal Addition</p></div>
<p>Let&#8217;s think about what a carry does. It turns our neat &#8220;add two single-digit numbers per column&#8221; problem into &#8220;add <em>three</em> single-digit numbers per column.&#8221; But addition is a binary (in a different sense) operation, meaning it operates on two numbers at a time. So what we do is add the carry to the &ldquo;top&rdquo; digit, and then add that result to the &ldquo;bottom&rdquo; digit. The first addition may result in a double digit answer (10), which means we&#8217;d have to use one of the ten extra facts to do the second addition.</p>
<p>In our example, the carry is added to 9, giving 10, and then 10 is added to 2, giving 12. The second addition used the extra &#8220;10 + 2 = 12&#8243; fact.</p>
<div class="wp-caption aligncenter" style="width: 139px"><img src="http://www.exploringbinary.com/wp-content/uploads/DecimalAdditionExampleExtra.png" alt="Decimal Addition Showing Intermediate Additions" width="129" height="151"/><p class="wp-caption-text">Decimal Addition Showing Intermediate Additions</p></div>
<h2>Binary Addition</h2>
<p>Binary addition works the same way as decimal addition, except it uses a different &#8212; and much smaller &#8212; set of facts. There are only four single-digit facts, or three if you exclude the commutative fact. To handle carries, you need to know two additional facts, 10 + 0 = 10 and 10 + 1 = 11. (<a title="Read Rick Regan's Article &ldquo;There Are 10 Types of People …&rdquo;" href="http://www.exploringbinary.com/there-are-10-types-of-people/">Resist the urge to pronounce 10 as ten</a> and 11 as eleven; either call them one-zero and one-one, or two and three.) The latter apply when there’s a carry (always 1) and the “top” digit is 1.</p>
<table class="center" border="1" summary="8 Binary Addition Facts">
<caption><strong>Binary Addition Facts</strong></caption>
<tbody>
<tr>
<th class="center">+</th>
<th class="center internal_border_bottom">0</th>
<th class="center internal_border_bottom">1</th>
<th class="center internal_border_bottom">10</th>
</tr>
<tr>
<td class="center internal_border_right"><strong>0</strong></td>
<td class="center">0</td>
<td class="center grayed">1</td>
<td class="center grayed">10</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>1</strong></td>
<td class="center">1</td>
<td class="center">10</td>
<td class="center grayed">11</td>
</tr>
<tr>
<td class="center internal_border_right"><strong>10</strong></td>
<td class="center">10</td>
<td class="center">11</td>
<td class="center">-</td>
</tr>
</tbody>
</table>
<p>Of the five order-independent facts, two are the same as in decimal (0 + 0 = 0 and 1 + 0 = 1), and one is trivial since it just adds zero (10 + 0 = 10). <strong>This leaves only two facts to memorize: 1 + 1 = 10 and 10 + 1 = 11!</strong></p>
<p>To add two binary numbers, proceed as in decimal addition: </p>
<ul>
<li>If one or both numbers has a fractional part, line up the <a title="Wikipedia article on radix point" href="http://en.wikipedia.org/wiki/Radix_point">radix points</a>.</li>
<li>Proceeding from right to left, add the digits in each &ldquo;column,&rdquo; according to the facts table.</li>
<li>If the result has two-digits, write down the least significant digit; carry the most significant digit to the next column.</li>
</ul>
<p>Let&#8217;s return to the example in the introduction, 1011.01 + 11.011, this time showing it step-by-step:</p>
<div class="wp-caption aligncenter" style="width: 354px"><img src="http://www.exploringbinary.com/wp-content/uploads/BinaryAdditionExampleSteps.png" alt="Steps of Binary Addition" width="344" height="644"/><p class="wp-caption-text">Steps of Binary Addition</p></div>
<p>Eventually you won&#8217;t need to write down the intermediate additions, and you might even start doing &lsquo;1 + 1 + 1&rsquo; as one step.</p>
<h3>Checking the Answer</h3>
<p>You can verify the answer by <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">converting the operands to decimal</a>, doing decimal addition, and then <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">converting the decimal answer to binary</a> (of course you can do it that way period, avoiding binary arithmetic entirely!). 1011.01 = 11.25, and 11.011 = 3.375. They sum to 14.625, which is 1110.101, the answer we got using binary addition.</p>
<h2>Discussion</h2>
<p>We didn&#8217;t need to know the place value of columns, or that a carry represents a power of the base; the algorithm is base-independent. We <em>did</em> however need a base-dependent set of facts, which allowed us to manipulate binary numerals according to the basic rules of binary addition.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/binary-addition/">Binary Addition</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/binary-addition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bigcomp: Deciding Truncated, Near Halfway Conversions</title>
		<link>http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/</link>
		<comments>http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/#comments</comments>
		<pubDate>Fri, 02 Dec 2011 15:40:16 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Convert to decimal]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Exponents]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Fractions]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=353</guid>
		<description><![CDATA[In my article &#8220;Using Integers to Check a Floating-Point Approximation,&#8221; I briefly mentioned &#8220;bigcomp,&#8221; an optimization strtod() uses to reduce big integer overhead when checking long decimal inputs. bigcomp does a floating-point to decimal conversion &#8212; right in the middle of a decimal to floating-point conversion mind you &#8212; to generate the decimal expansion of [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/">Bigcomp: Deciding Truncated, Near Halfway Conversions</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my article &ldquo;<a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">Using Integers to Check a Floating-Point Approximation</a>,&rdquo; I briefly mentioned &ldquo;bigcomp,&rdquo; an optimization strtod() uses to reduce big integer overhead when checking long decimal inputs. bigcomp does a floating-point to decimal conversion &#8212; right in the middle of a decimal to floating-point conversion mind you &#8212; to generate the decimal expansion of the number halfway between two target floating-point numbers. This decimal expansion is compared to the input decimal string, and the result of the comparison dictates which of the two target numbers is the correctly rounded result.</p>
<p>In this article, I&#8217;ll explain how bigcomp works, and when it applies. Also, I&#8217;ll talk briefly about its performance; my informal testing shows that, under the default setting, bigcomp actually <em>worsens</em> performance for some inputs.</p>
<p><span id="more-353"></span></p>
<h2>When Bigcomp Applies</h2>
<p>Bigcomp is enabled by default in strtod() (it can be disabled by &ldquo;&#35;defining&rdquo; the variable &ldquo;NO_STRTOD_BIGCOMP&rdquo;). It applies to decimal strings longer than STRTOD_DIGLIM significant digits, which by default is &ldquo;&#35;defined&rdquo; to 40. For decimal inputs <em>dStr</em> of 40 significant digits or less, the <a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">normal approximation check</a> is done, using all significant digits of <em>dStr</em> for its big integer representation <em>d</em>; for inputs longer than 40 digits, only the first 18 digits are used for <em>d</em>. This truncation of <em>d</em> leads to smaller big integers, but it can lead to cases where the approximation check cannot decide between adjacent floating-point values. When that happens, the bigcomp procedure kicks in, through a call to the bigcomp() function.</p>
<p>Specifically, bigcomp() is called when the approximation check cannot decide between the approximation <em>b</em> and its successor, <em>b + u</em>. (As with all my articles in this series, I will only be discussing how strtod() works for round-to-nearest/round-half-to-even rounding.) This happens when the truncated value of <em>d</em> falls between <em>b</em> and <em>b + h</em>; that is, between <em>b</em> and one-half ULP above <em>b</em>.</p>
<p>(In this article, I distinguish between the input <em>dStr</em> and its big integer representation <em>d</em>, which may or may not be a truncation of <em>dStr</em>.) </p>
<h2>Overview of the Algorithm</h2>
<p>bigcomp() compares the significant digits of <em>dStr</em> to a string representation of the significant digits of <em>b + h</em>. It goes left-to-right, digit by significant digit, stopping when the two strings differ. (In the bulk of cases, the difference occurs between the 16th and 18th digits, although in the worst case the difference can occur at the 54th digit; it depends on how close the decimal number is to <em>b + h</em>.) If <em>dStr</em> is <em>less</em> than <em>b + h</em>, bigcomp() returns <em>b</em>; if <em>dStr</em> is <em>greater</em> than <em>b + h</em>, bigcomp() returns <em>b + u</em>;  if <em>dStr</em> equals <em>b + h</em>, either <em>b</em> or <em>b + u</em> is chosen, according to round-half-to-even rounding.</p>
<p>bigcomp() generates the significant digits of <em>b + h</em> using essentially the same algorithm as the dtoa() function, the floating-point to decimal conversion routine co-located in <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s dtoa.c</a>. It uses big integer arithmetic, including big integer <em>division</em>.</p>
<p>bigcomp() starts by creating a fraction that represents the significant digits of <em>b + h</em> (not <em>b + h</em> itself). In this form, it generates a digit by multiplying the numerator of the fraction by ten, and then using integer division to form a quotient and remainder. The quotient represents the next digit; the remainder represents the remaining digits, which are accessed one at a time by repeating this process. Digits are generated until there is a mismatch between <em>dStr</em> and <em>b + h</em>.</p>
<h2>The Algorithm, By Example</h2>
<p>I&#8217;ll break the algorithm into five steps, and illustrate it with two examples. I set STRTOD_DIGLIM to 19 so I could generate shorter examples (both are 20 digits).</p>
<h3>Example 1: Check 1.3694713649464322631e-11</h3>
<p>In this example, we want to check if <em>dStr</em> = 1.3694713649464322631e-11 is closer to <em>b</em> = 0&#120;1.e1d703bb5749cp-37 or <em>b + u</em> = 0&#120;1.e1d703bb5749dp-37. We do this by comparing <em>dStr</em> to <em>b + h</em> =  0&#120;1.e1d703bb5749c8p-37. (I describe <em>b + h</em> using a <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constant</a>, even though it won&#8217;t be represented as an IEEE floating-point number. This number has 54 bits, with bit 54 being a 1. It is <em>b</em> with a hexadecimal &lsquo;8&rsquo; tacked on at the end.)</p>
<h4>1. Represent <em>b + h</em> as an integer times a power of two</h4>
<p>We start by representing <em>b</em> as a 53-bit integer times a power of two, as we do in the <a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">basic approximation check</a>:</p>
<pre class="noborder">
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup>
  = 8476617176609948 x 2<sup>-89</sup>
</pre>
<p>Since <em>bInt</em> is 53-bit integer, <em>u</em> = 2<sup><em>bExp</em></sup>, and <em>h</em> is 2<sup><em>bExp</em>-1</sup>. Therefore,</p>
<pre class="noborder">
<em>b + h</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup> + 2<sup><em>bExp</em>-1</sup>
      = (<em>bInt</em> x 2 + 1) x 2<sup><em>bExp</em>-1</sup>
      = (8476617176609948 x 2 + 1) x 2<sup>-90</sup>
      = 16953234353219897 x 2<sup>-90</sup>
</pre>
<p><em>In code</em>: We create <em>bInt</em> and <em>bExp</em> from <em>b</em>&rsquo;s <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">underlying floating-point representation</a>: we treat its significand as a 53-bit integer, and subtract 52 from its power of two exponent. We create a big integer from <em>bInt</em>, and then we make it the integer portion of <em>b + h</em> by shifting it left and &lsquo;ORing&rsquo; a 1 into its least significant bit. We maintain the exponent of the power of two as a native integer, setting it to <em>bExp</em>-1.</p>
<h4>2. Scale <em>b + h</em> so its first significant digit is just to the left of the decimal point</h4>
<p>If you were to compute <em>b + h</em> from the expression above, you&#8217;d see its value is</p>
<p>0.0000000000136947136494643226044416541235590733258456475063269408565<br />
24743139743804931640625</p>
<p>(We&#8217;re not going to do that; I just wanted to show you where we&#8217;re headed. By the way, I computed it using <a title="Read Rick Regan's article &ldquo;Exploring Binary Numbers With PARI/GP Calculator&rdquo;" href="http://www.exploringbinary.com/exploring-binary-numbers-with-parigp-calculator/">PARI/GP</a>.)</p>
<p>What we want to do is scale <em>b + h</em> so that it looks like</p>
<p>1.3694713649464322604441654123559073325845647506326940856524743139743<br />
804931640625</p>
<p>In other words, we want to multiply it by a power of ten so that, when viewed as a decimal number, its significant digits look normalized; 10<sup>11</sup> does the trick, effectively shifting <em>b + h</em> <em>left</em> by 11 decimal places:</p>
<pre class="noborder">
Scaled <em>b + h</em> = 16953234353219897 x 2<sup>-90</sup> x 10<sup>11</sup>
</pre>
<p>This &ldquo;normalization&rdquo; sets up the digit generation algorithm.</p>
<p><em>In code</em>: A full-blown floating-point to decimal conversion would need to use logarithms to compute the power of ten; but in this case, we can obtain the power of ten exponent directly from <em>dStr</em>.</p>
<h4>3. Combine powers of two</h4>
<p>If we decompose the power of ten into a power of two and a power of five, we get</p>
<pre class="noborder">
Scaled <em>b + h</em> = 16953234353219897 x 2<sup>-90</sup> x 2<sup>11</sup> x 5<sup>11</sup>
</pre>
<p>We can now <a title="Read Rick Regan's Article &ldquo;Composing Powers of Two Using The Laws of Exponents&rdquo;" href="http://www.exploringbinary.com/composing-powers-of-two-using-the-laws-of-exponents/">combine powers of two</a> to reduce the size of the resulting big integers.</p>
<pre class="noborder">
Scaled <em>b + h</em> = 16953234353219897 x 2<sup>-79</sup> x 5<sup>11</sup>
</pre>
<p><em>In code</em>: The code is simple: just add the exponents of the power of two from <em>b + h</em> and the power of two from the power of ten scaling factor.</p>
<h4>4. Express scaled <em>b + h</em> as a fraction</h4>
<p>The negative power of two, 2<sup>-79</sup>, makes this a fraction:</p>
<pre class="noborder">
Scaled <em>b + h</em> = (16953234353219897 x 5<sup>11</sup>)/2<sup>79</sup>
             = 827794646153315283203125/604462909807314587353088
</pre>
<p><em>In code</em>: Compute the powers as big integers, using the absolute values of the exponents; then, multiply the terms out, as appropriate.</p>
<h4>5. Generate and check digits</h4>
<p>Scaled <em>b + h</em> has a numerator <em>num</em> and denominator <em>den</em>:</p>
<pre class="noborder">
<em>num</em> = 827794646153315283203125
<em>den</em> = 604462909807314587353088
</pre>
<p>The significant digits are generated from these integers using the &ldquo;repeated multiplication by ten&rdquo; algorithm; this is what&#8217;s done at each iteration (<em>dig</em> stands for &lsquo;digit&rsquo;, and <em>rem</em> stands for &lsquo;remainder&rsquo;):</p>
<ol>
<li><em>dig</em> = <em>num</em>/<em>den</em> (just the integer part of the quotient)</li>
<li><em>rem</em> = <em>num</em> &#8211; <em>dig</em> x <em>den</em></li>
<li><em>num</em> = <em>rem</em> x 10</li>
</ol>
<p>In other words, at each iteration, a digit is peeled off and the rest of the number is shifted left one decimal place. This step is repeated until a corresponding digit from <em>dStr</em> and <em>b + h</em> don&#8217;t match; here are the first three iterations:</p>
<pre class="noborder">
1. 827794646153315283203125/<em>den</em>  = 1 <em>rem</em> 223331736346000695850037
2. 2233317363460006958500370/<em>den</em> = 3 <em>rem</em> 419928634038063196441106
3. 4199286340380631964411060/<em>den</em> = 6 <em>rem</em> 572508881536744440292532
4. ...
</pre>
<p>This stops after step 19; that is, <em>dStr</em> and <em>b + h</em> differ at the 19th digit. Here are the digits of <em>dStr</em> and <em>b + h</em> aligned, to show where they differ:</p>
<pre class="noborder">
Digits of <em>dStr</em>  = 136947136494643226<span class="highlight_gray4">3</span>1
Digits of <em>b + h</em> = 136947136494643226<span class="highlight_gray4">0</span>444165412355907332584564750632
6940856524743139743804931640625
</pre>
<p>Since 3 is greater than 0, <em>dStr</em> is closer to <em>b + u</em> than it is to <em>b</em>, so <em>b + u</em> is the answer.</p>
<h3>Example 2: Check 9.3170532238714134438e+16</h3>
<p>In this example, we want to check if <em>dStr</em> = 9.3170532238714134438e+16 is closer to <em>b</em> = 0&#120;1.4b021afd9f651p+56 or <em>b + u</em> = 0&#120;1.4b021afd9f652p+56. We do this by comparing <em>dStr</em> to <em>b + h</em> =  0&#120;1.4b021afd9f6518p+56.</p>
<h4>1. Represent <em>b + h</em> as an integer times a power of two</h4>
<pre class="noborder">
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup>
  = 5823158264919633 x 2<sup>4</sup>

<em>b + h</em> = (<em>bInt</em> x 2 + 1) x 2<sup><em>bExp</em>-1</sup>
      = (5823158264919633 x 2 + 1) x 2<sup>3</sup>
      = 11646316529839267 x 2<sup>3</sup>
</pre>
<h4>2. Scale <em>b + h</em> so its first significant digit is just to the left of the decimal point</h4>
<p>If you were to compute <em>b + h</em> from the expression above, you&#8217;d see its value is</p>
<p>93170532238714136</p>
<p>We want to shift it <em>right</em> 16 decimal places so it looks like</p>
<p>9.3170532238714136</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267 x 2<sup>3</sup> x 10<sup>-16</sup>
</pre>
<h4>3. Combine powers of two</h4>
<p>If we decompose the power of ten into a power of two and a power of five, we get</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267 x 2<sup>3</sup> x 2<sup>-16</sup> x 5<sup>-16</sup>
</pre>
<p>Combining powers of two we get</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267 x 2<sup>-13</sup> x 5<sup>-16</sup>
</pre>
<h4>4. Express scaled <em>b + h</em> as a fraction</h4>
<p>The negative powers of two and five make this a fraction:</p>
<pre class="noborder">
Scaled <em>b + h</em> = 11646316529839267/(2<sup>13</sup> x 5<sup>16</sup>)
             = 11646316529839267/1250000000000000
</pre>
<h4>5. Generate and check digits</h4>
<p>Scaled <em>b + h</em> has a numerator <em>num</em> and denominator <em>den</em>:</p>
<pre class="noborder">
<em>num</em> = 11646316529839267
<em>den</em> = 1250000000000000
</pre>
<p>Here are the first three iterations of the repeated multiplication by ten algorithm:</p>
<pre class="noborder">
1. 11646316529839267/<em>den</em> = 9 <em>rem</em> 396316529839267
2. 3963165298392670/<em>den</em>  = 3 <em>rem</em> 213165298392670
3. 2131652983926700/<em>den</em>  = 1 <em>rem</em> 881652983926700
4. ...
</pre>
<p>This stops after step 17; that is, <em>dStr</em> and <em>b + h</em> differ at the 17th digit. Here are the digits of <em>dStr</em> and <em>b + h</em> aligned, to show where they differ:</p>
<pre class="noborder">
Digits of <em>dStr</em>  = 9317053223871413<span class="highlight_gray4">4</span>438
Digits of <em>b + h</em> = 9317053223871413<span class="highlight_gray4">6</span>
</pre>
<p>Since 4 is less than 6, <em>dStr</em> is closer to <em>b</em> than it is to <em>b + u</em>, so <em>b</em> is the answer.</p>
<h3>Discussion</h3>
<p>strtod() has an extra step, which I&#8217;ve not described. It scales the numerator and denominator by a common power of two, which is necessary for its &lsquo;quorem()&rsquo; function to compute the quotient and remainder properly. The power of two exponent is determined by a function called &lsquo;dshift().&rsquo;</p>
<p>If <em>dStr</em> represents a fractional value, step 2 (scale by power of ten) and step 3 (combine powers of two) are technically unnecessary. Without them, step 5 (generate digits) would generate leading zeros, which could be ignored. (Of course, the pre-scaling is more elegant, as it reduces everything to one form.)</p>
<h2>The 40-Digit Cutoff Is Too Low</h2>
<p>bigcomp() does a lot of work with big integers, which made me wonder why it performed better than the basic approximation check. I ran some tests, and found that it <em>doesn&#8217;t</em> perform better &#8212; not at the default STRTOD_DIGLIM value of 40! My tests indicate that bigcomp() only performs better starting at somewhere between 80 and 83 digits. (I don&#8217;t know why 40 was chosen as the default.) bigcomp() outperforms the basic check for really long inputs; for example, the difference in performance is great at 500 digits, and substantial at 1000 digits.</p>
<p>Let&#8217;s look at a 1000-digit input to see the difference in the big integers involved:</p>
<pre class="noborder">
<em>dStr</em> = 5.39393324717760434968487301635560570642059608762708229108908
24537754965408622951477281124864324524316631524646435056101989215441
13243584680626358335990970896315873327586087827333915843584231417344
18113410301502801068079096892041478634628161830670245841932386407800
12745803879948980583567722824049594060350336856210415679895612943123
50098271611832693583464221850190462604938332826437767915224673375308
74570398083131101089815004834148500912320646940469604931721018872877
13566715387976682236017863055187831755129573243931855002598816011501
73130862094578091926124182802567489337171671008075773833668718543916
59331687129078431660248055444327878892028873933579030052796118406453
59419797286387142281725295340184928613290361074947899276429780250293
85401861377815243939607988608126354001619571808095966555370152710852
57007714188421395421027623995592231545518458559072989336658831420578
08271261804364513313549623924640140003927894134813865873451090057841
21193123838848522389829029095139051820951352412966707477e+16
</pre>
<p>Without bigcomp, using the <a title="Read Rick Regan's article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">basic approximation check</a>, these are the variables involved:</p>
<pre class="noborder">
<em>dS</em> = 539393324717760434968487301635560570642059608762708229108908245
37754965408622951477281124864324524316631524646435056101989215441132
43584680626358335990970896315873327586087827333915843584231417344181
13410301502801068079096892041478634628161830670245841932386407800127
45803879948980583567722824049594060350336856210415679895612943123500
98271611832693583464221850190462604938332826437767915224673375308745
70398083131101089815004834148500912320646940469604931721018872877135
66715387976682236017863055187831755129573243931855002598816011501731
30862094578091926124182802567489337171671008075773833668718543916593
31687129078431660248055444327878892028873933579030052796118406453594
19797286387142281725295340184928613290361074947899276429780250293854
01861377815243939607988608126354001619571808095966555370152710852570
07714188421395421027623995592231545518458559072989336658831420578082
71261804364513313549623924640140003927894134813865873451090057841211
93123838848522389829029095139051820951352412966707477

<em>bS</em> = 539393324717760400000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000

<em>hS</em> = 400000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000

&#124;<em>dS-bS</em>&#124; = 3496848730163556057064205960876270822910890824537754965408
62295147728112486432452431663152464643505610198921544113243584680626
35833599097089631587332758608782733391584358423141734418113410301502
80106807909689204147863462816183067024584193238640780012745803879948
98058356772282404959406035033685621041567989561294312350098271611832
69358346422185019046260493833282643776791522467337530874570398083131
10108981500483414850091232064694046960493172101887287713566715387976
68223601786305518783175512957324393185500259881601150173130862094578
09192612418280256748933717167100807577383366871854391659331687129078
43166024805544432787889202887393357903005279611840645359419797286387
14228172529534018492861329036107494789927642978025029385401861377815
24393960798860812635400161957180809596655537015271085257007714188421
39542102762399559223154551845855907298933665883142057808271261804364
51331354962392464014000392789413481386587345109005784121193123838848
522389829029095139051820951352412966707477
</pre>
<p>&#124;<em>dS-bS</em>&#124; is less than <em>hS</em>, so <em>b</em> is chosen.</p>
<p><em>With</em> bigcomp, there are two sets of big integers involved: one set for the basic approximation check (on the truncated value), and one set for bigcomp. Here are the big integers used for the basic approximation check:</p>
<pre class="noborder">
<em>dS</em> = 539393324717760434
<em>bS</em> = 539393324717760400
<em>hS</em> = 40
&#124;<em>dS-bS</em>&#124; = 34
</pre>
<p>Here are the big integers used for bigcomp (not shown is the extra factor of 2<sup>8</sup> in the numerator and denominator due to dshift()):</p>
<pre class="noborder">
<em>num</em> = 13484833117944011
<em>den</em> = 2500000000000000
</pre>
<p>bigcomp() finds that digit 17 of <em>dStr</em> (3) is less than digit 17 of <em>b + h</em> (4), so it chooses <em>b</em>.</p>
<h2>Acknowledgement</h2>
<p>Thanks to Mark Dickinson for his commentary on bigcomp, both in <a title="Python dtoa.c" href="http://svn.python.org/view/python/branches/py3k-dtoa/Python/dtoa.c?revision=83813&#038;view=markup">Python&#8217;s version of David Gay&#8217;s dtoa.c</a>, and in private communication.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/">Bigcomp: Deciding Truncated, Near Halfway Conversions</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Exploring Binary: One Million Views</title>
		<link>http://www.exploringbinary.com/exploring-binary-one-million-views/</link>
		<comments>http://www.exploringbinary.com/exploring-binary-one-million-views/#comments</comments>
		<pubDate>Tue, 08 Nov 2011 04:43:43 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[About]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=351</guid>
		<description><![CDATA[Today (November 7, 2011), exploringbinary.com received its one millionth view, according to WordPress.com Stats: This site went live on November 14, 2008; that works out to an average of over 920 views per day. My decimal/binary converter (an HTML form with an accompanying article) accounts for nearly half of those views (445,700+). Not counting my [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/exploring-binary-one-million-views/">Exploring Binary: One Million Views</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Today (November 7, 2011), exploringbinary.com received its one millionth view, according to WordPress.com Stats:</p>
<div class="wp-caption aligncenter" style="width: 325px"><img src="http://www.exploringbinary.com/wp-content/uploads/WPstats.oneMillionViews.png" alt="WordPress.com Stats Reports One Million Views" width="315" height="149"/><p class="wp-caption-text">WordPress.com Stats Reports One Million Views</p></div>
<p><span id="more-351"></span></p>
<p>This site went live on November 14, 2008; that works out to an average of over 920 views per day. My <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal/binary converter</a> (an HTML form with an accompanying article) accounts for nearly half of those views (445,700+).  Not counting my converter, these are my top ten articles:</p>
<ol>
<li><a title="Read Rick Regan's Article &ldquo;Java Hangs When Converting 2.2250738585072012e-308&rdquo;" href="http://www.exploringbinary.com/java-hangs-when-converting-2-2250738585072012e-308/">Java Hangs When Converting 2.2250738585072012e-308</a> (134,500+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;PHP Hangs On Numeric Value 2.2250738585072011e-308&rdquo;" href="http://www.exploringbinary.com/php-hangs-on-numeric-value-2-2250738585072011e-308/">PHP Hangs On Numeric Value 2.2250738585072011e-308</a> (93,600+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;Binary Dates in 2010 and 2011&rdquo;" href="http://www.exploringbinary.com/binary-dates-in-2010-and-2011/">Binary Dates in 2010 and 2011</a> (45,600+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;Ten Ways to Check if an Integer Is a Power Of Two in C&rdquo;" href="http://www.exploringbinary.com/ten-ways-to-check-if-an-integer-is-a-power-of-two-in-c/">Ten Ways to Check if an Integer Is a Power Of Two in C</a> (37,400+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;" href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/">Converting Floating-Point Numbers to Binary Strings in C</a> (15,200+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;Why &ldquo;Volatile&rdquo; Fixes the 2.2250738585072011e-308 Bug&rdquo;" href="http://www.exploringbinary.com/why-volatile-fixes-the-2-2250738585072011e-308-bug/">Why &ldquo;Volatile&rdquo; Fixes the 2.2250738585072011e-308 Bug</a> (12,200+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;FPUpdater Fixes the Java 2.2250738585072012e-308 Bug&rdquo;" href="http://www.exploringbinary.com/fpupdater-fixes-the-java-2-2250738585072012e-308-bug/">FPUpdater Fixes the Java 2.2250738585072012e-308 Bug</a> (10,600+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;How I Taught Third Graders Binary Numbers&rdquo;" href="http://www.exploringbinary.com/how-i-taught-third-graders-binary-numbers/">How I Taught Third Graders Binary Numbers</a> (9,800+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;The Answer is One (Unless You Use Floating-Point)&rdquo;" href="http://www.exploringbinary.com/the-answer-is-one-unless-you-use-floating-point/">The Answer is One (Unless You Use Floating-Point)</a> (9,700+ views)</li>
<li><a title="Read Rick Regan's Article &ldquo;A Closer Look at the Java 2.2250738585072012e-308 Bug&rdquo;" href="http://www.exploringbinary.com/a-closer-look-at-the-java-2-2250738585072012e-308-bug/">A Closer Look at the Java 2.2250738585072012e-308 Bug</a> (9,200+ views)</li>
</ol>
<p>Five of my top ten articles are related to the PHP and Java floating-point conversion bugs; together they total 260,200+ views.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/exploring-binary-one-million-views/">Exploring Binary: One Million Views</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/exploring-binary-one-million-views/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Integers to Check a Floating-Point Approximation</title>
		<link>http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/</link>
		<comments>http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/#comments</comments>
		<pubDate>Mon, 31 Oct 2011 14:49:32 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Exponents]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=349</guid>
		<description><![CDATA[For decimal inputs that don&#8217;t qualify for fast path conversion, David Gay&#8217;s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to calculate an approximation to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">Using Integers to Check a Floating-Point Approximation</a></p>
]]></description>
			<content:encoded><![CDATA[<p>For decimal inputs that don&#8217;t qualify for <a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">fast path conversion</a>, David Gay&#8217;s strtod() function does three things: first, it uses IEEE double-precision floating-point arithmetic to <a title="Read Rick Regan's article &ldquo;strtod()&rsquo;s Initial Decimal to Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">calculate an approximation</a> to the correct result; next, it uses arbitrary-precision integer arithmetic (AKA big integers) to check if the approximation is correct; finally, it adjusts the approximation, if necessary. In this article, I&#8217;ll explain the second step &#8212; how the check of the approximation is done.</p>
<p><span id="more-349"></span></p>
<h2>Overview</h2>
<p>The goal of David Gay&#8217;s strtod() is to perform efficient, correctly rounded decimal to floating point conversion. To that end, it minimizes the overhead due to arbitrary-precision arithmetic, which is unavoidable for some conversions. In particular, strtod() avoids <em>division</em> of big integers, as would be required by a <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">full-blown big integer algorithm</a>. Instead of dividing, it multiplies, adds, subtracts, and compares.</p>
<p>The key to the algorithm is the floating-point approximation. It can be compared to the original decimal input to see if it&#8217;s too small, correct, or too big (<a title="Read Rick Regan's article &ldquo;strtod()&rsquo;s Initial Decimal to Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">it will be within 10 ULPs or so of correct</a>). The approximation can be adjusted accordingly.</p>
<p>(My discussion will deal only with normal double-precision numbers; strtod() handles subnormal numbers as well.)</p>
<h2>Correctly Rounded Conversions</h2>
<p>For a floating-point number <em>b</em> to be considered the correctly rounded conversion of a decimal number <em>d</em>, it must be the <em>closest</em> floating-point number to <em>d</em>. That definition sounds simple enough, but there&#8217;s a problem: the meaning of &ldquo;closest&rdquo; can depend on the value of <em>b</em>. In any case, the tests for closeness are all based on a measure called a ULP &#8212; a binary (for our purposes) <em>unit in the last place</em>. A ULP is the gap between adjacent floating-point numbers.</p>
<p>I&#8217;ll present the tests as three separate cases, although in code they must be considered together.</p>
<h3>Case 1: <em>d</em> is less than one-half ULP away from <em>b</em></h3>
<p>If <em>d</em> is less than one-half of a ULP from <em>b</em>, then <em>b</em> is the correctly rounded conversion of <em>d</em> (this test is ambiguous when <em>b</em> is a power of two &#8212; I&#8217;ll explain below). Mathematically, this is stated as</p>
<p>b &#8211; <em>h</em> &lt; d &lt; b + <em>h</em> ,</p>
<p>where <em>h</em> = 1/2&middot;<em>u</em>, and where <em>u</em> equals one ULP.</p>
<p>Here&#8217;s what the relationship looks like:</p>
<div class="wp-caption aligncenter" style="width: 494px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinInterval.png" alt="Correct rounding when d is less than one-half ULP away from b" width="484" height="166"/><p class="wp-caption-text">Correct rounding when <em>d</em> is less than one-half ULP away from <em>b</em></p></div>
<p>Equivalently, this relationship can be written as</p>
<p>-<em>h</em> &lt; <em>d</em> &#8211; <em>b</em> &lt; <em>h</em> ,</p>
<p>or</p>
<p>&#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em> .</p>
<h3>Case 2: <em>d</em> is one-half ULP away from <em>b</em></h3>
<p>If <em>d</em> is one-half of a ULP away from <em>b</em>, then <em>b</em> is the correctly rounded conversion of <em>d</em> only if <em>b</em>&rsquo;s least significant bit is 0  (this test is ambiguous when <em>b</em> is a power of two &#8212; I&#8217;ll explain below). This is called <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-half-to-even</a> rounding.</p>
<p>The relationship is written as</p>
<p>&#124;<em>d</em> &#8211; <em>b</em>&#124; = <em>h</em></p>
<div class="wp-caption aligncenter" style="width: 495px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinInterval.halfway.png" alt="Correct rounding when d is one-half ULP away from b" width="485" height="197"/><p class="wp-caption-text">Correct rounding when <em>d</em> is one-half ULP away from <em>b</em></p></div>
<h3>Case 3: <em>b</em> is a power of two</h3>
<p>If <em>b</em> is a power of two, two different ULP values come into play, making our half ULP test ambiguous. One ULP <em>below</em> <em>b</em> is half as much as one ULP <em>above</em> <em>b</em>, since a ULP doubles at each increasing power of two boundary. This means that if <em>b</em> is a power of two, it is the correctly rounded conversion of <em>d</em> only if <em>d</em> is no more than one-half ULP above <em>b</em> or no more than one-quarter of that same (upper) ULP below <em>b</em>; here is the inequality:</p>
<p>-<em>h</em>/2 &le; <em>d</em> &#8211; <em>b</em> &le; <em>h</em> .</p>
<p>This is what it looks like:</p>
<div class="wp-caption aligncenter" style="width: 499px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinInterval.PO2.png" alt="Correct rounding when b is a power of two" width="489" height="175"/><p class="wp-caption-text">Correct rounding when <em>b</em> is a power of two</p></div>
<p>We use the &ldquo;&le;&rdquo; signs in the inequality because <em>b</em> is the correctly rounded result even if <em>d</em> = <em>b</em> &#8211; <em>h</em>/2 or <em>d</em> = <em>b</em> + <em>h</em>. This is because the significands of (normal) powers of two are zero.</p>
<h2>Comparing Using Big Integers</h2>
<p>To explain how the algorithm works, I will use this example: <em>d</em> = 1.7864e-45 and <em>b</em> = 0&#120;1.465a72e467d88p-149 (the latter is expressed as a <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constant</a>).  Here&#8217;s how <em>d</em> and <em>b</em> are related:</p>
<div class="wp-caption aligncenter" style="width: 502px"><img src="http://www.exploringbinary.com/wp-content/uploads/decBinExample.png" alt="1.7864 x 10^-45 In Relation to Its Binary Floating-Point Approximation 0&#120;1.465a72e467d88p-149" width="492" height="215"/><p class="wp-caption-text">1.7864 x 10<sup>-45</sup> In Relation to Its Binary Floating-Point Approximation, 0&#120;1.465a72e467d88p-149</p></div>
<p>The diagram shows that <em>b</em> is the closest floating-point number to <em>d</em> &#8212; but how do we check that programmatically, using binary arithmetic? We need to apply the above test: &#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em>. (I know a priori this is not a special case &#8212; I wanted to keep my example simple.) But there&#8217;s a problem: how do we represent <em>d</em>? Answer: as a big integer. If we also represent <em>b</em> and <em>h</em> as big integers, such that they are proportional to <em>d</em>, we can apply the test given by the inequality.</p>
<p>I&#8217;ve broken the process into four steps:</p>
<h3>Step 1: Represent the significant digits of <em>d</em> and <em>b</em> as integers</h3>
<p>The first step is to represent <em>d</em> as an integer times a power of ten, and <em>b</em> as an integer times a power of two:</p>
<p><em>d</em> = <em>dInt</em> x 10<sup><em>dExp</em></sup><br />
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup></p>
<p><em>d</em> was already put in this form when the <a title="Read Rick Regan's article &ldquo;strtod()’s Initial Decimal to Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">floating point approximation <em>b</em> was computed</a> (but for the purposes of the comparison, <em>all</em> digits of <em>d</em> are used, not just the first 16). <em>b</em> can be put in this form simply, by <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">accessing its underlying floating-point representation</a>: we &ldquo;unnormalize&rdquo; its significand into an integer by decreasing its power of two exponent. We represent the significant digits as a <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">53-bit integer</a>, which means we subtract 52 from its power of two exponent.</p>
<p>For our example, <em>d</em> = 1.7864e-45 is represented as 17864 x 10<sup>-49</sup>, and <em>b</em> = 0&#120;1.465a72e467d88p-149, which in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> is</p>
<p>1.0100011001011010011100101110010001100111110110001 x 2<sup>-149</sup> ,</p>
<p>is represented (in decimal numerals) as 5741268244528520 x 2<sup>-201</sup>. </p>
<h3>Step 2: Represent <em>h</em> as a power of two</h3>
<p>Since <em>bInt</em> is 53-bit integer, <em>u</em> = 2<sup><em>bExp</em></sup>, and <em>h</em> is half that: 2<sup><em>bExp</em>-1</sup>. For consistency in my discussion, I&#8217;ll substitute the variable <em>hExp</em> for <em>bExp</em>-1, giving</p>
<p><em>h</em> = 2<sup><em>hExp</em></sup> .</p>
<h3>Step 3: Scale the inequality</h3>
<p>To complete the conversion to integers, we need to get rid of any negative powers; we do this by scaling both sides of the inequality. Let&#8217;s look at the inequality again:</p>
<p>&#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em> .</p>
<p>Here are the three values as we expressed them above:</p>
<p><em>d</em> = <em>dInt</em> x 10<sup><em>dExp</em></sup><br />
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup><br />
<em>h</em> = 2<sup><em>hExp</em></sup></p>
<p>For efficiency, the scaling process will factor out common powers of two from the three powers; to facilitate this, we rewrite 10<sup><em>dExp</em></sup> as 2<sup><em>dExp</em></sup> x 5<sup><em>dExp</em></sup>; now we have</p>
<p><em>d</em> = <em>dInt</em> x 2<sup><em>dExp</em></sup> x 5<sup><em>dExp</em></sup><br />
<em>b</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup><br />
<em>h</em> = 2<sup><em>hExp</em></sup></p>
<p>If any of the exponents <em>dExp</em>, <em>bExp</em>, or <em>hExp</em> are negative, we make them nonnegative by scaling. To reflect any scaling, let&#8217;s rename <em>d</em>, <em>b</em>, and <em>h</em> to <em>dS</em>, <em>bS</em>, and <em>hS</em>, respectively. The scaled inequality is now written</p>
<p>&#124;<em>dS</em> &#8211; <em>bS</em>&#124; &lt; <em>hS</em> ,</p>
<p>and the initial values of <em>dS</em>, <em>bS</em>, and <em>hS</em> are</p>
<p><em>dS</em> = <em>dInt</em> x 2<sup><em>dExp</em></sup> x 5<sup><em>dExp</em></sup><br />
<em>bS</em> = <em>bInt</em> x 2<sup><em>bExp</em></sup><br />
<em>hS</em> = 2<sup><em>hExp</em></sup></p>
<p>To show how scaling works, let&#8217;s continue with our example above; here are the initial values of the variables:</p>
<p><em>dS</em> = 17864  x  2<sup>-49</sup> x 5<sup>-49</sup><br />
<em>bS</em> = 5741268244528520 x 2<sup>-201</sup><br />
<em>hS</em> = 2<sup>-202</sup></p>
<p>In this example, all three exponents &#8212; <em>dExp</em>, <em>bExp</em>, and <em>hExp</em>  &#8212; are negative; we scale the values with a common factor <em>S</em> that makes them all nonnegative. The least common factor that does this is <em>S</em> = 2<sup>202</sup> x 5<sup>49</sup>; our variables then become</p>
<p><em>dS</em> = 17864  x  2<sup>153</sup><br />
<em>bS</em> = 5741268244528520 x 2<sup>1</sup> x 5<sup>49</sup><br />
<em>hS</em> = 5<sup>49</sup></p>
<p>Multiplied out, these are</p>
<p><em>dS</em> = 203970822259994138521801764465966248930731085529088<br />
<em>bS</em> = 203970822259994122305215569213032722473144531250000<br />
<em>hS</em> = 17763568394002504646778106689453125</p>
<h3>Step 4: Apply the test</h3>
<p>Now all we have to do is check the inequality, &#124;<em>dS</em> &#8211; <em>bS</em>&#124; &lt; <em>hS</em>:</p>
<p>&#124;<em>dS</em> &#8211; <em>bS</em>&#124; = 16216586195252933526457586554279088, which is less than <em>hS</em> = 17763568394002504646778106689453125. This proves that <em>b</em> is the correct conversion of <em>d</em>.</p>
<p>(In the diagram for this example above, I&#8217;ve drawn the relationship of <em>d</em> and <em>b</em> proportionally: (<em>dS</em> &#8211; <em>bS</em>)/<em>hS</em> is approximately 0.91, which means that <em>d</em> is about  0.91<em>h</em> above <em>b</em>.)</p>
<h2>Computing the Scaling Exponents In Code</h2>
<p>In code, the scaling is done by operating on native integer variables representing the exponents of the powers of two and five in each of the three scaled values. After the exponents are created, common factors of two are removed to reduce the size of the resulting big integers. Here is how I coded it in C (this is essentially how it&#8217;s done in David Gay&#8217;s strtod() and Java’s FloatingDecimal class; I&#8217;ve renamed the variables and structured it a bit differently to make it clearer):</p>
<pre>
//Initialize scaling exponents
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 0;
bS_Exp5 = 0;
hS_Exp2 = 0;
hS_Exp5 = 0;

//Adjust for decimal exponent
if (dExp >= 0) {
   dS_Exp2 += dExp;
   dS_Exp5 += dExp;
   }
else {
   bS_Exp2 -= dExp;
   bS_Exp5 -= dExp;
   hS_Exp2 -= dExp;
   hS_Exp5 -= dExp;
   }

//Adjust for binary exponent
if (bExp >= 0) {
    bS_Exp2 += bExp;
   }
else {
   dS_Exp2 -= bExp;
   hS_Exp2 -= bExp;
   }

//Adjust for half ulp exponent
if (hExp >= 0) {
   hS_Exp2 += hExp;
   }
else {
   dS_Exp2 -= hExp;
   bS_Exp2 -= hExp;
   }

//Remove common power of two factor from all three scaled values
common_Exp2 = min(dS_Exp2, min(bS_Exp2, hS_Exp2));
dS_Exp2 -= common_Exp2;
bS_Exp2 -= common_Exp2;
hS_Exp2 -= common_Exp2;
</pre>
<p>Here is how the code executes for our example (I&#8217;ll show the state of the exponent variables after adjusting for each exponent):</p>
<pre class="indented">
//After initializing scaling exponents
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 0;
bS_Exp5 = 0;
hS_Exp2 = 0;
hS_Exp5 = 0;

//After adjusting for decimal exponent (dExp = -49)
dS_Exp2 = 0;
dS_Exp5 = 0;
bS_Exp2 = 49;
bS_Exp5 = 49;
hS_Exp2 = 49;
hS_Exp5 = 49;

//After adjusting for binary exponent (bExp = -201)
dS_Exp2 = 201;
dS_Exp5 = 0;
bS_Exp2 = 49;
bS_Exp5 = 49;
hS_Exp2 = 250;
hS_Exp5 = 49;

//After adjusting for half ulp exponent (hExp = -202)
dS_Exp2 = 403;
dS_Exp5 = 0;
bS_Exp2 = 251;
bS_Exp5 = 49;
hS_Exp2 = 250;
hS_Exp5 = 49;

//After removing common power of two factor of 2^250
dS_Exp2 = 153;
dS_Exp5 = 0;
bS_Exp2 = 1;
bS_Exp5 = 49;
hS_Exp2 = 0;
hS_Exp5 = 49;
</pre>
<p>After this code executes, the powers are multiplied out as big integers and used to create big integers <em>dS</em>, <em>bS</em>, and <em>hS</em>.</p>
<h3>When <em>d</em>, <em>b</em>, and <em>h</em> are already integers</h3>
<p>When all three exponents are nonnegative, <em>d</em>, <em>b</em>, and <em>h</em> start out as integer values. In this case, they are scaled <em>down</em>, by their common power of two factor.</p>
<h2>Testing the Inequality In Code</h2>
<p>In strtod(), the checking centers around a comparison of &#124;<em>d</em> &#8211; <em>b</em>&#124; (actually, &#124;<em>b</em> &#8211; <em>d</em>&#124;, which is equivalent) to <em>h</em>, which results in three outcomes: &#124;<em>d</em> &#8211; <em>b</em>&#124; &lt; <em>h</em>, &#124;<em>d</em> &#8211; <em>b</em>&#124; = <em>h</em>, or &#124;<em>d</em> &#8211; <em>b</em>&#124; &gt; <em>h</em>. The special cases are worked into the first two outcomes. (Java’s FloatingDecimal class does similar checks, but it is structured differently.)</p>
<p>Regarding the required check at a power of two boundary (comparing <em>d</em> &#8211; <em>b</em> to -<em>h</em>/2), strtod() scales <em>d</em> &#8211; <em>b</em> by two &#8212; to ensure that all values remain integers. This gives the equivalent comparison of 2(<em>d</em> &#8211; <em>b</em>) to -<em>h</em>. (In many cases, <em>h</em> is divisible by two; strtod() could just as easily divide <em>h</em> by two instead of multiplying <em>d</em> &#8211; <em>b</em> by two. Java’s FloatingDecimal class does the former.)</p>
<h3>Adjusting <em>b</em></h3>
<p>strtod() makes theses checks in a loop, and adjusts <em>b</em> as necessary (that is beyond the scope of this article).</p>
<h2>strtod()&rsquo;s Optimization Called &ldquo;Bigcomp&rdquo;</h2>
<p>strtod() has an optimization called &ldquo;<a title="Read Rick Regan's Article &ldquo;Bigcomp: Deciding Truncated, Near Halfway Conversions&rdquo;" href="http://www.exploringbinary.com/bigcomp-deciding-truncated-near-halfway-conversions/">bigcomp</a>&rdquo;, which limits the big integer value of <em>d</em> to 40 decimal digits. For decimal inputs of 40 significant digits or less, all 40 are used for <em>d</em>; for decimal inputs greater than 40 digits, only the first 18 are used. Compensating for this truncation of <em>d</em> leads to a more complicated test procedure. (This optimization is on by default; you have to &#35;define variable &ldquo;NO_STRTOD_BIGCOMP&rdquo; to turn it off.)</p>
<h2>References</h2>
<ul>
<li>David Gay&#8217;s <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">dtoa.c</a> (see the strtod() function).</li>
<li>David Gay&#8217;s Technical Report <a title="David Gay's Technical Report &ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;" href="http://www.ampl.com/REFS/rounding.pdf">&ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;</a>.</li>
<li>William D. Clinger&#8217;s paper <a title="Read William D. Clinger's paper &ldquo;How to Read Floating Point Numbers Accurately&rdquo;" href="http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf">&ldquo;How to Read Floating Point Numbers Accurately&rdquo;</a> (see &lsquo;algorithmR&rsquo;).</li>
<li><a title="FloatingDecimal.java" href="http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html">FloatingDecimal.java</a> (see the doubleValue() method).</li>
<li><a title="Python dtoa.c" href="http://svn.python.org/view/python/branches/py3k-dtoa/Python/dtoa.c?revision=83813&#038;view=markup">Python&#8217;s version of David Gay&#8217;s dtoa.c</a>.</li>
</ul>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">Using Integers to Check a Floating-Point Approximation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>strtod()’s Initial Decimal to Floating-Point Approximation</title>
		<link>http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/</link>
		<comments>http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/#comments</comments>
		<pubDate>Tue, 27 Sep 2011 15:13:05 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Exponents]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=347</guid>
		<description><![CDATA[David Gay&#8217;s strtod() function does decimal to floating-point conversion using both IEEE double-precision floating-point arithmetic and arbitrary-precision integer arithmetic. For some inputs, a simple IEEE floating-point calculation suffices to produce the correct result; for other inputs, a combination of IEEE arithmetic and arbitrary-precision arithmetic is required. In the latter case, IEEE arithmetic is used to [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">strtod()’s Initial Decimal to Floating-Point Approximation</a></p>
]]></description>
			<content:encoded><![CDATA[<p>David Gay&#8217;s strtod() function does decimal to floating-point conversion using both IEEE double-precision floating-point arithmetic and arbitrary-precision integer arithmetic. For some inputs, <a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">a simple IEEE floating-point calculation suffices</a> to produce the correct result; for other inputs, a combination of IEEE arithmetic and arbitrary-precision arithmetic is required. In the latter case, IEEE arithmetic is used to calculate an <em>approximation</em> to the correct result, which is then refined using arbitrary-precision arithmetic. In this article, I&#8217;ll describe the approximation calculation, which is based on a form of binary exponentiation.</p>
<p><span id="more-347"></span></p>
<h2>Overview of the Approximation Calculation</h2>
<p>Conceptually, the approximation calculation involves two floating-point values: the significant digits of the decimal number, expressed as an integer, and a corresponding power of ten. For example, for the decimal string 163.118762e+109, the calculation is 163118762 x 10<sup>103</sup>; for the decimal string 8.453127e-67, the calculation is 8453127 x 10<sup>-73</sup>. If there are more than 16 significant digits, only the first 16 are used. The decimal string 6.2187331579177550499956283e+100, for example, would be treated as 6.218733157917755e+100, making the calculation 6218733157917755 x 10<sup>85</sup>.</p>
<p>The powers of ten can be very small or very large. For double-precision floating-point (I&#8217;m considering normal numbers only), they range from 10<sup>-323</sup> (e.g., for DBL_MIN = 2.2250738585072014e-308) to 10<sup>308</sup> (e.g., for 1e+308). The goal is to compute them efficiently.</p>
<h2>Right-to-Left Binary Exponentiation</h2>
<p>An efficient way to compute powers of ten is using <a title="Read Dr. Math's description of binary exponentiation" href="http://mathforum.org/library/drmath/view/55603.html">right-to-left binary exponentiation</a>. Conceptually, the process is this: given a power b<sup>n</sup> to compute, write <em>n</em> as a sum of powers of two; then, using the <a href="http://www.exploringbinary.com/the-laws-of-exponents/" title="Read Rick Regan's Article &ldquo;The Laws of Exponents&rdquo;">laws of exponents</a>, rewrite b<sup>n</sup> as a product of <em>b</em> raised to each power of two. For example:</p>
<p>10<sup>103</sup> = 10<sup>1 + 2 + 4 + 32 + 64</sup> = 10<sup>1</sup> * 10<sup>2</sup> * 10<sup>4</sup> * 10<sup>32</sup> * 10<sup>64</sup> .</p>
<p>Evaluating this expression requires only ten multiplications &#8212; six to compute the powers of ten 10<sup>2</sup> through 10<sup>64</sup> (by successive squaring), and four to multiply the five required powers together. If 10<sup>103</sup> were computed naively, it would require 102 multiplications.</p>
<p>As an example for computing negative exponents, consider </p>
<p>10<sup>-73</sup> = 10<sup>-(1 + 8 + 64)</sup> = 10<sup>-1</sup> * 10<sup>-8</sup> * 10<sup>-64</sup> .</p>
<p>This requires eight multiplications &#8212; six to compute the powers of ten 10<sup>-2</sup> through 10<sup>-64</sup>, and two to multiply the three required powers together. The naive calculation would do 72 multiplications.</p>
<h3>Required Set of Powers</h3>
<p>For the range of exponents of double-precision values, the required set of &ldquo;binary powers&rdquo; (powers of ten raised to powers of two) for right-to-left binary exponentiation are</p>
<ul>
<li>{10<sup>-1</sup>, 10<sup>-2</sup>, 10<sup>-4</sup>, 10<sup>-8</sup>, 10<sup>-16</sup>, 10<sup>-32</sup>, 10<sup>-64</sup>, 10<sup>-128</sup>, 10<sup>-256</sup>} for negative exponents, and</li>
<li>{10<sup>1</sup>, 10<sup>2</sup>, 10<sup>4</sup>, 10<sup>8</sup>, 10<sup>16</sup>, 10<sup>32</sup>, 10<sup>64</sup>, 10<sup>128</sup>, 10<sup>256</sup>} for positive exponents.</li>
</ul>
<p>Instead of computing these powers for each conversion, they can be computed once and stored in a table. This reduces the number of multiplications further. In the case of our examples, 10<sup>103</sup> would now require only four multiplications, and 10<sup>-73</sup> would require only two.</p>
<h2>The Approximation Calculation in strtod()</h2>
<p>The approximation calculation in strtod() is based on the calculations I just described, but there are a few differences. The significant digits and power of ten are not built in two variables; one variable is initialized with the significant digits, and then the power of ten is factored in incrementally. This allows for a two-phased calculation: the first phase is a <a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">fast-path</a> like calculation that handles any component power of ten between 10<sup>-15</sup> and 10<sup>15</sup>; the second phase uses binary exponentiation to finish the computation of the power of ten, for component binary powers 10<sup>-16</sup> and smaller and 10<sup>16</sup> and greater. </p>
<p>In the first phase of the calculation, only positive powers of ten, 10<sup>1</sup> to 10<sup>15</sup>, are used. When negative powers of ten 10<sup>-1</sup> to 10<sup>-15</sup> are called for, the calculation is changed to the equivalent calculation that <em>divides</em> the significant digits by the corresponding <em>positive</em> power of ten. A separate table contains each of the powers of ten from 10<sup>1</sup> to 10<sup>15</sup>, which are looked up directly. The one division or multiplication replaces the up to three multiplications that might be needed for binary exponentiation (10<sup>1</sup> * 10<sup>2</sup> * 10<sup>4</sup> * 10<sup>8</sup>). This not only is potentially more efficient, but reduces the number of possibly inexact operations (operations in which the result must be rounded).</p>
<p>Here are two examples: 163.118762e+109, which is parsed as 163118762 x 10<sup>103</sup>, is calculated as 163118762 * 10<sup>7</sup> * 10<sup>32</sup> * 10<sup>64</sup>; 8.453127e-67, which is parsed as 8453127 x 10<sup>-73</sup>, is calculated as 8453127/10<sup>9</sup> * 10<sup>-64</sup>.</p>
<h3>The Calculation In Code</h3>
<p>I&#8217;ve taken <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">strtod()</a>&#8216;s approximation code and distilled it into my own C program. The main difference between my code and strtod() is that strtod() handles overflow (for values near DBL_MAX) and underflow (for values that are near DBL_MIN or are subnormal). I&#8217;ve also renamed a few variables and made a few other small changes to make the code more readable.</p>
<pre>
double strtod_approx(char* decimal)
{
 int exponent, sign, i;
 long long sigDigits16;

 static const double fastPosTens[] = {
   1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
   1e10, 1e11, 1e12, 1e13, 1e14, 1e15};
 static const double binaryPosTens[] = {
   1e16, 1e32, 1e64, 1e128, 1e256};
 static const double binaryNegTens[] = {
   1e-16, 1e-32, 1e-64, 1e-128, 1e-256};

 double d;

 <span class="grayed">parseDecimalApprox(decimal,&amp;sign,&amp;sigDigits16,&amp;exponent);</span>

 <strong>d = sigDigits16;</strong>
 if (sign)
   d = -d;

 if (exponent &gt; 0)
   {
    if (i = exponent &amp; 0xF)
      <strong>d *= fastPosTens[i];</strong>
    if (exponent &gt;&gt;= 4)
      for(i = 0; exponent &gt; 0; i++, exponent &gt;&gt;= 1)
        if (exponent &amp; 1)
          <strong>d *= binaryPosTens[i];</strong>
   }
 else
   if (exponent &lt; 0)
     {
      exponent = -exponent;
      if (i = exponent &amp; 0xF)
        <strong>d /= fastPosTens[i];</strong>
      if (exponent &gt;&gt;= 4)
        for(i = 0; exponent &gt; 0; i++, exponent &gt;&gt;= 1)
          if (exponent &amp; 1)
            <strong>d *= binaryNegTens[i];</strong>
     }

 return d;
}
</pre>
<h4>Notes</h4>
<ul>
<li>This function calls an auxiliary function I wrote called parseDecimalApprox(), which I&#8217;ve not shown. It returns the sign of the number, the value of its first 16 significant digits, and its power of ten exponent.</li>
<li>There are three tables of powers of ten, all precomputed at compile time:
<ul>
<li><strong>fastPosTens</strong>: This holds the 15 <a title="Read Rick Regan's article &ldquo;Why Powers of Ten Up to 10^22 Are Exact As Doubles&rdquo;" href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">exactly representable powers of ten</a> needed for the first phase of the calculation (1e0 is not used &#8212; it&#8217;s just a placeholder).</li>
<li><strong>binaryPosTens</strong>: This holds the five positive powers of ten used for binary exponentiation.</li>
<li><strong>binaryNegTens</strong>: This holds the five negative powers of ten used for binary exponentiation.</li>
</ul>
</li>
<li>If the exponent is zero, no power of ten is factored in (it would be 10<sup>0</sup> = 1).</li>
</ul>
<p>Realized in code, you can see why the algorithm to compute the power of ten is called right-to-left binary exponentiation. Converting the exponent to a sum of powers of two is the same as converting it to binary; in code, the exponent is a binary integer. The bits of the exponent are accessed from right to left &#8212; least significant to most significant &#8212; using the right shift operator; a 1-bit means the corresponding power of two is part of the exponent. The exponent will be a maximum of 9 bits long.</p>
<h3>Analysis</h3>
<p>There are at most five floating-point multiplications and divisions in this calculation: three multiplications plus one multiplication or division for the worst case powers of ten, and one multiplication to combine the significant digits and the power of ten. If the significant digits start out inexact &#8212; either because they are truncated to 16 digits, or they are 16 digits (truncated or otherwise) but represent a value greater than 2<sup>53</sup> &#8211; 1 &#8212; that&#8217;s a maximum of six inexact operations.</p>
<p>Using my code above, I measured approximations as far as 10 ULPs off; 1.00431469722921494e-140 is one example. (This is consistent with my <a title="Read Rick Regan's article &ldquo;Properties of the Correction Loop in David Gay’s strtod()&rdquo;" href="http://www.exploringbinary.com/properties-of-the-correction-loop-in-david-gays-strtod/">analysis of David Gay&#8217;s code</a>, where the maximum difference I found was 11 ULPs. Why the difference? It could be due to the code that handles underflow and overflow, or it could be because I didn&#8217;t run the same exact tests in both cases.)</p>
<p>Contrast this with my <a title="Read Rick Regan's article &ldquo;Quick and Dirty Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">quick and dirty decimal to floating-point conversion program</a>, which does a multiplication or division for every digit. Using that program, I found an example that was 14 ULPs off.</p>
<h2>Discussion</h2>
<h3>A Program That Converts Decimal Values Uses Decimal Values?</h3>
<p>The code converts decimal numbers to floating-point, and yet it itself needs decimal numbers converted to floating-point &#8212; isn&#8217;t that circular? Not really. The compiler&#8217;s implementation would have to be different, or at least modified so that the decimal constants are specified in some other way &#8212; say as <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constants</a>.</p>
<p>This raises another issue: what if the <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">compiler converts these decimal literals incorrectly</a>? After all, this is part of a larger conversion routine whose goal is correct conversion. My guess is that it does not matter. The worst that can happen is that the approximation will be a little less accurate; it will ultimately be corrected &#8212; by the <a title="Read Rick Regan's Article &ldquo;Using Integers to Check a Floating-Point Approximation&rdquo;" href="http://www.exploringbinary.com/using-integers-to-check-a-floating-point-approximation/">arbitrary-precision based reconciliation process in strtod()</a>. (For what it&#8217;s worth, I checked &#8212; Visual C++ got all of the conversions right.)</p>
<h3>Java’s FloatingDecimal Class</h3>
<p>The same approximation calculation is done in the doubleValue() method of Java’s FloatingDecimal Class, which is modeled on David Gay’s strtod().</p>
<h3>Using Division For the Negative Exponent Case</h3>
<p>For the negative exponent case of the binary exponentiation calculation, I tried using division (by positive powers of ten) instead of multiplication (by negative powers of ten). The division-based calculation ran a little slower, which was expected. But interestingly, the approximations were a little less accurate, on average. I don&#8217;t know why.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/">strtod()’s Initial Decimal to Floating-Point Approximation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/strtod-initial-decimal-to-floating-point-approximation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fast Path Decimal to Floating-Point Conversion</title>
		<link>http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/</link>
		<comments>http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/#comments</comments>
		<pubDate>Thu, 01 Sep 2011 13:18:08 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary integers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Fractions]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=344</guid>
		<description><![CDATA[In general, to convert an arbitrary decimal number into a binary floating-point number, arbitrary-precision arithmetic is required. However, a subset of decimal numbers can be converted correctly with just ordinary limited-precision IEEE floating-point arithmetic, taking what I call the fast path to conversion. Fast path conversion is an optimization used in practice: it&#8217;s in David [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">Fast Path Decimal to Floating-Point Conversion</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In general, to convert an arbitrary decimal number into a binary floating-point number, <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">arbitrary-precision arithmetic is required</a>. However, a subset of decimal numbers can be converted correctly with just ordinary limited-precision IEEE floating-point arithmetic, taking what I call the <em>fast path</em> to conversion. Fast path conversion is an optimization used in practice: it&#8217;s in David Gay&#8217;s strtod() function and in Java&#8217;s FloatingDecimal class. I will explain how fast path conversion works, and describe the set of numbers that qualify for it.</p>
<p><span id="more-344"></span></p>
<h2>An Unnormalized Form of Decimal Scientific Notation</h2>
<p>Any number can be expressed in <a title="Wikipedia article on scientific notation" href="http://en.wikipedia.org/wiki/Scientific_notation">decimal scientific notation</a>, a form where it is written as a product of its significant digits and a power of ten. Conventionally, the significant digits are expressed in normalized form, where the first digit is between 1 and 9, and any subsequent digits follow a decimal point. For example, 0.0926 would be written as 9.26 x 10<sup>-2</sup>, and 7390 would be written as 7.390 x 10<sup>3</sup>.</p>
<p>In the context of decimal to floating-point conversion, a nonstandard, <em>unnormalized</em> form of scientific notation is used, where the significant digits are written as an integer. For example, 0.0926 would be written as 926 x 10<sup>-4</sup>, and 7390 would be written as 739 x 10<sup>1</sup>.</p>
<p>In this unnormalized form, there are two cases: one where the integer containing the significant digits is multiplied by a nonnegative power of ten, and one where it&#8217;s multiplied by a <em>negative</em> power of ten. It&#8217;s convenient to think of the latter case as a fraction, where the significant digits integer is <em>divided</em> by the corresponding <em>positive</em> power of ten. For example, 926 x 10<sup>-4</sup> equals 926/10<sup>4</sup>. These two cases &#8212; multiplication or division of an integer by a power of ten that is an integer &#8212; form the basis of fast path conversion.</p>
<h2>The Fast Path</h2>
<p>For the fast path conversion process, we take a decimal number, represented as a string, and convert it to the nearest IEEE double-precision binary floating-point number. You can view this process as having two main steps. In the first step, floating-point integers <em>s</em> and <em>p</em> are created: <em>s</em> represents the significant digits of the decimal number (positive or negative), and <em>p</em> represents the corresponding nonnegative power of ten. In the second step, an IEEE floating-point multiplication or division (the fmul or fdiv instruction on Intel processors) operates on <em>s</em> and <em>p</em>, producing the desired double-precision result.</p>
<p>This process is guaranteed to work only when <em>s</em> and <em>p</em> are exactly representable in double-precision; this happens when their binary representations have no 1 bits beyond bit 53. The values of <em>p</em> that meet this criteria are <a title="Read Rick Regan's article &ldquo;Why Powers of Ten Up to 10^22 Are Exact As Doubles&rdquo;" href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">10<sup>0</sup> &le; <em>p</em> &le; 10<sup>22</sup></a>. The values of <em>s</em> that meet this criteria are, by definition, all floating-point integers. (Think of it this way: Take any floating-point integer and print all the digits of its decimal equivalent &#8212; that would be a valid value for <em>s</em>. For example, 52882540679000003517910258026656900519026609372478984461456769024, which is the floating-point integer <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">0&#120;</a>1.0119c58f25bc6p+215.)</p>
<p>Practically speaking, there is no simple test to check all the valid values of <em>s</em>. But there is a simple test that identifies a <em>subset</em> of its values, <strong>0 &le; <em>s</em> &le; 2<sup>53</sup> &#8211; 1 = 9,007,199,254,740,991</strong>. I will consider only these 2<sup>53</sup> values as fast path cases.</p>
<p>This process relies implicitly on two things &#8212; that base conversion between integers is exact, and that IEEE multiplication and division produce correctly rounded results. Under these conditions, a maximum of one rounding is done; the result is a correctly rounded decimal to floating-point conversion.</p>
<p>Regarding the power of ten, it can be looked up in a table, or it can be computed. If it&#8217;s computed &#8212; say naively by repeated multiplication by ten &#8212; it will be done exactly. Also, if <em>p</em> = 1, a further optimization applies &#8212; the multiplication can be skipped.</p>
<h3>Examples</h3>
<p>Here are some examples that qualify for fast path conversion:</p>
<table class="center" border="1" summary="Example Fast Path Cases">
<caption><strong>Example Fast Path Cases</strong></caption>
<tbody>
<tr>
<th class="center">Decimal Literal</th>
<th class="center"><em>s</em></th>
<th class="center"><em>p</em></th>
<th class="center">Mult or Div?</th>
</tr>
<tr>
<td class="left">3.14159</td>
<td class="right">314159</td>
<td class="right">100000</td>
<td class="center">Div</td>
</tr>
<tr>
<td class="left">0.0001256789876643</td>
<td class="right">1256789876643</td>
<td class="right">10000000000000000</td>
<td class="center">Div</td>
</tr>
<tr>
<td class="left">9.11234e-17</td>
<td class="right">911234</td>
<td class="right">10000000000000000000000</td>
<td class="center">Div</td>
</tr>
<tr>
<td class="left">537.81e8</td>
<td class="right">53781</td>
<td class="right">1000000</td>
<td class="center">Mult</td>
</tr>
<tr>
<td class="left">9.007199254740991e37</td>
<td class="right">9007199254740991</td>
<td class="right">10000000000000000000000</td>
<td class="center">Mult</td>
</tr>
<tr>
<td class="left">299792458</td>
<td class="right">299792458</td>
<td class="right">1</td>
<td class="center">NA</td>
</tr>
<tr>
<td class="left">0</td>
<td class="right">0</td>
<td class="right">1</td>
<td class="center">NA</td>
</tr>
</tbody>
</table>
<h3>Fast Path Cases In Disguise</h3>
<p>There is another set of decimal numbers that can be transformed easily into fast path multiplication cases: numbers with a <em>p</em> greater than 10<sup>22</sup> but with an <em>s</em> that can be multiplied by those excess powers of ten and still remain exact. For example, take the number 123e34. As written, it has <em>s</em> = 123 and <em>p</em> = 10<sup>34</sup>; <em>p</em> is too big. However, if you treat it as 123000000000000e22, it qualifies as a fast path case. You transform it by multiplying <em>s</em> by 10<sup>12</sup> and by using a <em>p</em> that&#8217;s 12 orders of magnitude smaller. In other words, if you can transfer the extra powers of ten to the significant digits, you have a fast path case.</p>
<h2>The Fast Path in David Gay&#8217;s strtod()</h2>
<p><a title="David Gay's strtod() function in dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod()</a> function implements a fast path for significant digits 0 &le; <em>s</em> &le; 10<sup>15</sup> &#8211; 1 = 999,999,999,999,999 and powers of ten 10<sup>0</sup> &le; <em>p</em> &le; 10<sup>22</sup> (the powers of ten are precomputed and stored in a table). It also handles the &ldquo;disguised&rdquo; fast path cases, up to the same 15-digit limit.</p>
<p>Why is strtod()&#8217;s fast path limited to 15 digits? My guess is because that was the easiest check to make; all integers of 15 digits or less are exact. But what about the remaining fast path cases? At a minimum, we could easily allow any <em>s</em> up to 8,999,999,999,999,999 &#8212; an eightfold increase in the number of fast path cases. All we&#8217;d need to do is add a check for &ldquo;number of digits = 16 and first digit &le; 8.&rdquo;</p>
<p>To cover the remaining cases, we could convert the entire 16-digit string to a double and then check whether its value is less than or equal to 2<sup>53</sup> &#8211; 1. The code currently converts the significant digits to a double anyhow, so why not use it for an additional fast path test? In fact, why not check the double value exclusively, skipping testing of the string altogether?</p>
<p>(<strike>I have a note out to Dave Gay asking him these questions; I&#8217;ll update this article if he responds.</strike> <strong><em>Update</em></strong>: Dave Gay agrees that the change to check for 16 digits/first digit &le; 8 would work and would be simple enough, but he&#8217;s not sure if it is worth making.)</p>
<h2>The Fast Path in Java&#8217;s FloatingDecimal Class</h2>
<p>Java does its decimal to floating-point conversion in the doubleValue() method of its <a title="FloatingDecimal.java" href="http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html">FloatingDecimal</a> class. It is modeled on David Gay&#8217;s strtod(), and covers the exact set of fast path cases.</p>
<h2>Discussion</h2>
<h3>Fast Path Vs. the Arbitrary-Precision Integer Algorithm</h3>
<p>Fast path conversion is simple because most of the work is done in IEEE floating-point. Contrast this with the <a title="Read Rick Regan's article &ldquo;Correct Decimal To Floating-Point Using Big Integers&rdquo;" href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">arbitrary-precision integer conversion algorithm</a>, which</p>
<ul>
<li>Scales the fraction to make rounding of the significand easier.</li>
<li>Normalizes the result.</li>
<li>Encodes the result in IEEE floating-point format.</li>
</ul>
<p>In IEEE floating-point, scaling is unnecessary; correct rounding is handled automatically. Also, there is no need for explicit normalization, nor any need to encode the result &#8212; it&#8217;s already encoded! Additionally, IEEE floating-point handles the edge case, where the significand would otherwise round up to 2<sup>53</sup>.</p>
<p>The fast path reduces decimal to floating-point conversion to this: parsing of the decimal string, creation of two binary floating-point integers to represent it, and delegation of the hard work to a single floating-point instruction.</p>
<h3>Processor Rounding Mode</h3>
<p>Because the fast path algorithm uses IEEE arithmetic, it will be sensitive to the processor rounding mode. Typically, the rounding mode is set to round-to-nearest, but it could be set to one of three other modes. A conversion routine must ensure that both its <a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/#rounding-mode-detection-in-david-gays-strtod">fast path and arbitrary-precision path round the same way</a>.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">Fast Path Decimal to Floating-Point Conversion</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why Powers of Ten Up to 1022 Are Exact As Doubles</title>
		<link>http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/</link>
		<comments>http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/#comments</comments>
		<pubDate>Fri, 12 Aug 2011 02:30:14 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=342</guid>
		<description><![CDATA[The fast path in David Gay&#8217;s decimal to floating-point conversion routine relies on this property of the first twenty-three nonnegative powers of ten: they have exact representations in double-precision floating-point. While it&#8217;s easy to see why powers of ten up to 1015 are exact, it&#8217;s less clear why the powers of ten from 1016 to [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">Why Powers of Ten Up to 10<sup>22</sup> Are Exact As Doubles</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">The fast path in David Gay&#8217;s decimal to floating-point conversion routine</a> relies on this property of the first twenty-three nonnegative powers of ten: they have exact representations in double-precision floating-point. While it&#8217;s easy to see why powers of ten up to 10<sup>15</sup> are exact, it&#8217;s less clear why the powers of ten from 10<sup>16</sup> to 10<sup>22</sup> are. To see why, you have to look at their binary representations.</p>
<p><span id="more-342"></span></p>
<h2>Trailing Zeros Become &ldquo;Significantly Insignificant&rdquo;</h2>
<p>10<sup>15</sup> is the largest power of ten less than 2<sup>53</sup> &#8211; 1, the largest integer representable in 53 bits. 10<sup>15</sup> has 50 bits, and the next larger power of ten, 10<sup>16</sup>, has 54 bits. So how can 10<sup>16</sup> &#8212; and the 74-bit 10<sup>22</sup> for that matter &#8212; have an exact double-precision representation, when doubles only store a 53-bit significand? The answer is simple: it has a binary representation with trailing zeros after bit 53.</p>
<p>A nonnegative power of ten 10<sup>n</sup> equals 5<sup>n</sup> * 2<sup>n</sup>; <a title="Read Rick Regan's article &ldquo;A Pattern in Powers of Ten and Their Binary Equivalents&rdquo;" href="http://www.exploringbinary.com/a-pattern-in-powers-of-ten-and-their-binary-equivalents/">in binary,  that&#8217;s a power of five followed by n zeros</a> (a power of five always ends in &lsquo;1&rsquo;). For example, <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">10<sup>22</sup> (10000000000000000000000) is</a> </p>
<pre class="small  noborder">
10000111100001100111100000110010011011101010110010010000000000000000000000
</pre>
<p>This is 5<sup>22</sup> (2384185791015625), which is 52 bits long, followed by 22 zeros. Here it is in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>, showing only its leading nonzero significant bits &#8212; the bits of its power of five factor:</p>
<p>1.000011110000110011110000011001001101110101011001001 x 2<sup>73</sup></p>
<p>Since this is 52 bits, it maps directly to a double &#8212; no rounding is required.</p>
<p>So what happened to the 22 significant trailing zeros? They were preserved in the exponent, as the factor 2<sup>22</sup>. (The remaining factor, 2<sup>51</sup>, undoes the normalization.) </p>
<h3>Why 10<sup>22</sup> Is the Maximum</h3>
<p>To see why 10<sup>22</sup> is the maximum exact power of ten, look at 10<sup>23</sup>. Its power of five factor is 5<sup>23</sup> (11920928955078125), which is 54 bits long; this is too long for exact representation in a double.</p>
<h2>10<sup>16</sup> to 10<sup>22</sup></h2>
<p>Here are the powers of ten from 10<sup>16</sup> to 10<sup>22</sup>, shown in binary and in binary scientific notation (in the binary representation, bits 54 and beyond are highlighted):</p>
<table border="1" summary="10^16 in Binary">
<tbody>
<tr>
<th class="center">10<sup>16</sup></th>
</tr>
<tr>
<td class="left small">10001110000110111100100110111111000001000000000000000<span class="highlight_gray4">0</span></td>
</tr>
<tr>
<td class="left small">1.0001110000110111100100110111111000001 x 2<sup>53</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^17 in Binary">
<tbody>
<tr>
<th class="center">10<sup>17</sup></th>
</tr>
<tr>
<td class="left small">10110001101000101011110000101110110001010000000000000<span class="highlight_gray4">0000</span></td>
</tr>
<tr>
<td class="left small">1.011000110100010101111000010111011000101 x 2<sup>56</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^18 in Binary">
<tbody>
<tr>
<th class="center">10<sup>18</sup></th>
</tr>
<tr>
<td class="left small">11011110000010110110101100111010011101100100000000000<span class="highlight_gray4">0000000</span></td>
</tr>
<tr>
<td class="left small">1.10111100000101101101011001110100111011001 x 2<sup>59</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^19 in Binary">
<tbody>
<tr>
<th class="center">10<sup>19</sup></th>
</tr>
<tr>
<td class="left small">10001010110001110010001100000100100010011110100000000<span class="highlight_gray4">00000000000</span></td>
</tr>
<tr>
<td class="left small">1.00010101100011100100011000001001000100111101 x 2<sup>63</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^20 in Binary">
<tbody>
<tr>
<th class="center">10<sup>20</sup></th>
</tr>
<tr>
<td class="left small">10101101011110001110101111000101101011000110001000000<span class="highlight_gray4">00000000000000</span></td>
</tr>
<tr>
<td class="left small">1.0101101011110001110101111000101101011000110001 x 2<sup>66</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^21 in Binary">
<tbody>
<tr>
<th class="center">10<sup>21</sup></th>
</tr>
<tr>
<td class="left small">11011000110101110010011010110111000101110111101010000<span class="highlight_gray4">00000000000000000</span></td>
</tr>
<tr>
<td class="left small">1.101100011010111001001101011011100010111011110101 x 2<sup>69</sup></td>
</tr>
</tbody>
</table>
<table border="1" summary="10^22 in Binary">
<tbody>
<tr>
<th class="center">10<sup>22</sup></th>
</tr>
<tr>
<td class="left small">10000111100001100111100000110010011011101010110010010<span class="highlight_gray4">000000000000000000000</span></td>
</tr>
<tr>
<td class="left small">1.000011110000110011110000011001001101110101011001001 x 2<sup>73</sup></td>
</tr>
</tbody>
</table>
<h3>On Binary Scientific Notation for Exact Integers</h3>
<p>In decimal scientific notation, when trailing zeros are significant, they are displayed. For example, if you know a value to be exactly 300,000, you would write it as 3.00000 x 10<sup>5</sup>. In binary scientific notation, however, this convention is not followed. I don&#8217;t know if there&#8217;s an official explanation, but I have my own. Binary scientific notation is used to represent binary floating-point numbers, and binary floating-point numbers are, in general, approximations to decimal numbers. Displaying trailing zeros, which could be the result of rounding, might imply an exactness that&#8217;s not there. Also, binary floating-point numbers have a limited length significand; keeping the number of significant bits displayed within this limit makes the mapping clearer.</p>
<h2>Other Large Numbers With Exact Representations</h2>
<p>For the same reason, other large numbers &#8212; like 14<sup>18</sup> (greater than 10<sup>20</sup>), 6<sup>33</sup> (greater than 10<sup>25</sup>), and <a title="Read Rick Regan's article &ldquo;A Simple C Program That Prints 2,098 Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-simple-c-program-that-prints-2098-powers-of-two/">2<sup>1023</sup></a> (greater than 10<sup>307</sup>) &#8212; are exact as doubles. In fact, any integer with a power of two factor that takes it beyond 53 bits can be represented exactly (as long as the maximum exponent isn&#8217;t exceeded). In this case, <strong>a double can represent more than 53 significant bits!</strong></p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/">Why Powers of Ten Up to 10<sup>22</sup> Are Exact As Doubles</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/why-powers-of-ten-up-to-10-to-the-22-are-exact-as-doubles/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Correct Decimal To Floating-Point Using Big Integers</title>
		<link>http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/</link>
		<comments>http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/#comments</comments>
		<pubDate>Thu, 04 Aug 2011 02:44:19 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary integers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>
		<category><![CDATA[Fractions]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=340</guid>
		<description><![CDATA[Producing correctly rounded decimal to floating-point conversions is hard, but only because it is made to be done efficiently. There is a simple algorithm that produces correct conversions, but it&#8217;s too slow &#8212; it&#8217;s based entirely on arbitrary-precision integer arithmetic. Nonetheless, you should know this algorithm, because it will help you understand the highly-optimized conversion [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">Correct Decimal To Floating-Point Using Big Integers</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.exploringbinary.com/topics/#correctly-rounded-decimal-to-floating-point" title="Read Articles from Rick Regan's Topics Page: &ldquo;Correctly Rounded Decimal to Floating-Point Conversion&rdquo;">Producing correctly rounded decimal to floating-point conversions is hard</a>, but only because it is made to be done efficiently. There is a simple algorithm that produces correct conversions, but it&#8217;s too slow &#8212; it&#8217;s based entirely on arbitrary-precision integer arithmetic. Nonetheless, you should know this algorithm, because it will help you understand the highly-optimized conversion routines used in practice, like <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>. I will outline the algorithm, which is easily implemented in a language like C, using a &#8220;big integer&#8221; library like <a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">GMP</a>.</p>
<div class="wp-caption aligncenter" style="width: 535px"><img src="http://www.exploringbinary.com/wp-content/uploads/BigIntegers.1eminus20.png" alt="Ratio of Big Integers (2^119/10^20) Producing the 53-Bit Significand of 1e-20" width="525" height="107"/><p class="wp-caption-text">Ratio of Big Integers (2<sup>119</sup>/10<sup>20</sup>) Producing the 53-Bit Significand of 1e-20</p></div>
<p><span id="more-340"></span></p>
<h2>Arbitrary-Precision is Needed</h2>
<p>Our task is to write a computer program that uses <em><strong>binary arithmetic</strong></em> to convert a decimal number &#8212; represented as a character string in standard or scientific notation &#8212; into an <a title="Wikipedia article on IEEE double-precision format" href="http://en.wikipedia.org/wiki/Double_precision">IEEE double-precision binary floating-point number</a>. What makes this task difficult is that a given decimal number may not have a double-precision equivalent; an approximation &#8212; the double-precision number <em>closest</em> to it &#8212; must be used in its place.</p>
<p>An IEEE double-precision binary floating-point number, or <em>double</em>, represents a number in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>, with a 53-bit significand (subnormal numbers have less than 53 bits, but I won&#8217;t be discussing them) and an 11-bit power of two exponent. The limited precision of the significand is what makes approximation necessary. For example, a decimal fraction like 0.1 has an infinite binary representation, so it must be rounded to 53 significant bits. A large integer like 9007199254740997 has a binary representation that, although finite, is more than 53 bits long; it must also be rounded to 53 bits.</p>
<p>To round correctly, in general, you need to compute the approximation using more than 53 bits of precision; <em>arbitrary-precision</em> is required, an amount which depends on the input. An arbitrary-precision floating point or arbitrary-precision integer based algorithm could be used, but the algorithm I&#8217;ll discuss uses arbitrary-precision integers &#8212; AKA <em>big integers</em>. This makes for a more elegant solution.</p>
<h2>Big Integer Based Algorithm</h2>
<p>The big integer based algorithm I will show you is essentially <em>AlgorithmM</em>, as described in William D. Clinger&#8217;s paper &ldquo;<a title="Read William D. Clinger's paper &ldquo;How to Read Floating Point Numbers Accurately&rdquo;" href="http://www.cesura17.net/~will/Professional/Research/Papers/howtoread.pdf">How to Read Floating Point Numbers Accurately</a>.&rdquo; He presents it in the programming language Scheme, but I will present it to you in words, and by example. (Like Clinger, I will deal only with positive, normal numbers.)</p>
<h3>53-Bit Integer Significands</h3>
<p>The significand of a normalized double-precision floating-point number is 53 bits, with its most significant bit equal to 1. For the purposes of our conversion algorithm, we&#8217;ll think of it as a 53-bit integer; that is, an integer between 2<sup>52</sup> (4503599627370496) and 2<sup>53</sup> &#8211; 1 (9007199254740991).</p>
<h3>Overview</h3>
<p>The idea of the algorithm is to represent the decimal number as a fraction, one that produces a 53-bit integer quotient and an arbitrary-precision integer remainder. The significand is the quotient, rounded up or down according to the remainder; the power of two exponent is derived from the <a title="Wikipedia article on binary scaling" href="http://en.wikipedia.org/wiki/Binary_scaling">binary scale factor</a> used to produce the fraction.</p>
<p>There are two basic cases the algorithm must handle: one where the fraction needs to be <em>scaled up</em> to produce a 53-bit quotient, and one where the fraction needs to be <em>scaled down</em> to produce a 53-bit quotient. I will demonstrate the algorithm with an example of each case.</p>
<h3>Converting 3.14159 to Double-Precision</h3>
<p>Here&#8217;s how to convert the decimal literal 3.14159 to double-precision floating-point:</p>
<ol>
<li><strong>Express as an integer times a power of ten</strong>
<p>3.14159 = 314159 x 10<sup>-5</sup></p>
<p><em>In code</em>: This step is done as the decimal string is parsed. The significant digits are collected (sans the decimal point) and then converted to binary as a big integer. The exponent of the power of ten is determined based on the correct placement of the decimal point. (For strings expressed in decimal scientific notation, the exponent must be converted to binary, and then adjusted for the decimal point accordingly.)</p>
</li>
<li><strong>Express as a fraction</strong>
<p>The negative power of ten, 10<sup>-5</sup>, makes this a fraction, with 314159 as the numerator and 10<sup>5</sup> = 100000 as the denominator:</p>
<p>314159/100000</p>
<p><em>In code</em>: The positive power of ten is computed as a big integer,  raising ten to the absolute value of the exponent.</p>
</li>
<li><strong>Scale into the range [2<sup>52</sup>,2<sup>53</sup>)</strong>
<p>314159/100000 is a number <em>r</em>, 2<sup>1</sup> &le; <em>r</em> &lt; 2<sup>2</sup>. We need to scale it <em>up</em>, so that we have a scaled value <em>r&#8217;</em> in the range 2<sup>52</sup> &le; <em>r&#8217;</em> &lt; 2<sup>53</sup>; we do this by multiplying the numerator by 2<sup>51</sup>:</p>
<p>314159 x 2<sup>51</sup>/100000 = 707423177667543826432/100000</p>
<p>(We scale the numerator because we need the scaling done <em>before</em> the division, not after!)</p>
<p><em>In code</em>: You could use logarithms to compute the scale factor, but you&#8217;d have to use care. You would need arbitrary precision to properly distinguish values near powers of two. A simpler approach, in the spirit of sticking with integer arithmetic, is to use a loop: multiply the numerator by two until the quotient is in range. (For conversions where the quotient starts out <em>above</em> range, multiply the <em>denominator</em> by two until the quotient is in range.)</p>
</li>
<li><strong>Divide and round</strong>
<p>707423177667543826432/100000 = 7074231776675438 remainder 26432</p>
<p>The quotient is an integer <em>q</em> in the range 2<sup>52</sup> &le; <em>q</em> &le; 2<sup>53</sup> &#8211; 1. The remainder is an integer <em>r</em> in the range 0 &le; <em>r</em> &lt; 100000, and it tells us which way to round <em>q</em> &#8212; up or down to the next integer, which we do according to IEEE 754 <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-half-to-even</a> rounding: if <em>r</em> is less than half the denominator, we round <em>q</em> down; if <em>r</em> is more than half the denominator, we round <em>q</em> up; if <em>r</em> equals half the denominator, we round <em>q</em> down if <em>q</em> is even, and round <em>q</em> up if <em>q</em> is odd. In our example, <em>r</em> is less than half the denominator, so <em>q</em> rounds down to <strong>7074231776675438</strong>; this is our significand.</p>
<p>(In the event that <em>q</em> rounds up to 2<sup>53</sup>, set it to 2<sup>52</sup> instead, and decrement the exponent of the scale factor.)</p>
<p><em>In code</em>: The remainder is easily calculated as <em>r</em> = numerator &#8211; <em>q</em> x denominator. For the &ldquo;half the denominator&rdquo; comparison, note that the denominator will either be 1 (for integers that start out in range and thus don&#8217;t need scaling) or divisible by 2: if it&#8217;s 1, there&#8217;s no remainder and no rounding; if it&#8217;s divisible by 2, then half the denominator is an integer, and the remainder can be compared to it with integer operations. For the test for evenness, just divide the quotient by two and check for a remainder of zero.</p>
</li>
<li><strong>Express in normalized binary scientific notation</strong>
<p>For this step, it&#8217;s easiest to think in terms of binary. The significand, 7074231776675438, <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">converts to</a></p>
<p>11001001000011111100111110000000110111000011001101110</p>
<p>This value is scaled, by 2<sup>51</sup>; we <strong>unscale</strong> it by multiplying it by 2<sup>-51</sup>. Actually, we don&#8217;t multiply; we just express the number in unnormalized <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a>:</p>
<p>11001001000011111100111110000000110111000011001101110 x 2<sup>-51</sup></p>
<p>To be represented in the double-precision format, the significand must be <strong>normalized</strong>; this means we must move the radix point 52 bits to the left, and multiply the resulting binary fraction by 2<sup>52</sup>. The unscaled and normalized value becomes:</p>
<p>1.100100100001111110011111000000011011100001100110111 x 2<sup>-51</sup> x 2<sup>52</sup></p>
<p>By the <a title="Read Rick Regan's Article &ldquo;Composing Powers of Two Using The Laws of Exponents&rdquo;" href="http://www.exploringbinary.com/composing-powers-of-two-using-the-laws-of-exponents/">laws of exponents</a>, 2<sup>-51</sup> x 2<sup>52</sup> = 2<sup>1</sup>, so our resulting double-precision value, in normalized binary scientific notation, is</p>
<p>1.100100100001111110011111000000011011100001100110111 x 2<sup>1</sup></p>
<p><em>In code</em>: The code for this step is trivial: all we do is compute a power of two exponent, by adding 52 to the negative of the exponent of the scale factor.</p>
</li>
<li><strong>Encode as a double</strong>
<p>Normalized binary scientific notation maps straightforwardly to the IEEE double-precision format; here are the values of the fields for our example, in decimal:</p>
<ul>
<li>The sign field is 0, since the number is positive.</li>
<li>The exponent field is 1024, which equals the power of two exponent (1) plus the exponent bias (1023).</li>
<li>The fraction field is 2570632149304942, which is 7074231776675438 &#8211; 2<sup>52</sup>, the trailing 52 bits of the significand (the leading 1 bit is implicit).</li>
</ul>
<p>Here is the encoded double, in binary:</p>
<p><span class="highlight_gray4">0</span>10000000000<span class="highlight_gray4">1001001000011111100111110000000110111000011001101110</span></p>
<p>The sign field is 0, the exponent field is 10000000000, and the fraction field is 1001001000011111100111110000000110111000011001101110.</p>
<p>This double has the exact decimal value of</p>
<p>3.14158999999999988261834005243144929409027099609375</p>
<p><em>In code</em>: In C, you can set these values directly by <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">casting the double as a 64-bit integer</a> and treating the fields as integer subfields within it.</p>
</li>
</ol>
<h3>Converting 1.2345678901234567e22 to Double-Precision</h3>
<p>Here’s how to convert the decimal literal 1.2345678901234567e22 to double-precision floating-point:</p>
<ol>
<li><strong>Express as an integer times a power of ten</strong>
<p>1.2345678901234567e22 = 12345678901234567 x 10<sup>6</sup></p>
</li>
<li><strong>Express as a fraction</strong>
<p>The positive power of ten makes this an integer, but for the purposes of our algorithm, we still want to express it as a fraction; the numerator is 12345678901234567 x 10<sup>6</sup> = 12345678901234567 x 1000000 = 12345678901234567000000, and the denominator is 1:</p>
<p>12345678901234567000000/1</p>
</li>
<li><strong>Scale into the range [2<sup>52</sup>,2<sup>53</sup>)</strong>
<p>12345678901234567000000/1 is a number <em>r</em>, 2<sup>73</sup> &le; <em>r</em> &lt; 2<sup>74</sup>. We need to scale it <em>down</em>, so that we have a scaled value <em>r&#8217;</em> in the range 2<sup>52</sup> &le; <em>r&#8217;</em> &lt; 2<sup>53</sup>; we do this by multiplying the denominator by 2<sup>21</sup>:</p>
<p>12345678901234567000000/2<sup>21</sup> = 12345678901234567000000/2097152</p>
</li>
<li><strong>Divide and round</strong>
<p>12345678901234567000000/2097152 = 5886878443352969 remainder 1355712</p>
<p>The remainder is <em>more</em> than half the denominator, so the quotient &#8212; our significand &#8212; rounds up to <strong>5886878443352970</strong>.</p>
</li>
<li><strong>Express in normalized binary scientific notation</strong>
<p>In binary, 5886878443352970 is</p>
<p>10100111010100001010110110010011100111011001110001010</p>
<p>Unscaled, it is</p>
<p>10100111010100001010110110010011100111011001110001010 x 2<sup>21</sup></p>
<p>Normalized, it is</p>
<p>1.010011101010000101011011001001110011101100111000101 x 2<sup>21</sup> x 2<sup>52</sup></p>
<p>Simplified, it is the resulting double-precision value</p>
<p>1.010011101010000101011011001001110011101100111000101 x 2<sup>73</sup></p>
</li>
<li><strong>Encode as a double</strong>
<p>Here are the values of the fields for our example, in decimal:</p>
<ul>
<li>The sign field is 0, since the number is positive.</li>
<li>The exponent field is 1096, which equals the power of two exponent (73) plus the exponent bias (1023).</li>
<li>The fraction field is 1383278815982474, which is 5886878443352970 &#8211; 2<sup>52</sup>.</li>
</ul>
<p>Here is the encoded double, in binary:</p>
<p><span class="highlight_gray4">0</span>10001001000<span class="highlight_gray4">0100111010100001010110110010011100111011001110001010</span></p>
<p>The sign field is 0, the exponent field is 10001001000, and the fraction field is 0100111010100001010110110010011100111011001110001010.</p>
<p>This double has the exact decimal value of</p>
<p>12345678901234567741440</p>
</li>
</ol>
<h2>The Integers Get Huge</h2>
<p>For very large and very small decimal numbers, the scaled fractions have very large numerators and denominators. For example, for DBL_MAX, which is defined as the decimal literal 1.7976931348623158e308, the algorithm produces this scaled fraction:</p>
<p>179769313486231580000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000000000000000000000<strong>/</strong>19958403095347198116563727130368385<br />
660674512604354575415025472424372118918689640657849579654926357010893<br />
424468441924952439724379883935936607391717982848314203200056729510856<br />
765175377214443629871826533567445439239933308104551208703888888552684<br />
480441575071209068757560416423584952303440099278848</p>
<p>(The numerator is 1.7976931348623158 x 10<sup>308</sup>; the denominator is 2<sup>971</sup>. The quotient rounds to 9007199254740991, which in binary is 53 bits of 1s.)</p>
<p>Another example is DBL_MIN, which is defined as 2.2250738585072014e-308; it results in a fraction with even larger integers.</p>
<h2>Handling Remaining Cases</h2>
<p>The algorithm as presented only deals with positive, normal numbers. If you were to implement this according to the IEEE 754 specification, you&#8217;d have to handle zero, negative numbers, subnormal numbers, underflow, and overflow:</p>
<ul>
<li>To handle zero, check for it as a special case when parsing the input string.</li>
<li>To handle negative numbers, treat the number as positive and then just negate the converted result.</li>
<li>To handle subnormal numbers, reduce the precision of the significand &#8212; to between 1 and 52 bits &#8212; when scaling to 53 bits would take the power of two exponent into the subnormal range.</li>
<li>To handle underflow, return zero when the result would be below the subnormal range.</li>
<li>To handle overflow, return &ldquo;infinity&rdquo; when the result would be bigger than DBL_MAX.</li>
</ul>
<h2>Discussion</h2>
<h3>Big Integers vs. Big Floats</h3>
<p>As I mentioned, <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">correct conversion can be done using arbitrary-precision floating point arithmetic</a> &#8212; AKA <em>big floats</em>. However, I think the big integer based algorithm is cleaner. Its main advantage comes from the &ldquo;baselessness&rdquo; of integers. You can work with them conceptually as decimal numbers, <a title="Read Rick Regan's article &ldquo;The Four Stages of Floating-Point Competence&rdquo;" href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">blissfully ignorant</a> that they are implemented in binary &#8212; conversion of integers to and from binary is exact.</p>
<p>A big advantage of &ldquo;baselessness&rdquo; is in rounding. To round correctly, there must be enough precision to ensure the accuracy of the rounding bits &#8212; significant bit 54 and an arbitrary number of bits beyond. In the integer algorithm, the required precision comes automatically, through scaling. Each conversion gets the rounding precision it needs, as reflected in the integer remainder. And the calculation is simple: the remainder is compared to one-half of the denominator.</p>
<p>In a floating-point algorithm, precision needs to be set explicitly (either on a case-by-case basis, or once, covering the worst case). Also, the calculation is not as straightforward, requiring knowledge of the absolute location of the rounding bits. For example, consider <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/#308984926168550152811">3.08984926168550152811e-32</a>, a decimal number very nearly halfway between two floating-point numbers. The calculation a floating-point algorithm must make is to compare 2<sup>-158</sup> (one-half ULP) to 2<sup>-158</sup> + 2<sup>-234</sup> (the value of the 77 required rounding bits).</p>
<h3>Sometimes Arbitrary-Precision Is Overkill</h3>
<p>The algorithm I presented is simple and works for every case, but sometimes it is overkill. For many conversions, a simple IEEE double-precision floating-point calculation does the trick. For example, 3.14159 can be converted by this line of standard C code: <em>double d = 314159.0/100000;</em>. This works because both 314159 and 100000 are exactly representable in double-precision, and a standard IEEE floating-point multiplication is guaranteed to produce a correctly rounded result. (<a title="Read Rick Regan's article &ldquo;Fast Path Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/fast-path-decimal-to-floating-point-conversion/">This is one optimization used in David Gay&#8217;s strtod()</a>.)</p>
<p>Arbitrary-precision <em>is</em> required for my second example though, since 12345678901234567 is not exactly representable in double-precision; <em>double d = 12345678901234567.0*1000000;</em> produces a result that is 1 ULP off.</p>
<p>(After publishing this I realized that the C statements above are processed by the compiler &#8212; constant folding. There&#8217;s no guarantee that the compiler will use floating-point division and multiplication instructions; it could use its own algorithms. My point is still valid though. In a real program, the two integer arguments will not be constants, so floating-point instructions will be used &#8212; at runtime.)</p>
<h3>Implemented In a C++ Program</h3>
<p>I implemented the big integer algorithm in a C++ program using GMP (<a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">MPIR</a>). I validated it by comparing its results to David Gay&#8217;s strtod(). (I have used strtod() as a benchmark against which to compare other conversion routines; by writing this program, I&#8217;ve gotten a way to confirm that strtod() in fact produces correct results.)</p>
<h2>Summary</h2>
<p>I showed you a simple algorithm that uses binary integer arithmetic to convert a decimal string to an IEEE double-precision binary floating-point number. The algorithm starts by representing a decimal number as fraction, with a numerator and denominator that are arbitrary-precision integers. The fraction is scaled by a power of two &#8212; positive or <a title="Read Rick Regan's Article &ldquo;A Table of Negative Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-table-of-negative-powers-of-two/">negative</a> &#8212; so that the integer part of the resulting quotient will contain exactly 53 bits. The 53-bit quotient is rounded to the nearest integer, unambiguously according to the value of the integer remainder. The rounded quotient &#8212; the 53 bits of the floating-point number &#8212; is unscaled and normalized by setting the power of two exponent of the floating-point number accordingly.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/">Correct Decimal To Floating-Point Using Big Integers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Binary Numbers Haiku</title>
		<link>http://www.exploringbinary.com/binary-numbers-haiku/</link>
		<comments>http://www.exploringbinary.com/binary-numbers-haiku/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 11:45:29 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Binary numbers]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=337</guid>
		<description><![CDATA[I wrote these haiku about binary numbers: Binary numbers, they are just zeros and ones. Switches &#8212; on and off! The powers of two, nonnegative, negative. Binary numbers. I type in 13. Decimal to binary. 1101. Numbers don&#8217;t add up. IEEE floating-point. Use an epsilon. (Please post your own in the comments.) By Rick Regan [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/binary-numbers-haiku/">Binary Numbers Haiku</a></p>
]]></description>
			<content:encoded><![CDATA[<p>I wrote these <a title="Merriam-Webster Definition of Haiku" href="http://www.merriam-webster.com/dictionary/haiku">haiku</a> about binary numbers:</p>
<div class="haiku">
Binary numbers,<br />
they are just zeros and ones.<br />
Switches &#8212; on and off!
</div>
<p><span id="more-337"></span></p>
<div class="haiku">
<a href="http://www.exploringbinary.com/the-powers-of-two/" title="Read Rick Regan's Article &ldquo;The Powers of Two&rdquo;">The powers of two</a>,<br />
<a title="Read Rick Regan's Article &ldquo;A Table of Nonnegative Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-table-of-nonnegative-powers-of-two/">nonnegative</a>, <a title="Read Rick Regan's Article &ldquo;A Table of Negative Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-table-of-negative-powers-of-two/">negative</a>.<br />
Binary numbers.
</div>
<div class="haiku">
I type in 13.<br />
<a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">Decimal to binary</a>.<br />
1101.
</div>
<div class="haiku">
Numbers don&#8217;t add up.<br />
IEEE floating-point.<br />
Use an epsilon.
</div>
<p>(Please post your own in the comments.)</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/binary-numbers-haiku/">Binary Numbers Haiku</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/binary-numbers-haiku/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>How I Taught Third Graders Binary Numbers</title>
		<link>http://www.exploringbinary.com/how-i-taught-third-graders-binary-numbers/</link>
		<comments>http://www.exploringbinary.com/how-i-taught-third-graders-binary-numbers/#comments</comments>
		<pubDate>Wed, 08 Jun 2011 14:51:29 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Binary numbers]]></category>
		<category><![CDATA[Binary integers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Convert to decimal]]></category>
		<category><![CDATA[Education]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=335</guid>
		<description><![CDATA[Last week I introduced my son&#8217;s third grade class to binary numbers. I wanted to build on my prior visit, where I introduced them to the powers of two. By teaching them binary, I showed them that place value is not limited to base ten, and that there is a difference between numbers and numerals. [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/how-i-taught-third-graders-binary-numbers/">How I Taught Third Graders Binary Numbers</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Last week I introduced my son&#8217;s third grade class to binary numbers. I wanted to build on my prior visit, where I <a title="Read Rick Regan's Article &ldquo;1,073,741,823 Grains of Rice&rdquo;" href="http://www.exploringbinary.com/1073741823-grains-of-rice/">introduced them to the powers of two</a>. By teaching them binary, I showed them that place value is not limited to base ten, and that there is a difference between numbers and numerals.</p>
<p>My presentation was based on base-ten-block-like imagery, since I knew the students were comfortable expressing numbers with base ten blocks. I thought extending the block model to other bases would work well. I think it did.</p>
<div class="wp-caption aligncenter" style="width: 450px"><img src="http://www.exploringbinary.com/wp-content/uploads/27.binary.whiteboard.png" alt="The Number Twenty-Seven, Broken Into Powers of Two" width="440" height="133"/><p class="wp-caption-text">The Number Twenty-Seven in Tape Flags, Broken Into Powers of Two</p></div>
<p><span id="more-335"></span></p>
<h2>Introduction</h2>
<p>Before my presentation, I put twenty-seven tape flags on the whiteboard, in an unorganized fashion like this:</p>
<div class="wp-caption aligncenter" style="width: 195px"><img src="http://www.exploringbinary.com/wp-content/uploads/27.pile.png" alt="Twenty-Seven Objects" width="185" height="253"/><p class="wp-caption-text">Twenty-Seven Objects</p></div>
<p>(I would have preferred to use magnets instead of tape flags, since they would have been easier to move and align; but I didn&#8217;t have twenty-seven identical magnets.)</p>
<p>I started my presentation by telling the class that I would teach them about something called binary numbers, but that first I would review the numbers they already know &#8212; decimal numbers (I took a moment to explain that this was not the same as &ldquo;decimals&rdquo;). The first thing we did was count the tape flags, and as we counted together I rearranged them into a line:</p>
<div class="wp-caption aligncenter" style="width: 542px"><img src="http://www.exploringbinary.com/wp-content/uploads/27.line.png" alt="Twenty-Seven Objects, Arranged In a Line" width="532" height="67"/><p class="wp-caption-text">Twenty-Seven Objects, Arranged In a Line</p></div>
<p>I asked them how they would write that number. One student came up and wrote &ldquo;27,&rdquo; which is the first answer I expected. Other suggestions were Roman numerals (&ldquo;XXVII&rdquo;) and &ldquo;twenty-seven,&rdquo; also as I anticipated. One student suggested writing it in Japanese (I was expecting a foreign language, but Spanish: &ldquo;veintisiete&rdquo;). Some students suggested arithmetic expressions, like 20 + 7. One unexpected answer was from a girl who wrote it on the board in base ten blocks, which is how I was planning to rearrange the tape flags next!</p>
<p>I suggested tally marks as another alternative, and wrote twenty-seven in tally marks on the board.</p>
<h2>Decimal</h2>
<p>I singled-out the answer &ldquo;27&rdquo; and said it is written in place value. I reviewed how the places were powers of ten. Then, as the class counted along to twenty-seven, I rearranged the flags into base ten block powers of ten groups, under headings labeled &ldquo;tens&rdquo; and &ldquo;ones&rdquo;:</p>
<div class="wp-caption aligncenter" style="width: 182px"><img src="http://www.exploringbinary.com/wp-content/uploads/27.decimal.png" alt="Twenty-Seven Objects, Broken Into Powers of Ten" width="172" height="352"/><p class="wp-caption-text">Twenty-Seven Objects, Broken Into Powers of Ten</p></div>
<p>We counted the powers of ten and wrote the totals in the blanks I drew below each grouping of blocks; we came up with the numeral &ldquo;27&rdquo;: two tens and seven ones.</p>
<h2>Quinary</h2>
<p>I told the class that place value is not limited to base ten. I said, for example, you could write any number in base five, or quinary. (I wanted to take an intermediate step to binary, which is the simplest base, having only a maximum of one instance of each power.) I had them compute the powers of five from one to 625, and I explained that these are the places in quinary. I told them we would group the flags into powers of five. I wrote three headings on the board: &ldquo;twenty-fives,&rdquo; &ldquo;fives,&rdquo; and &ldquo;ones.&rdquo;</p>
<p>I asked &ldquo;are there any twenty-fives in twenty-seven&rdquo; and they said &ldquo;yes.&rdquo; We then counted out twenty-five flags, which I removed from the decimal grouping we&#8217;d just done.  I built a block as we went, under the twenty-five label. Next I asked if there were any more twenty-fives in the flags that remained, and they quickly said &ldquo;no.&rdquo; They could also see there were no fives, and that there were only two ones left, which I moved under the ones label.</p>
<div class="wp-caption aligncenter" style="width: 337px"><img src="http://www.exploringbinary.com/wp-content/uploads/27.quinary.png" alt="Twenty-Seven Objects, Broken Into Powers of Five" width="327" height="251"/><p class="wp-caption-text">Twenty-Seven Objects, Broken Into Powers of Five</p></div>
<p>We counted the powers of five and wrote them under each grouping of blocks, coming up with the numeral &ldquo;102&rdquo;: one twenty-five, zero fives, and two ones. Some kids wanted to pronounce this as &ldquo;one-hundred and two&rdquo;, but I told them you pronounce it as &ldquo;twenty-seven,&rdquo; or &ldquo;one-zero-two base five.&rdquo;</p>
<h2>Binary</h2>
<p>Now I said let&#8217;s look at another example of place value: base two, or binary. I said it is based on <a title="Read Rick Regan's Article &ldquo;A Table of Nonnegative Powers of Two&rdquo;" href="http://www.exploringbinary.com/a-table-of-nonnegative-powers-of-two/">powers of two</a>. We computed the powers of two from one to thirty-two (my son was rattling them off to 4096 before I could cut him off <img src='http://www.exploringbinary.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> ), which they remembered from <a title="Read Rick Regan's Article &ldquo;1,073,741,823 Grains of Rice&rdquo;" href="http://www.exploringbinary.com/1073741823-grains-of-rice/">my last visit</a>.</p>
<p>We proceeded as above, except we pulled out the powers of two (from the flags in the quinary grouping): first we looked for sixteens, then eights, then fours, then twos, and then ones.</p>
<div class="wp-caption aligncenter" style="width: 380px"><img src="http://www.exploringbinary.com/wp-content/uploads/27.binary.png" alt="Twenty-Seven Objects, Broken Into Powers of Two" width="370" height="206"/><p class="wp-caption-text">Twenty-Seven Objects, Broken Into Powers of Two</p></div>
<p>We counted the powers of two and wrote them under each label, coming up with the numeral “11011”: one sixteen, one eight, zero fours, one two, and one one.</p>
<h2>Number of Digits</h2>
<p>When I was done with the tape flag examples, I took a moment to explain that base ten has ten digits, base five has five digits, and base two has two digits. As an example, I said that in base ten you could never have a 10 in any place, because that would be the same as a 1 in the next higher place. Similarly for base two, a 2 in a place would equal the next higher power of two, which also would be the same as a 1 in the next higher place.</p>
<h2>Other Bases</h2>
<p>I told the class that you could write any whole number in any base. One kid asked if I could do it in a base that was greater than ten (I forget which base he used as an example). I said any number could be the base, but you&#8217;d have to have enough symbols. I briefly explained why you wouldn&#8217;t want a multi-digit number in a place (it would make the numeral ambiguous). I mentioned base sixteen, and said it uses the letters A through F for the values ten through fifteen. (I did not intend to get into hexadecimal, but hey, I wanted to answer the question!)</p>
<h2>Students as Binary Numbers</h2>
<p>At the front of the classroom, just below the whiteboard, I arranged five chairs, facing the class. I wrote the names of the binary places above the chairs, left to right from the class&#8217;s point of view: sixteens, eights, fours, twos, ones. I got five volunteers to come up, and said that I would turn them into a binary number. I said if I told them to sit in their chair, they would count as a 0; if I told them to stand in front of their chair, they would count as a 1.</p>
<p>For my first example, I put the students in the pattern 11011, which the class correctly read as twenty-seven (they added the place values above the chairs of the standing students &#8212; that or they read the numerals I had left on the board under the tape flags <img src='http://www.exploringbinary.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> ). I did a few other examples like this, which amounted to binary to decimal conversion. They got them all right.</p>
<p>Next I did what amounted to decimal to binary conversion, asking the class how to arrange the volunteers to represent a given number. For example, when I said &ldquo;nine,&rdquo; they called out instructions to make the volunteers stand and sit to make the pattern 1001. They got all of these examples correct as well.</p>
<h2>Binary Counting</h2>
<p>The above discussion took about twenty-five minutes, so with the extra five minutes I squeezed in a demonstration of a binary counter. I took a new set of five volunteers and had the class direct them through the sequence zero to thirty-one. We got through the count, but I think a few students got lost as some of the faster adders called out instructions. In any case, there were definitely some who understood the process, enough to know that when I asked them to display thirty-two, they said we would need another volunteer.</p>
<p>If I had more time, I would have done the count a second time, with the volunteers driving the counting; I came up with this scheme after I left the class:</p>
<ul>
<li>All volunteers start out sitting, representing zero.</li>
<li>Whenever I say &ldquo;count&rdquo;
<ul>
<li>The ones place volunteer does the opposite of what she is currently doing: if she&#8217;s sitting, she stands; if she&#8217;s standing, she sits.</li>
<li>For everyone not in the ones place, if the kid to your left sits, you do the opposite of what you&#8217;re currently doing.</li>
</ul>
</li>
</ul>
<p>I think this would have made the counting easier and more fun.</p>
<h2>Binary Fractions</h2>
<p>I mentioned briefly that there is an equivalent of decimals in binary numbers. Instead of the tenths, hundredths, etc. places there are the halves, quarters, eighths, etc. places.</p>
<h2>Discussion</h2>
<p>I think most of the kids understood the presentation; certainly, they were all engaged. I&#8217;d like to think it gave them a better understanding of decimal, even if they didn&#8217;t understand the details of binary. I told them &ldquo;you may not understand this now, but when you see it again someday, you&#8217;ll remember back to this day in third grade and it will come to you.&rdquo; Someone then asked what grade they teach this in. I said it&#8217;s not really part of any particular math class (as far as I know) but that they would be taught it in a high-school computer class if they took one.</p>
<p>I used number words when I wanted to avoid decimal numerals; for example, when describing a number or when labeling places. Unfortunately, number words have decimal place value built-in, but that&#8217;s the closest I know how to get to a base-independent description of a number. That said, I don&#8217;t think the class recognized this, so I don&#8217;t think it caused any confusion.</p>
<p>I didn&#8217;t explain why we broke the numbers down by starting with the largest powers and working down. If I had more time, maybe I would have let them discover the algorithm themselves.</p>
<p>I use the term &ldquo;number&rdquo; when I really mean &ldquo;numeral&rdquo;, as in &ldquo;binary number&rdquo; or &ldquo;decimal number.&rdquo; This terminology is unfortunate, but it is standard.</p>
<h2>References</h2>
<ul>
<li><a title="See Rick Garlikov's article on the use of the Socratic method to teach binary numbers to third graders" href="http://www.garlikov.com/Soc_Meth.html">Rick Garlikov&#8217;s use of the Socratic method to teach binary numbers to third graders</a>.
<p>I used a different approach, but a lot of the same concepts are involved. Rick&#8217;s method centered on binary counting, which lead to a discussion of powers and places. My method started with powers and places, and lead to binary numerals and then binary counting. Rick discussed other bases after discussing binary, whereas I discussed them before. Also, he discussed binary arithmetic, but I did not.</p>
<p>One thing I liked about my approach is that I built in the concept of base conversion, showing the equivalence of whole numbers written in any base. I also liked the way I exhibited the concept of &ldquo;number vs. numeration.&rdquo;</p>
</li>
<li><a title="See CS Unplugged Page on Binary Numbers" href="http://csunplugged.org/binary-numbers">Computer Science Unplugged Page on Binary Numbers</a>.
<p>This page contains videos on binary counting, which inspired my own binary counting demonstration.</p>
</li>
<li>My article &ldquo;<a title="Read Rick Regan's Article &ldquo;How I Taught My Mother Binary Numbers&rdquo;" href="http://www.exploringbinary.com/how-i-taught-my-mother-binary-numbers/">How I Taught My Mother Binary Numbers</a>.&rdquo;
<p>I taught my mother a little differently (at least in my second attempt), mainly because I think most adults don&#8217;t think explicitly about place value.</p>
</li>
</ul>
<h2>Please Try This</h2>
<p>I&#8217;d love to know if this method works for you; if you try it, please let me know!</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/how-i-taught-third-graders-binary-numbers/">How I Taught Third Graders Binary Numbers</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/how-i-taught-third-graders-binary-numbers/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Pi and e In Binary</title>
		<link>http://www.exploringbinary.com/pi-and-e-in-binary/</link>
		<comments>http://www.exploringbinary.com/pi-and-e-in-binary/#comments</comments>
		<pubDate>Wed, 18 May 2011 18:52:36 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Convert to decimal]]></category>
		<category><![CDATA[Convert to hexadecimal]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=333</guid>
		<description><![CDATA[Some people are curious about the binary representations of the mathematical constants pi and e. Mathematically, they&#8217;re like every other irrational number &#8212; infinite strings of 0s and 1s (with no discernible pattern). In a computer, they&#8217;re finite, making them only approximations to their true values. I will show you what their approximations look like [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/pi-and-e-in-binary/">Pi and e In Binary</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Some people are curious about the binary representations of the mathematical constants <a title="Wikipedia article on the mathematical constant &ldquo;pi&rdquo;" href="http://en.wikipedia.org/wiki/Pi">pi</a> and <a title="Wikipedia article on the mathematical constant &ldquo;e&rdquo;" href="http://en.wikipedia.org/wiki/E_%28mathematical_constant%29">e</a>. Mathematically, they&#8217;re like every other irrational number &#8212; infinite strings of 0s and 1s (with no discernible pattern). In a computer, they&#8217;re finite, making them only approximations to their true values. I will show you what their approximations look like in <a title="Wikipedia article on floating-point precision" href="http://en.wikipedia.org/wiki/Floating_point#Internal_representation">five different levels of binary floating-point precision</a>.</p>
<div class="wp-caption aligncenter" style="width: 450px"><img src="http://www.exploringbinary.com/wp-content/uploads/pi.e.png" alt="The first 43 bits of pi and e" width="440" height="77"/><p class="wp-caption-text">The first 43 bits of pi and e</p></div>
<p><span id="more-333"></span> </p>
<h2>Binary Floating-Point Formats</h2>
<p>In binary floating-point, infinitely precise values are rounded to finite precision. Here&#8217;s how rounding works in five different levels of precision:</p>
<ul>
<li>In <em>half-precision</em>, values are rounded to 11 significant bits.</li>
<li>In <em>single-precision</em>, values are rounded to 24 significant bits.</li>
<li>In <em>double-precision</em>, values are rounded to 53 significant bits.</li>
<li>In <em>extended-precision</em>, values are rounded to 64 significant bits.</li>
<li>In <em>quadruple-precision</em>, values are rounded to 113 significant bits.</li>
</ul>
<p>The rounding rule used most often in practice is <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round-to-nearest, round-half-to-even</a>; that&#8217;s the rule I will use. For pi and <em>e</em>, there are no &ldquo;half to even&rdquo; cases, since their binary expansions are infinite. This makes the rounding rule simple: if the rounding bit is 0, round down; if the rounding bit is 1, round up.</p>
<p>I will show the correctly rounded approximations of pi and <em>e</em> in these five formats.</p>
<h2>Pi (&pi;)</h2>
<p>Here are the first 50 decimal digits of pi:</p>
<p>3.1415926535897932384626433832795028841971693993751&#8230;</p>
<p>Here are the first 128 <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">bits of pi</a>:</p>
<pre class="noborder">
11.00100100001111110110101010001000100001011010001100001000110100
1100010011000110011000101000101110000000110111000001110011010001...
</pre>
<p>Here are the 128 bits again, with the rounding bit for each level of precision highlighted (bits 12, 25, 54, 65, and 114):</p>
<pre class="noborder">
11.001001000<span class="highlight_gray4">0</span>111111011010<span class="highlight_gray4">1</span>0100010001000010110100011000<span class="highlight_gray4">0</span>1000110100
<span class="highlight_gray4">1</span>100010011000110011000101000101110000000110111000<span class="highlight_gray4">0</span>01110011010001...
</pre>
<p>Here are the correctly rounded values of pi in each of the five levels of precision, shown in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">normalized binary scientific notation</a> and as <a title="Read Rick Regan's article &ldquo;Hexadecimal Floating-Point Constants&rdquo;" href="http://www.exploringbinary.com/hexadecimal-floating-point-constants/">hexadecimal floating-point constants</a>:</p>
<h3>Half-Precision</h3>
<pre class="noborder">
pi = 1.1001001 x 2<sup>1</sup>
   = 0&#120;1.92p+1
</pre>
<p><a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">This equals 3.140625 in decimal</a> (all binary floating-point numbers have exact decimal representations), which approximates pi accurately to about 4 decimal digits.</p>
<p>(You can verify this conversion by hand, by adding the powers of two corresponding to the positions of the 1 bits: 11.001001 = 2<sup>1</sup> + 2<sup>0</sup> + 2<sup>-3</sup> + 2<sup>-6</sup> = 3.140625.)</p>
<h3>Single-Precision</h3>
<pre class="noborder">
pi = 1.10010010000111111011011 x 2<sup>1</sup>
   = 0&#120;1.921fb6p+1
</pre>
<p>This equals 3.1415927410125732421875, which approximates pi accurately to about 8 decimal digits.</p>
<h3>Double-Precision</h3>
<pre class="noborder">
pi = 1.1001001000011111101101010100010001000010110100011 x 2<sup>1</sup>
   = 0&#120;1.921fb54442d18p+1
</pre>
<p>This equals 3.141592653589793115997963468544185161590576171875, which approximates pi accurately to about 16 decimal digits.</p>
<h3>Extended-Precision</h3>
<pre class="noborder">
pi = 1.100100100001111110110101010001000100001011010001100001000110
     101 x 2<sup>1</sup>
   = 0&#120;1.921fb54442d1846ap+1
</pre>
<p>This equals</p>
<p>3.14159265358979323851280895940618620443274267017841339111328125</p>
<p>which approximates pi accurately to about 20 decimal digits.</p>
<h3>Quadruple-Precision</h3>
<pre class="noborder">
pi = 1.100100100001111110110101010001000100001011010001100001000110
     1001100010011000110011000101000101110000000110111 x 2<sup>1</sup>
   = 0&#120;1.921fb54442d18469898cc51701b8p+1
</pre>
<p>This equals</p>
<p>3.141592653589793238462643383279502797479068098137295573004504331<br />
874296718662975536062731407582759857177734375</p>
<p>which approximates pi accurately to about 35 decimal digits.</p>
<h2>Euler&#8217;s Number (e)</h2>
<p>Here are the first 50 decimal digits of <em>e</em>:</p>
<p>2.7182818284590452353602874713526624977572470936999&#8230;</p>
<p>Here are the first 128 bits of <em>e</em>:</p>
<pre class="noborder">
10.10110111111000010101000101100010100010101110110100101010011010
1010111111011100010101100010000000100111001111010011110011110001...
</pre>
<p>Here are the 128 bits again, with the rounding bits for each level of precision highlighted:</p>
<pre class="noborder">
10.101101111<span class="highlight_gray4">1</span>100001010100<span class="highlight_gray4">0</span>1011000101000101011101101001<span class="highlight_gray4">0</span>1010011010
<span class="highlight_gray4">1</span>010111111011100010101100010000000100111001111010<span class="highlight_gray4">0</span>11110011110001...
</pre>
<p>Here are the correctly rounded values of <em>e</em> in each of the five levels of precision:</p>
<h3>Half-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010111 x 2<sup>1</sup>
  = 0&#120;1.5cp+1
</pre>
<p>This equals 2.71875, which approximates <em>e</em> accurately to about 4 decimal digits.</p>
<h3>Single-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010110111111000010101 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8p+1
</pre>
<p>This equals 2.71828174591064453125, which approximates <em>e</em> accurately to about 8 decimal digits.</p>
<h3>Double-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.0101101111110000101010001011000101000101011101101001 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8b145769p+1
</pre>
<p>This equals 2.718281828459045090795598298427648842334747314453125, which approximates <em>e</em> accurately to about 16 decimal digits.</p>
<h3>Extended-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010110111111000010101000101100010100010101110110100101010011
    011 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8b145769536p+1
</pre>
<p>This equals</p>
<p>2.71828182845904523542816810799394033892895095050334930419921875</p>
<p>which approximates <em>e</em> accurately to about 20 decimal digits.</p>
<h3>Quadruple-Precision</h3>
<pre class="noborder">
<em>e</em> = 1.010110111111000010101000101100010100010101110110100101010011
    010101011111101110001010110001000000010011100111101 x 2<sup>1</sup>
  = 0&#120;1.5bf0a8b1457695355fb8ac404e7ap+1
</pre>
<p>This equals</p>
<p>2.718281828459045235360287471352662314358421867193548862669230860<br />
32766716801933881697550532408058643341064453125</p>
<p>which approximates <em>e</em> accurately to about 34 decimal digits.</p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/pi-and-e-in-binary/">Pi and e In Binary</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/pi-and-e-in-binary/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Four Stages of Floating-Point Competence</title>
		<link>http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/</link>
		<comments>http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/#comments</comments>
		<pubDate>Wed, 20 Apr 2011 16:35:19 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=332</guid>
		<description><![CDATA[The four stages of competence model describes the phases you go through when acquiring a skill: Unconscious incompetence: You don&#8217;t know what you don&#8217;t know. Conscious incompetence: You know what you don&#8217;t know. Conscious competence: You know what you know. Unconscious competence: You don&#8217;t know what you know. I&#8217;ve applied this model to assess competence [...]<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">The Four Stages of Floating-Point Competence</a></p>
]]></description>
			<content:encoded><![CDATA[<p>The <a title="Wikipedia article on the the four stages of competence" href="http://en.wikipedia.org/wiki/Four_stages_of_competence">four stages of competence model</a> describes the phases you go through when acquiring a skill:</p>
<ol>
<li>Unconscious incompetence: You don&#8217;t know what you don&#8217;t know.</li>
<li>Conscious incompetence: You know what you don&#8217;t know.</li>
<li>Conscious competence: You know what you know.</li>
<li>Unconscious competence: You don&#8217;t know what you know.</li>
</ol>
<p>I&#8217;ve applied this model to assess competence in binary floating-point arithmetic; let&#8217;s see where you stand:</p>
<p><span id="more-332"></span>$19.24 plus $6.95 equals $26.189999999999998; you are</p>
<ol>
<li><em>Unconsciously incompetent</em> when you think there&#8217;s a bug in your programming language.</li>
<li><em>Consciously incompetent</em> when you realize that most decimal fractions are infinite in binary.</li>
<li><em>Consciously competent</em> when you study and apply the principles discussed in &ldquo;<a title="Read &ldquo;What Every Computer Scientist Should Know About Floating-Point Arithmetic&rdquo;" href="http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html">What Every Computer Scientist Should Know About Floating-Point Arithmetic</a>.&rdquo;</li>
<li><em>Unconsciously competent</em> when you know to use decimal or integer arithmetic instead of binary floating-point.</li>
</ol>
<p class="break"> <img src='http://www.exploringbinary.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>By Rick Regan (Copyright &copy; 2008-2012  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/">The Four Stages of Floating-Point Competence</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/the-four-stages-of-floating-point-competence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

