<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.exploringbinary.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">

<channel>
	<title>Exploring Binary</title>
	
	<link>http://www.exploringbinary.com</link>
	<description>Binary Numbers, Binary Code, and Binary Logic</description>
	<lastBuildDate>Wed, 01 Sep 2010 01:29:13 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.exploringbinary.com/ExploringBinary" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="exploringbinary" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" /><feedburner:emailServiceId xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">ExploringBinary</feedburner:emailServiceId><feedburner:feedburnerHostname xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Double Rounding Errors in Floating-Point Conversions</title>
		<link>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/</link>
		<comments>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 16:21:21 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=291</guid>
		<description><![CDATA[Double rounding is when a number is rounded twice, first from n0 digits to n1 digits, and then from n1 digits to n2 digits. Double rounding is often harmless, giving the same result as rounding once, directly from n0 digits to n2 digits. However, sometimes a doubly rounded result will be incorrect, in which case [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/">Double Rounding Errors in Floating-Point Conversions</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Wikipedia article definition of double rounding" href="http://en.wikipedia.org/wiki/Rounding#Double_rounding">Double rounding</a> is when a number is rounded twice, first from n<sub>0</sub> digits to n<sub>1</sub> digits, and then from n<sub>1</sub> digits to n<sub>2</sub> digits. Double rounding is often harmless, giving the same result as rounding once, directly from n<sub>0</sub> digits to n<sub>2</sub> digits. However, sometimes a doubly rounded result will be incorrect, in which case we say that a <strong><em>double rounding error</em></strong> has occurred. </p>
<p>For example, consider the 6-digit decimal number 7.23496. Rounded directly to 3 digits &#8212; using round-to-nearest, <a title="Wikipedia article definition of &ldquo;round half to even&rdquo; rounding mode" href="http://en.wikipedia.org/wiki/Rounding#Round_half_to_even">round half to even</a> rounding &#8212; it&#8217;s 7.23; rounded first to 5 digits (7.2350) and then to 3 digits it&#8217;s 7.24. The value 7.24 is incorrect, reflecting a double rounding error.</p>
<p>In a computer, double rounding occurs in binary floating-point arithmetic; the typical example is a calculated result that&#8217;s rounded to fit into an x87 FPU extended precision register and then rounded again to fit into a double-precision variable. But I&#8217;ve discovered another context in which double rounding occurs: conversion from a decimal floating-point literal to a single-precision floating-point variable. The double rounding is from full-precision binary to double-precision, and then from double-precision to single-precision.</p>
<p>In this article, I&#8217;ll show example conversions in C that are tainted by double rounding errors, and how attaching the &#8216;f&#8217; suffix to floating-point literals prevents them &#8212; in gcc C at least, but not in Visual C++!</p>
<p><span id="more-291"></span></p>
<h2>Double Rounding Error Example 1</h2>
<p>The following C program demonstrates a double rounding error that occurs when a decimal floating-point literal is converted to a float by the compiler. The program converts the same decimal number three times: first to a double <em>d</em>; then to a float <em>f</em>, using the &#8216;f&#8217; (float) suffix; and then to a float <em>fd</em>, without using the &#8216;f&#8217; suffix.</p>
<pre>
#include &lt;stdio.h&gt;

int main (void)
{
 double d;
 float f, fd; 

 d =  0.5000000894069671353303618843710864894092082977294921875;
 f =  0.5000000894069671353303618843710864894092082977294921875f;
 <strong>fd = 0.5000000894069671353303618843710864894092082977294921875;</strong>

 printf (&quot;d =  %a\n&quot;,d);
 printf (&quot;f =  %a\n&quot;,f);
 printf (&quot;fd = %a\n&quot;,fd);
}
</pre>
<h3>Output on Linux</h3>
<p>Here&#8217;s the output when the program is compiled and run on Linux using gcc:</p>
<pre>
d =  0x1.000003p-1
f =  0x1.000002p-1
<strong>fd = 0x1.000004p-1</strong>
</pre>
<p>Translated from hexadecimal floating-point constants to binary, the output is</p>
<pre class="noborder">
d =  0.1000000000000000000000011
f =  0.100000000000000000000001
fd = 0.10000000000000000000001
</pre>
<p>To interpret the output, first consider the <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">full-precision binary equivalent</a> of the decimal number in the program (it is a terminating binary fraction):</p>
<p>0.1000000000000000000000010111111111111111111111111111111</p>
<p>Now let&#8217;s look at the three variables in turn:</p>
<ul>
<li><em>d</em> is the <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/#correct-rounding">correctly rounded conversion</a> of the full-precision binary value to double-precision &#8212; 53 significant bits. The full-precision binary value is rounded up, since the value of bits 54 and beyond is greater than 1/2 <a title="Wikipedia article &ldquo;Unit in the last place&rdquo;" href="http://en.wikipedia.org/wiki/Unit_in_the_last_place">ULP</a>:
<p>0.10000000000000000000000101111111111111111111111111111<span class="highlight_gray4">11</span></p>
</li>
<li><em>f</em> is the correctly rounded conversion of the full-precision binary value to single-precision &#8212; 24 significant bits. The full-precision binary value is rounded down, since the value of bits 25 and beyond is <em>less than</em> 1/2 ULP:
<p>0.100000000000000000000001<span class="highlight_gray4">0111111111111111111111111111111</span></p>
</li>
<li><em>fd</em> is an <em><strong>incorrectly rounded conversion</strong></em> to single-precision. It&#8217;s 1 ULP above the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it up to 53 bits, giving 0.100000000000000000000001<span class="highlight_gray4">1</span>, and then rounding that up to 24 bits, giving 0.10000000000000000000001. In other words, the first rounding causes a carry to propagate to bit 25, creating a <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/#halfway-case">halfway case</a> that gets resolved by rounding up to nearest even.</li>
</ul>
<h4>Summary</h4>
<p>With the &#8216;f&#8217; suffix on a decimal literal, gcc converts a full-precision binary value directly to single-precision, avoiding double rounding. Without the &#8216;f&#8217; suffix, gcc converts the full-precision binary value to double-precision and then to single-precision; this indirect conversion from full-precision to single-precision is double rounding, and in this case results in a double rounding error.</p>
<h3>Output on Windows</h3>
<p>Here&#8217;s the output when the program is compiled and run on Windows using Visual C++:</p>
<pre>
d =  0x1.000003p-1
<strong>f =  0x1.000004p-1</strong>
<strong>fd = 0x1.000004p-1</strong>
</pre>
<p>Like with gcc, <em>fd</em> is doubly rounded; but Visual C++ doubly rounds <em>f</em> as well!  <strong>It seems Visual C++ ignores the &#8216;f&#8217; suffix</strong>, first rounding to double-precision and then rounding to single-precision. (I am presuming this is the case. It&#8217;s also possible that Visual C++ is honoring the &#8216;f&#8217; suffix, only it&#8217;s <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">doing the conversion incorrectly</a> &#8212; due to reasons other than double rounding.)</p>
<h2>Double Rounding Error Example 2</h2>
<p>Here&#8217;s a similar example, except that it demonstrates a double rounding error that results in a value that is 1 ULP <em>below</em> the correctly rounded value:</p>
<pre>
#include &lt;stdio.h&gt;

int main (void)
{
 double d;
 float f, fd; 

 d =  0.5000000298023224154508881156289135105907917022705078125;
 f =  0.5000000298023224154508881156289135105907917022705078125f;
 <strong>fd = 0.5000000298023224154508881156289135105907917022705078125;</strong>

 printf (&quot;d =  %a\n&quot;,d);
 printf (&quot;f =  %a\n&quot;,f);
 printf (&quot;fd = %a\n&quot;,fd);
}
</pre>
<h3>Output on Linux</h3>
<p>Here’s the output when the program is compiled and run on Linux using gcc:</p>
<pre>
d =  0x1.000001p-1
f =  0x1.000002p-1
<strong>fd = 0x1p-1</strong>
</pre>
<p>Translated from hexadecimal floating-point constants to binary, the output is</p>
<pre class="noborder">
d =  0.1000000000000000000000001
f =  0.100000000000000000000001
fd = 0.1
</pre>
<p>The full-precision binary equivalent of the decimal number in the program is</p>
<p>0.1000000000000000000000001000000000000000000000000000001</p>
<p>Again, let&#8217;s look at the three variables:</p>
<ul>
<li><em>d</em> is the correctly rounded conversion of the full-precision binary value to 53 significant bits. The full-precision binary value is rounded down, since the value of bits 54 and beyond is less than 1/2 ULP:
<p>0.10000000000000000000000010000000000000000000000000000<span class="highlight_gray4">01</span></p>
</li>
<li><em>f</em> is the correctly rounded conversion of the full-precision binary value to 24 significant bits. The full-precision binary value is rounded up, since the value of bits 25 and beyond is <em>greater than</em> 1/2 ULP:
<p>0.100000000000000000000000<span class="highlight_gray4">1000000000000000000000000000001</span></p>
</li>
<li><em>fd</em> is an <em><strong>incorrectly rounded conversion</strong></em> to single-precision. It&#8217;s 1 ULP below the correctly rounded value, due to a double rounding error. The compiler converted it by first rounding it down to 53 bits, giving 0.100000000000000000000000<span class="highlight_gray4">1</span>, and then rounding that down to 24 bits, giving 0.1. In other words, the first rounding creates a halfway case that gets resolved by rounding down to nearest even.</li>
</ul>
<p>As in example 1, use of the ‘f’ suffix prevents double rounding. </p>
<h3>Output on Windows</h3>
<p>Here&#8217;s the output when the program is compiled and run on Windows using Visual C++:</p>
<pre>
d =  0x1.000001p-1
<strong>f =  0x1.000000p-1</strong>
<strong>fd = 0x1.000000p-1</strong>
</pre>
<p>As in example 1, Visual C++ ignores the &#8216;f&#8217; suffix, resulting in double rounding for both <em>f</em> and <em>fd</em>.</p>
<h2>Summary</h2>
<p>Floating-point literals are subject to double rounding when assigned to single-precision variables, resulting in incorrectly rounded decimal to floating-point conversions. If you&#8217;re using the gcc C compiler, you can avoid this by attaching the &#8216;f&#8217; suffix to your literals.</p>
<h3>A Note on the Examples</h3>
<p>The examples above are admittedly contrived, but they serve to illustrate the potential for double rounding errors during conversion. </p>
<h2>Please Try My Examples</h2>
<p>If you are using a compiler other than Visual C++ or gcc, please try out my examples. In particular, I&#8217;d like to know if your compiler doubly rounds decimal literals when the &#8216;f&#8217; suffix is used.</p>
<h2>Bug Report</h2>
<p>I wrote a <a title="See Microsoft Bug Report &ldquo;Double Rounding Error When the 'f' Suffix is Used on Floating-Point Literals&rdquo;" href="http://connect.microsoft.com/VisualStudio/feedback/details/585669/double-rounding-error-when-the-f-suffix-is-used-on-floating-point-literals">bug report against Visual C++</a>; this was Microsoft&#8217;s response:</p>
<blockquote><p>&ldquo;This is indeed an issue with the compiler but we regret that we cannot fix it for the next release due to its priority.&rdquo;</p></blockquote>
<h2>References</h2>
<p>Here is some background reading on double rounding:</p>
<ul>
<li><a title="See Book &ldquo;Handbook of Floating-Point Arithmetic&rdquo;" href="http://books.google.com/books?id=VDJVFM8WyzsC&#038;printsec=frontcover&#038;dq=Handbook+of+Floating-Point+Arithmetic&#038;source=bl&#038;ots=qlW0NFNKgW&#038;sig=5x2gReyRufmK0HQzaAT-V84e4To&#038;hl=en&#038;ei=FklhTJziIoO88gah2J2jCg&#038;sa=X&#038;oi=book_result&#038;ct=result&#038;resnum=3&#038;ved=0CCUQ6AEwAg#v=onepage&#038;q&#038;f=false">Handbook of Floating-Point Arithmetic</a> (see pages 75-77).</li>
<li><a title="See Presentation &ldquo;When Double Rounding is Odd&rdquo;" href="http://www.lri.fr/~melquion/doc/05-imacs17_1-expose.pdf">When Double Rounding is Odd</a> (see the two pages titled &ldquo;Double rounding going bad&rdquo;).</li>
<li><a title="See Paper &ldquo;When Is Double Rounding Innocuous?&rdquo;" href="http://delivery.acm.org/10.1145/230000/221334/p21-figueroa.pdf?key1=221334&#038;key2=8593441821&#038;coll=portal&#038;dl=ACM&#038;CFID=15151515&#038;CFTOKEN=6184618">When Is Double Rounding Innocuous?</a> (see the first two pages).</li>
</ul>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/">Double Rounding Errors in Floating-Point Conversions</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Displaying IEEE Doubles in Binary Scientific Notation</title>
		<link>http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/</link>
		<comments>http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 04:39:28 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=289</guid>
		<description><![CDATA[An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field which, when appended to an implicit leading [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">Displaying IEEE Doubles in Binary Scientific Notation</a></p>
]]></description>
			<content:encoded><![CDATA[<p>An IEEE double-precision floating-point number, or double, is a 64-bit encoding of a rational number. Internally, the 64 bits are broken into three fields: a 1-bit sign field, which represents positive or negative; an 11-bit exponent field, which represents a power of two; and a 52-bit fraction field which, when appended to an implicit leading 1, represents a normalized binary fraction. The three fields together represent a number in binary scientific notation, with 53 bits of precision.</p>
<p>For example, consider the decimal number 33.75.  It converts to a double with a sign field of 0, an exponent field of 10000000100, and a fraction field of 0000111000000000000000000000000000000000000000000000. The 0 in the sign field means it&#8217;s a positive number (1 would mean it&#8217;s negative). The value of 10000000100 in the exponent field, which equals 1028 in decimal, means the exponent of the power of two is 5 (the exponent field value is offset, or biased, by 1023). The fraction field represents the normalized binary fraction 1.0000111. Written in binary scientific notation &#8212; following the convention that the fraction is written in binary and the power of two is written in decimal &#8212; 33.75 equals 1.0000111 x 2<sup>5</sup>.</p>
<p>In this article, I&#8217;ll show you the C function I wrote to display a double in binary scientific notation. This function is useful, for example, when <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">verifying that decimal to floating-point conversions are correctly rounded</a>.</p>
<p><span id="more-289"></span></p>
<h2>The Code</h2>
<p>I wrote a function called <strong>print_double_binsci()</strong> that prints a double-precision floating-point number in binary scientific notation. It is based on a call to my function <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">parse_double()</a>, which isolates the three fields of a double.</p>
<p>I declared and defined this function in files I named binsci.h and binsci.c, respectively.</p>
<h3>binsci.h</h3>
<pre>
/***********************************************************/
/* binsci.h: Function to print an IEEE double-precision    */
/*           floating-point number in binary scientific    */
/*           notation                                      */
/***********************************************************/
void print_double_binsci(double d);
</pre>
<h3>binsci.c</h3>
<pre>
/***********************************************************/
/* binsci.c: Function to print an IEEE double-precision    */
/*           floating-point number in binary scientific    */
/*           notation                                      */
/***********************************************************/
#include &lt;stdio.h&gt;
#include &quot;<a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">rawdouble.h</a>&quot;
#include &quot;binsci.h&quot;

void print_double_binsci(double d)
{
 int i;
 unsigned char sign_field;
 unsigned short exponent_field;
 unsigned long long fraction_field;

 //Isolate the three fields of the double
 <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">parse_double</a>(d,&#038;sign_field,&#038;exponent_field,&#038;fraction_field);

 //Print a minus sign, if necessary
 if (sign_field == 1)
   printf(&quot;-&quot;);

 //Print the implicit leading 1 and radix point
 printf(&quot;1.&quot;);

 //Print the fraction field
 for (i=0; i&lt;=51; i++)
   if ((fraction_field &gt;&gt; (51-i)) &#038; 1)
     printf(&quot;1&quot;);
   else
     printf(&quot;0&quot;);

 //Print the power, taking into account exponent bias
 printf(&quot; x 2^%d\n&quot;,exponent_field - 1023);
}
</pre>
<h3>Notes</h3>
<ul>
<li>All 53 bits are printed, including superfluous trailing 0s.</li>
<li>Numbers that are not raised to a power are printed with the suffix &ldquo;x 2^0&rdquo;.</li>
<li>Not-a-number (NaN) and infinity values are not handled.</li>
</ul>
<h2>Examples of Usage</h2>
<p>I wrote a program, called binsciTest.c, that shows some example calls to print_double_binsci():</p>
<pre>
/***********************************************************/
/* binsciTest.c: Program to test printing of IEEE double   */
/*               precision floating-point numbers in       */
/*               binary scientific notation                */
/***********************************************************/
#include &lt;stdio.h&gt;
#include &quot;binsci.h&quot;

int main (void)
{
 printf(&quot;33.75 =\n&quot;);
 print_double_binsci(33.75);
 printf(&quot;\n&quot;);

 printf(&quot;0.1 =\n&quot;);
 print_double_binsci(0.1);
 printf(&quot;\n&quot;);

 printf(&quot;-0.6 =\n&quot;);
 print_double_binsci(-0.6);
 printf(&quot;\n&quot;);

 printf(&quot;3.518437208883201171875e13 =\n&quot;);
 print_double_binsci(3.518437208883201171875e13);
 printf(&quot;\n&quot;);

 printf(&quot;9214843084008499.0 =\n&quot;);
 print_double_binsci(9214843084008499.0);
 printf(&quot;\n&quot;);

 printf(&quot;30078505129381147446200.0 =\n&quot;);
 print_double_binsci(30078505129381147446200.0);
 printf(&quot;\n&quot;);

 printf(&quot;1777820000000000000001.0 =\n&quot;);
 print_double_binsci(1777820000000000000001.0);
 printf(&quot;\n&quot;);

 printf(&quot;0.3932922657273 =\n&quot;);
 print_double_binsci(0.3932922657273);

 return (0);
}
</pre>
<p>(Some of these examples were taken from my articles <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a> and <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a>.)</p>
<p>I compiled and ran it on both Windows and Linux:</p>
<ul>
<li>On Windows, I built a project in Visual C++ with files binsci.c, binsci.h, binsciTest.c, rawdouble.c, and rawdouble.h, and compiled and ran it in there.</li>
<li>On Linux, I compiled with &ldquo;gcc binsciTest.c binsci.c rawdouble.c -o binsciTest&rdquo; and then ran it with &#8220;./binsciTest&#8221;.</li>
</ul>
<h3>Output</h3>
<pre>
33.75 =
1.0000111000000000000000000000000000000000000000000000 x 2^5

0.1 =
1.1001100110011001100110011001100110011001100110011010 x 2^-4

-0.6 =
-1.0011001100110011001100110011001100110011001100110011 x 2^-1

3.518437208883201171875e13 =
1.0000000000000000000000000000000000000000000000000001 x 2^45

9214843084008499.0 =
1.0000010111100110110011101100010101110111011000011001 x 2^53

30078505129381147446200.0 =
1.1001011110100011110001110010011100011011000000100000 x 2^74

1777820000000000000001.0 =
1.1000000110000000110101011011101011010010111000111110 x 2^70

0.3932922657273 =
1.1001001010111011001101010010110001000110001000111001 x 2^-2
</pre>
<p>Except for 33.75, which is exact, all the other examples are 53 significant bit <em>approximations</em> to the decimal numbers they stand in for (type &lsquo;0.1&rsquo; into my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal to binary converter</a> and see for yourself).</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">Displaying IEEE Doubles in Binary Scientific Notation</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Incorrect Directed Conversions in David Gay’s strtod()</title>
		<link>http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/</link>
		<comments>http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/#comments</comments>
		<pubDate>Wed, 23 Jun 2010 13:47:58 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=287</guid>
		<description><![CDATA[For correctly rounded decimal to floating-point conversions, many open source projects rely on David Gay&#8217;s strtod() function. In the default rounding mode, IEEE 754 round-to-nearest, this function is known to give correct results (notwithstanding recent bugs, which have been fixed). However, in the less frequently used IEEE 754 directed rounding modes &#8212; round toward positive [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/">Incorrect Directed Conversions in David Gay&#8217;s strtod()</a></p>
]]></description>
			<content:encoded><![CDATA[<p>For <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">correctly rounded decimal to floating-point conversions</a>, many open source projects rely on <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>. In the default rounding mode, IEEE 754 <em>round-to-nearest</em>, this function is known to give correct results (notwithstanding <a title="Change Log for David Gay's dtoa.c" href="http://www.netlib.org/fp/changes">recent bugs</a>, which have been fixed). However, in the less frequently used IEEE 754 directed rounding modes &#8212; <em>round toward positive infinity</em>, <em>round toward negative infinity</em>, and <em>round toward zero</em> &#8212; strtod() gives incorrectly rounded results for some inputs.</p>
<p><span id="more-287"></span></p>
<p>(<strong>Update 7/11/10: dtoa.c is now fixed &#8212; see below</strong>.)</p>
<h2>Discovering the Problem</h2>
<p>As a follow up to my article &ldquo;<a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a>,&rdquo; I asked Mark Dickinson &#8212; the developer who maintains Python&#8217;s copy of dtoa.c &#8212; why Python does not enable what I call <em>directed conversions</em>. Out of that discussion came Mark&#8217;s conjecture: the directed rounding code was not correct. He soon followed up with an example to prove it.</p>
<h2>An Automated Way to Find Examples</h2>
<p>I wanted to find more examples, but I wanted a C program to do it for me. I needed a way to generate examples whose correct outcomes were known (in my evaluations of round-to-nearest conversions, David Gay&#8217;s strtod() served this purpose). Inspired by Mark&#8217;s example, this is the algorithm I devised:</p>
<ol>
<li>Generate a random decimal string that represents a number that is not an integer and is not <a title="Wikipedia article on dyadic fraction" href="http://en.wikipedia.org/wiki/Dyadic_fraction">dyadic</a>.</li>
<li>Convert the decimal string to a double (I used David Gay&#8217;s strtod(), but the built-in strtod() would have been fine &#8212; correct rounding does not matter here).</li>
<li>Convert the double to a decimal string, giving the exact decimal representation of a floating-point number (I used <a title="David Gay's g_fmt.c" href="http://www.netlib.org/fp/g_fmt.c">David Gay&#8217;s g_fmt() function</a>, since I&#8217;m running on Windows and <a title="Read Rick Regan's article &ldquo;Print Precision of Dyadic Fractions Varies by Language&rdquo;" href="http://www.exploringbinary.com/print-precision-of-dyadic-fractions-varies-by-language/">printf() in Visual C++ can&#8217;t print all the decimal digits of a floating-point number</a>).</li>
<li>Change the last digit of the new decimal string &#8212; which by the way is always 5 (<a title="Read Rick Regan's Article &ldquo;Patterns in the Last Digits of the Positive Powers of Five&rdquo;" href="http://www.exploringbinary.com/patterns-in-the-last-digits-of-the-positive-powers-of-five/">this article</a> indirectly explains why) &#8212; to either 4 or 6, depending on whether you want a number just below or just above the floating-point number.</li>
</ol>
<p>Except for Mark&#8217;s example, the examples below were found by my C program implementation of this algorithm. I ran the program against a copy of dtoa.c most recently changed on &ldquo;Tue Feb  2 23:05:34 MST 2010,&rdquo; which I downloaded on May 27, 2010.</p>
<h2>Examples: Decimal Numbers Just Above a Floating-Point Number</h2>
<p>This section shows two examples that David Gay&#8217;s strtod() (hereafter referred to only as strtod()) rounds incorrectly; they are very close to, but just <em>above</em> (in absolute value), a floating-point number.</p>
<h3>Example 1</h3>
<p>This is Mark&#8217;s example, and it shows a positive number rounding to nearest in round toward positive infinity mode:</p>
<p>1.100000000000000088817841970012523233890533447265626</p>
<p>It converts to this binary number, which has a non-terminating binary fraction portion:</p>
<p>1.0001100110011001100110011001100110011001100110011010<span class="highlight_gray4">0</span>0000000000000<br />
00000000000000000000000000000000000000000000000000000000000000000000 00000000000000000000000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded toward positive infinity the result should be</p>
<p>1.0001100110011001100110011001100110011001100110011011</p>
<p>In hexadecimal “%a” notation, this equals 0&#120;1.199999999999bp+0.</p>
<p>strtod() converts it incorrectly to 0&#120;1.199999999999ap+0, which is one ULP below the correctly rounded value. 0&#120;1.199999999999ap+0 is the floating-point number equal to decimal 1.100000000000000088817841970012523233890533447265625, which differs from the input decimal value only in the last decimal digit (5 vs 6).</p>
<h3>Example 2</h3>
<p>This example shows a negative number rounding to nearest in round toward negative infinity mode:</p>
<p>-0.91276999999999997026378650843980722129344940185546876</p>
<p>It converts to this non-terminating binary number:</p>
<p>-0.11101001101010110100101101110010110001010001100101111<span class="highlight_gray4">0</span>000000000000<br />
0000000000000000000000000000000000000000000000000000000000000000000<br />
0000000000000000000000000000000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded toward negative infinity the result should be -0&#120;1.d35696e58a330p-1; strtod() converts it incorrectly to -0&#120;1.d35696e58a32fp-1.</p>
<h2>Examples: Decimal Numbers Just Below a Floating-Point Number</h2>
<p>This section shows two examples that strtod() rounds incorrectly; they are very close to, but just <em>below</em> (in absolute value), a floating-point number.</p>
<h3>Example 3</h3>
<p>This example shows a negative number rounding to nearest in round toward positive infinity mode: </p>
<p>-266.240000000000009094947017729282379150390624</p>
<p>Rounded toward positive infinity it should be -0&#120;1.0a3d70a3d70a3p+8; strtod() rounds it incorrectly to -0&#120;1.0a3d70a3d70a4p+8.</p>
<h3>Example 4</h3>
<p>This example shows a positive number rounding to nearest in both round toward negative infinity and round toward zero modes: </p>
<p>8.2556288587679180024720432899523381023022507640626854730214745359262<br />
45152950286865234374e-17 </p>
<p>Rounded toward negative infinity and rounded toward zero it should be 0&#120;1.7cb9433617c9bp-54; strtod() rounds it incorrectly &#8212; in both cases &#8212; to 0&#120;1.7cb9433617c9cp-54.</p>
<h2>Discussion</h2>
<p>In my article &ldquo;<a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a>&rdquo; I said</p>
<blockquote><p>The percentage of incorrect conversions for both Visual C++ and glibc ranged between 49.00% and 49.06%, suggesting they were not reacting to the rounding mode at all. (Why it wasn’t almost exactly 50% I don’t know.) </p></blockquote>
<p>It might not have been 50% because, although I didn&#8217;t know it at the time, strtod()&#8217;s conversions weren&#8217;t 100% correct! <strike>(I&#8217;ll rerun those tests when dtoa.c is fixed.)</strike> (Update: the percentages are still near 49%, even with the fixed dtoa.c, so scratch that theory.)</p>
<h3>Do You Use Directed Rounding?</h3>
<p>If your project uses dtoa.c, could you let me know whether it enables directed conversions? That is, does it define the &ldquo;Honor_FLT_ROUNDS&rdquo; flag?</p>
<p>You should be aware that <a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">strtod() reacts to the rounding mode, although incorrectly, when Honor_FLT_ROUNDS is not defined</a>.</p>
<h2>Bug Report</h2>
<p>I reported this to David Gay; I&#8217;ll keep you posted.</p>
<h3>Update: dtoa.c is Fixed</h3>
<p>David Gay updated dtoa.c with a fix, and it appears to work: all four examples above now round correctly, as did the millions of random examples I checked.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/">Incorrect Directed Conversions in David Gay&#8217;s strtod()</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visual C++ and GLIBC strtod() Ignore Rounding Mode</title>
		<link>http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/</link>
		<comments>http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 13:28:25 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=286</guid>
		<description><![CDATA[When a decimal number is converted to a binary floating-point number, the floating-point number, in general, is only an approximation to the decimal number. Large integers, and most decimal fractions, require more significant bits than can be represented in the floating-point format. This means the decimal number must be rounded, to one of the two [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a></p>
]]></description>
			<content:encoded><![CDATA[<p>When a decimal number is converted to a binary floating-point number, the floating-point number, in general, is only an approximation to the decimal number. Large integers, and most decimal fractions, require more significant bits than can be represented in the floating-point format. This means the decimal number must be rounded, to one of the two floating-point numbers that surround it.</p>
<p>Common practice considers a decimal number correctly rounded when the <em>nearest</em> of the two floating-point numbers is chosen (and when both are equally near, when the one with significant bit number 53 equal to 0 is chosen). This makes sense intuitively, and also reflects the default IEEE 754 rounding mode &#8212; <em>round-to-nearest</em>. However, there are three other IEEE 754 rounding modes, which allow for directed rounding: <em>round toward positive infinity</em>, <em>round toward negative infinity</em>, and <em>round toward zero</em>. For a conversion to be considered truly correctly rounded, it must honor all four rounding modes &#8212; whichever is currently in effect.</p>
<p>I evaluated the Visual C++ and glibc strtod() functions under the three directed rounding modes, like I did for round-to-nearest mode in my articles &ldquo;<a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a>&rdquo; and &ldquo;<a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a>.&rdquo;. What I discovered was this: they only convert correctly about half the time &#8212; pure chance! &#8212; because they ignore the rounding mode altogether.</p>
<p><span id="more-286"></span></p>
<h2>Example: Correct Rounding in All Four Rounding Modes</h2>
<p>The diagram below shows how correct conversion works in each of the four rounding modes; I used the decimal values 0.1 and -0.1 as examples:</p>
<div class="wp-caption aligncenter" style="width: 557px"><img src="http://www.exploringbinary.com/wp-content/uploads/RoundingIntervalCorrect.png" alt="How 0.1 and -0.1 Convert Correctly in Each of the Four Rounding Modes" width="547" height="169"/><p class="wp-caption-text">How 0.1 and -0.1 Convert Correctly in Each of the Four Rounding Modes</p></div>
<p>The conversions &#8212; shown in hexadecimal &ldquo;%a&rdquo; notation &#8212; were done by <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a> and also verified by hand. In terms of absolute value, there are two different results; here they are in binary:</p>
<ul>
<li>0&#120;1.9999999999999p-4 =
<p>1.1001100110011001100110011001100110011001100110011001  x 2<sup>-4</sup></p>
</li>
<li>0&#120;1.999999999999ap-4 =
<p>1.100110011001100110011001100110011001100110011001101 x 2<sup>-4</sup></p>
</li>
</ul>
<h2>Example: Incorrect Rounding in Visual C++ and glibc</h2>
<p>Visual C++ and glibc strtod() do four of the eight conversions incorrectly:</p>
<div class="wp-caption aligncenter" style="width: 555px"><img src="http://www.exploringbinary.com/wp-content/uploads/RoundingIntervalIncorrect.png" alt="How 0.1 and -0.1 Convert Incorrectly in Visual C++ and glibc" width="545" height="166"/><p class="wp-caption-text">How 0.1 and -0.1 Convert Incorrectly in Visual C++ and glibc</p></div>
<p>0.1, in round toward zero and round toward negative infinity modes, converts incorrectly to 0&#120;1.999999999999ap-4. -0.1, in round toward zero and round toward positive infinity modes, converts incorrectly to -0&#120;1.999999999999ap-4. That is, Visual C++ and glibc are converting 0.1 and -0.1 to a single absolute value in every case: 0&#120;1.999999999999ap-4. That, not coincidentally, is the nearest floating-point number.</p>
<h2>Visual C++ and glibc Always Round to Nearest</h2>
<p>To see what percentage of conversions Visual C++ and glibc got right, I tested strtod() with random decimal values &#8212; both positive and negative &#8212;  in each of the three directed rounding modes. I compared their results to those computed by David Gay&#8217;s strtod(). The percentage of incorrect conversions for both Visual C++ and glibc ranged between 49.00% and 49.06%, suggesting they were not reacting to the rounding mode at all. (Why it wasn&#8217;t almost exactly 50% I don&#8217;t know.) To confirm this, I generated random decimal values and converted each in all four modes; in every case, the four conversions were the same.</p>
<p><strong>Conclusion</strong>: Visual C++ and glibc are using arbitrary-precision arithmetic in their calculations, and their algorithms are hard-coded to perform round-to-nearest rounding.</p>
<p>(<em>Update</em>: I was assuming David Gay&#8217;s strtod() got the conversions right in the directed rounding modes, but it turns out <a title="Read Rick Regan's article &ldquo;Incorrect Directed Conversions in David Gay&rsquo;s strtod()&rdquo;" href="http://www.exploringbinary.com/incorrect-directed-conversions-in-david-gays-strtod/">there was a bug &#8212; now fixed &#8212; that caused some conversions to come out incorrectly rounded</a>. In any case, my conclusion is unaffected, since ultimately I didn&#8217;t need David Gay&#8217;s strtod() to reach it.)</p>
<h2>Enabling Rounding Mode Detection in David Gay&#8217;s strtod()</h2>
<p>David Gay&#8217;s strtod() function does not honor rounding mode by default; it must be enabled, by building dtoa.c with the flag &ldquo;Honor_FLT_ROUNDS&rdquo;. This forces the inclusion of file fenv.h, which defines the function fegetround(). fegetround() is called to get the current processor rounding mode, which strtod() uses to round its results correctly.</p>
<p>To make this work in Visual C++ &#8212; which doesn&#8217;t have the file fenv.h &#8212; I had to simulate fegetround(). I did this by calling _controlfp_s(&#038;cur_cw,0,0) and translating the rounding mode masks (I am using the x87 FPU).</p>
<h3>Conversion Algorithm Uses Floating-Point Arithmetic</h3>
<p>David Gay&#8217;s strtod() function uses floating-point as well as arbitrary-precision calculations in its conversion, which you can see by reading the source code. You can also see this by building dtoa.c without &ldquo;Honor_FLT_ROUNDS&rdquo; and running it in the different rounding modes. Although the conversions won&#8217;t  be correct (in general), they will differ, meaning they were computed with the help of the processor&#8217;s floating-point unit. </p>
<h2>Discussion</h2>
<h3>Compile Time Vs. Run Time Conversions</h3>
<p>I&#8217;ve been discussing strtod(), which does conversions at run time. But conversions can be done at compile time too &#8212; using floating-point literals. For the sake of argument, let&#8217;s say both strtod() and compile time conversions honor the current rounding mode (I don&#8217;t know if compile time conversions honor the rounding mode, but I expect they don&#8217;t). Assume also that both give the same result for the same rounding mode for the same decimal input. The compiled and strtod() conversions can still differ &#8212; if the rounding mode is different at run time than at compile time.</p>
<h3>GLIBC Bug Report</h3>
<p>The glibc bug report &ldquo;<a title="Bugzilla Bug 3479: Incorrect rounding in strtod()" href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=3479">Bugzilla Bug 3479: Incorrect rounding in strtod()</a>&rdquo; mentions, as an aside, strtod()&#8217;s failure to respond to the current rounding mode (see comment 6).</p>
<h3>Are There Any Applications That Depend on Directed Conversions?</h3>
<p>Are there any Visual C++ or glibc based applications that depend on these &ldquo;directed&rdquo; conversions? I hope not &#8212; it doesn&#8217;t work.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">Visual C++ and GLIBC strtod() Ignore Rounding Mode</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Incorrectly Rounded Conversions in GCC and GLIBC</title>
		<link>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/</link>
		<comments>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/#comments</comments>
		<pubDate>Thu, 03 Jun 2010 22:42:42 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=285</guid>
		<description><![CDATA[Visual C++ rounds some decimal to double-precision floating-point conversions incorrectly, but it&#8217;s not alone; the gcc C compiler and the glibc strtod() function do the same. In this article, I&#8217;ll show examples of incorrect conversions in gcc and glibc, and I&#8217;ll present a C program that demonstrates the errors. (To understand my methodology and terminology, [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Visual C++ rounds some decimal to double-precision floating-point conversions incorrectly</a>, but it&#8217;s not alone; the gcc C compiler and the glibc strtod() function do the same. In this article, I&#8217;ll show examples of incorrect conversions in gcc and glibc, and I&#8217;ll present a C program that demonstrates the errors.</p>
<p><span id="more-285"></span></p>
<p>(To understand my methodology and terminology, please read my article &ldquo;<a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a>.&rdquo;)</p>
<h2>My Analysis</h2>
<p>I ran my tests on an Intel Core Duo processor running <a title="Ubuntu Linux" href="http://www.ubuntu.com/">Ubuntu Linux</a> version 10.04, and I used the <a title="GCC, the GNU Compiler Collection" href="http://gcc.gnu.org/">gcc C compiler</a> version 4.4.3 and the <a title="Embedded GLIBC (EGLIBC)" href="http://www.eglibc.org/home">embedded glibc library (eglibc)</a> version 2.11.1. (As far as I can tell, eglibc uses the same conversion code as <a title="GNU C Library" href="http://www.gnu.org/software/libc/">glibc</a>, so I will refer to eglibc as glibc for the rest of this article.) (<em>Update</em>: I reran my tests on glibc, getting the same results.)</p>
<p>I did automated testing of glibc strtod() string conversions and manual testing of gcc floating-point literal conversions. This is what found in my automated testing of strtod():</p>
<ul>
<li>Conversions were done incorrectly only about four ten-thousandths of a percent of the time.</li>
<li>No conversion was incorrect by more than one <em>unit in the last place</em> (ULP).</li>
<li>Incorrect conversions included both &ldquo;hard&rdquo; and &ldquo;easy&rdquo; cases.</li>
<li>There were incorrect conversions covering all three cases: &ldquo;halfway&rdquo;, &ldquo;greater than halfway&rdquo;, and &ldquo;less than halfway.&rdquo;</li>
</ul>
<p>My manual testing of gcc was not as comprehensive, of course, but I did manage to find some examples of incorrect conversions.</p>
<p>For the remainder of this article, I will discuss six examples of incorrect conversions &#8212; examples uncovered by my automated and manual testing, as well as examples found in Bugzilla bug reports for gcc and glibc.</p>
<h2>Incorrect Conversion by Both GCC and GLIBC</h2>
<p>Here&#8217;s an example that&#8217;s converted incorrectly by <em>both</em> gcc and glibc &#8212; to the same incorrect value:</p>
<h3>Example 1 (Halfway Case)</h3>
<p>0.500000000000000166533453693773481063544750213623046875, </p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup>, converts to this binary number:</p>
<p>0.10000000000000000000000000000000000000000000000000001<span class="highlight_gray4">1</span></p>
<p>It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value &#8212; in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> &#8212; is</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>gcc and glibc strtod() both compute the converted value as</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<p>(I constructed this example by hand.)</p>
<h2>Incorrect Conversions by GLIBC Only</h2>
<p>Here are three examples that are converted incorrectly by glibc but converted correctly by gcc:</p>
<h3>Example 2 (Halfway Case)</h3>
<p>3.518437208883201171875e13 converts to this binary number:</p>
<p>1000000000000000000000000000000000000000000000.0000001<span class="highlight_gray4">1</span></p>
<p>It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value is</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>45</sup></p>
<p>glibc strtod() computes the converted value as</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>45</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<p>(This example was taken from &ldquo;<a title="Bugzilla Bug 3479: Incorrect rounding in strtod()" href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=3479">Bugzilla Bug 3479: Incorrect rounding in strtod()</a>.&rdquo;)</p>
<h3>Example 3 (Greater than Halfway Case)</h3>
<p>62.5364939768271845828 converts to this binary number, which has a non-terminating binary fraction portion:</p>
<p>111110.10001001010101111010101101010100111110010100011<span class="highlight_gray4">1</span>000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Bits 53 and 54 are equal to 1, so the correctly rounded value is</p>
<p>1.11110100010010101011110101011010101001111100101001 x 2<sup>5</sup></p>
<p>glibc strtod() computes the converted value as</p>
<p>1.1111010001001010101111010101101010100111110010100011 x 2<sup>5</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<p>(I discovered this example during random testing.)</p>
<h3>Example 4 (Less than Halfway Case)</h3>
<p>8.10109172351e-10 converts to this non-terminating binary fraction:</p>
<p>0.000000000000000000000000000000110111101010111001011101011101111000<br />
01111110100001100<span class="highlight_gray4">0</span>1111111111111<span class="highlight_gray4">0</span>&#8230;</p>
<p>Significant bit 54 is 0, so it should be rounded down to get the correctly rounded value</p>
<p>1.10111101010111001011101011101111000011111101000011 x 2<sup>-31</sup></p>
<p>glibc strtod() computes the converted value as</p>
<p>1.1011110101011100101110101110111100001111110100001101 x 2<sup>-31</sup></p>
<p>which is one ULP above the correctly rounded value.</p>
<p>(I discovered this example during random testing.)</p>
<h2>Incorrect Conversions by GCC Only</h2>
<p>Here are two examples that are converted incorrectly by gcc but converted correctly by glibc:</p>
<h3>Example 5 (Halfway Case)</h3>
<p>The number</p>
<p>1.50000000000000011102230246251565404236316680908203125</p>
<p>which equals 1 + 2<sup>-1</sup> + 2<sup>-53</sup>, converts to this binary number:</p>
<p>1.1000000000000000000000000000000000000000000000000000<span class="highlight_gray4">1</span></p>
<p>It rounds correctly to 1.1, since bit 53 is 0. </p>
<p>gcc converts it to </p>
<p>1.1000000000000000000000000000000000000000000000000001</p>
<p>which is one ULP above the correctly rounded value.</p>
<p>(I constructed this example by hand.)</p>
<h3>Example 6 (Less than Halfway Case)</h3>
<p>9007199254740991.4999999999999999999999999999999995 converts to this binary number, which has a non-terminating binary fraction portion:</p>
<p>11111111111111111111111111111111111111111111111111111.<span class="highlight_gray4">0</span>1111111111111<br />
1111111111111111111111111111111111111111111111111111111111111111111<br />
11111111111111111111111111111<span class="highlight_gray4">0</span>&#8230;</p>
<p>Bit 54 is 0, so it converts correctly to</p>
<p>1.1111111111111111111111111111111111111111111111111111 x 2<sup>52</sup></p>
<p>gcc converts it to</p>
<p>1.0 x 2<sup>53</sup></p>
<p>which is one ULP above the correctly rounded value.</p>
<p>(This example was taken from &ldquo;<a title="GCC Bugzilla Bug 21718: real.c rounding not perfect" href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21718">Bugzilla Bug 21718: real.c rounding not perfect</a>.&rdquo;)</p>
<h2>A C Program that Reports the Behavior</h2>
<p>Here&#8217;s a C program you can use to confirm incorrect rounding on your own Linux system:</p>
<pre>
// gcc -o examples examples.c
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define NUM_CONVERT 6

int main (void)
{
 unsigned int i;

 char decimal[NUM_CONVERT][60] = {
  "0.500000000000000166533453693773481063544750213623046875",
  "3.518437208883201171875e13",
  "62.5364939768271845828",
  "8.10109172351e-10",
  "1.50000000000000011102230246251565404236316680908203125",
  "9007199254740991.4999999999999999999999999999999995"
 };

 double gcc_conversion[NUM_CONVERT] = {
  0.500000000000000166533453693773481063544750213623046875,
  3.518437208883201171875e13,
  62.5364939768271845828,
  8.10109172351e-10,
  1.50000000000000011102230246251565404236316680908203125,
  9007199254740991.4999999999999999999999999999999995
};

 char correct[NUM_CONVERT][30] = {
  "0x1.0000000000002p-1",
  "0x1.0000000000002p+45",
  "0x1.f44abd5aa7ca4p+5",
  "0x1.bd5cbaef0fd0cp-31",
  "0x1.8p+0",
  "0x1.fffffffffffffp+52"
 };

 for (i=0; i &lt; NUM_CONVERT; i++)
 {
  printf ("%s\n",decimal[i]);
  printf (" Correct = %s\n",correct[i]);
  printf (" gcc =     %a\n",gcc_conversion[i]);
  printf (" strtod =  %a\n\n",strtod(decimal[i],NULL));
 }
}
</pre>
<p>This is the output on my system (I shaded the incorrect conversions):</p>
<pre>
0.500000000000000166533453693773481063544750213623046875
 Correct = 0x1.0000000000002p-1
 <span class="highlight_gray4">gcc =     0x1.0000000000001p-1</span>
 <span class="highlight_gray4">strtod =  0x1.0000000000001p-1</span>

3.518437208883201171875e13
 Correct = 0x1.0000000000002p+45
 gcc =     0x1.0000000000002p+45
 <span class="highlight_gray4">strtod =  0x1.0000000000001p+45</span>

62.5364939768271845828
 Correct = 0x1.f44abd5aa7ca4p+5
 gcc =     0x1.f44abd5aa7ca4p+5
 <span class="highlight_gray4">strtod =  0x1.f44abd5aa7ca3p+5</span>

8.10109172351e-10
 Correct = 0x1.bd5cbaef0fd0cp-31
 gcc =     0x1.bd5cbaef0fd0cp-31
 <span class="highlight_gray4">strtod =  0x1.bd5cbaef0fd0dp-31</span>

1.50000000000000011102230246251565404236316680908203125
 Correct = 0x1.8p+0
 <span class="highlight_gray4">gcc =     0x1.8000000000001p+0</span>
 strtod =  0x1.8p+0

9007199254740991.4999999999999999999999999999999995
 Correct = 0x1.fffffffffffffp+52
 <span class="highlight_gray4">gcc =     0x1p+53</span>
 strtod =  0x1.fffffffffffffp+52
</pre>
<p>The values for the correct conversions were obtained using <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>. I ran a separate program to compute them and then hard coded them in this program. (I could also have constructed the correct hex formats by hand, using my analysis in the examples above.)</p>
<p>If you try this on your system, let me know what happens; I expect the same output. Please include the type of Linux, and version numbers if possible.</p>
<h2>Bug Reports</h2>
<p>There are existing bug reports for gcc and glibc:</p>
<ul>
<li><a title="GCC Bugzilla Bug 21718: real.c rounding not perfect" href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21718">Bugzilla Bug 21718: real.c rounding not perfect</a>.</li>
<li><a title="Bugzilla Bug 3479: Incorrect rounding in strtod()" href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=3479">Bugzilla Bug 3479: Incorrect rounding in strtod()</a>.</li>
</ul>
<p>It&#8217;s not clear whether these bugs will (or should?) be fixed; if they are, I&#8217;d imagine <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a> would be used to replace the current code.</p>
<p>I would guess this problem affects the other GCC compilers, although I haven&#8217;t tested.</p>
<h2>GCC/GLIBC Vs. Visual C++</h2>
<p>Here are the differences I found between gcc/glibc and Visual C++ in my two analyses:</p>
<ul>
<li>glibc strtod() converted incorrectly about two orders of magnitude less frequently than Visual C++ strtod().</li>
<li>I found no integer values that rounded incorrectly in gcc or glibc, unlike in Visual C++.</li>
<li>gcc and glibc strtod() can convert the same decimal number differently; I found no such example in Visual C++ and its strtod() function.</li>
<li>Of the <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">five examples Visual C++ converts incorrectly</a>, only one &#8212; example 2 &#8212; is converted incorrectly by gcc or glibc (in this case, by both). Of the six examples in this article converted incorrectly by gcc or glibc, two are converted incorrectly by Visual C++: examples 1 and 2.</li>
<li>When gcc or glibc rounds halfway cases incorrectly, it does so either up or down. I found Visual C++ to consistently round incorrectly down.</li>
<li>I found an incorrect conversion for a &ldquo;less than halfway case&rdquo; in gcc/glibc; I found no such example in Visual C++.</li>
</ul>
<h2>Summary</h2>
<p>I&#8217;ve demonstrated that a decimal floating-point number is not guaranteed to convert to its <em>closest</em> binary floating-point number. Also, I&#8217;ve shown that a decimal floating-point number may be converted to <em>different</em> binary floating-point numbers, depending on which compiler or run-time library is used.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">Incorrectly Rounded Conversions in GCC and GLIBC</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Incorrectly Rounded Conversions in Visual C++</title>
		<link>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/</link>
		<comments>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/#comments</comments>
		<pubDate>Fri, 28 May 2010 18:56:23 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=283</guid>
		<description><![CDATA[In my analysis of decimal to floating-point conversion I noted an example that was converted incorrectly by the Microsoft Visual C++ compiler. I&#8217;ve found more examples &#8212; including a class of examples &#8212; that it converts incorrectly. I will analyze those examples in this article. My Analysis I did my analysis using Visual C++ Express [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my analysis of <a title="Read Rick Regan's article &ldquo;Decimal to Floating-Point Needs Arbitrary Precision&rdquo;" href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">decimal to floating-point conversion</a> I noted an example that was converted incorrectly by the Microsoft Visual C++ compiler. I&#8217;ve found more examples &#8212; including a class of examples &#8212; that it converts incorrectly. I will analyze those examples in this article.</p>
<p><span id="more-283"></span></p>
<h2>My Analysis</h2>
<p>I did my analysis using Visual C++ Express Edition &#8212; both the 2008 and 2010 versions. I converted decimal numbers to double-precision floating-point, via the strtod() library function and floating-point literals converted implicitly by the compiler. I discovered examples of incorrect conversion in three ways: by random testing (using only strtod() so that it could be automated easily), by trial and error, and by direct construction. </p>
<p>My random testing was extensive, but not exhaustive. I generated random decimal strings (positive numbers only) in scientific notation, with a random number of digits (up to 21) and a random exponent (between -308 and +308). I compared the Visual C++ conversion to that given by <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function</a>, the de facto standard correct conversion routine. This is what I found:</p>
<ul>
<li>Conversions were done incorrectly about three hundredths of a percent of the time.</li>
<li>No conversion was incorrect by more than one <em>unit in the last place</em> (ULP).</li>
<li>Incorrect conversions included both &ldquo;hard&rdquo; and &ldquo;easy&rdquo; cases (defined below).</li>
<li>All incorrect conversions that I verified by hand were either &ldquo;halfway&rdquo; or &ldquo;greater than halfway&rdquo; cases (defined below).</li>
</ul>
<p>For all the examples below, I used my C function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin()</a> to inspect the converted floating-point values. I verified that both strtod() and the compiler converted the same decimal number to the same double-precision floating-point value. I also verified the correctly converted value by hand, using the full-precision binary numbers generated by my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal to binary converter</a>.</p>
<h2><a name="halfway-case"></a>Halfway Cases</h2>
<p>A <em>halfway case</em> is when a decimal number converts to a full-precision binary number that is halfway between two binary floating-point values. By convention, in correct rounding, the tie is broken by choosing the floating-point value with its least significant bit equal to 0. </p>
<p>In terms of double-precision floating-point, the halfway case happens when significant bit 54 of the full-precision binary number is equal to 1, and all bits beyond are equal to 0. If bit 53 is equal to 0, then the full-precision binary number is rounded down to 53 bits; if bit 53 is equal to 1, then the full-precision binary number is rounded up to 53 bits. In either case, the correctly rounded result has 0 in its 53rd significant bit.</p>
<p>My testing shows that Visual C++ consistently rounds halfway cases incorrectly; specifically, it always rounds down, even when bit 53 of the full-precision binary number equals 1.</p>
<h3>Example 1 (Incorrect)</h3>
<p>The integer 9,214,843,084,008,499 converts to this binary number:</p>
<p>10000010111100110110011101100010101110111011000011001<span class="highlight_gray4">1</span></p>
<p>Bit 54 is 1, and with no other bits beyond, is a halfway case. Bit 53 is 1, so the correctly rounded value &#8212; in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> &#8212; is</p>
<p>1.000001011110011011001110110001010111011101100001101 x 2<sup>53</sup></p>
<p>Visual C++ computes the converted value as</p>
<p>1.0000010111100110110011101100010101110111011000011001 x 2<sup>53</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<h3>Example 2 (Incorrect)</h3>
<p>The <a title="Wikipedia article on dyadic fraction" href="http://en.wikipedia.org/wiki/Dyadic_fraction">dyadic fraction</a></p>
<p>0.500000000000000166533453693773481063544750213623046875, </p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup>, converts to this binary number:</p>
<p>0.100000000000000000000000000000000000000000000000000011</p>
<p>It has 54 significant bits, with bits 53 and 54 equal to 1. The correctly rounded value is</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>Visual C++ computes the converted value as</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<h3>Discussion</h3>
<p>Examples 1 and 2 would be considered &ldquo;easy&rdquo; cases in correct rounding parlance. They require only one extra bit beyond double precision to make the correct rounding decision. These examples support my assertion that Visual C++ universally adopts truncation for halfway cases.</p>
<h2>Greater than Halfway Cases</h2>
<p>The <em>greater than halfway case</em> occurs when a decimal number converts to a full-precision binary number that is between two binary floating-point values, but closer to the one above. Visual C++ gets some of these wrong and some of these right.</p>
<h3>Example 3 (Incorrect)</h3>
<p>The integer 30,078,505,129,381,147,446,200 converts to this binary number:</p>
<p>11001011110100011110001110010011100011011000000100000<span class="highlight_gray4">1</span>000000000<span class="highlight_gray4">1</span>0111<br />
0111000</p>
<p>Bit 54 is 1 and there is a 1 bit beyond bit 54, so the correctly rounded value is</p>
<p>1.1001011110100011110001110010011100011011000000100001 x 2<sup>74</sup></p>
<p>Visual C++ computes the converted value as</p>
<p>1.10010111101000111100011100100111000110110000001 x 2<sup>74</sup></p>
<p>which is one ULP below the correctly rounded value.</p>
<h3>Example 4 (Correct)</h3>
<p>The integer 1,777,820,000,000,000,000,001 converts to this binary number:</p>
<p>11000000110000000110101011011101011010010111000111101<span class="highlight_gray4">1</span>00000000000000<br />
00<span class="highlight_gray4">1</span></p>
<p>Just like the prior example, bit 54 is 1 and there is a 1 bit beyond it &#8212; but Visual C++ converts this <em><strong>correctly</strong></em>, to </p>
<p>1.100000011000000011010101101110101101001011100011111 x 2<sup>70</sup></p>
<h3>Example 5 (Incorrect)</h3>
<p>The dyadic fraction</p>
<p>0.500000000000000166547006220929549868969843373633921146392822265625</p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup> + 2<sup>-66</sup>, converts to this binary number:</p>
<p>0.10000000000000000000000000000000000000000000000000001<span class="highlight_gray4">1</span>00000000000<span class="highlight_gray4">1</span></p>
<p>It converts correctly to</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<p>But Visual C++ converts it to this, one ULP below the correctly rounded value:</p>
<p>1.0000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<h3>Example 6 (Correct)</h3>
<p>The dyadic fraction</p>
<p>0.50000000000000016656055874808561867439493653364479541778564453125 </p>
<p>which equals 2<sup>-1</sup> + 2<sup>-53</sup> + 2<sup>-54</sup> + 2<sup>-65</sup>, converts to this binary number:</p>
<p>0.10000000000000000000000000000000000000000000000000001<span class="highlight_gray4">1</span>0000000000<span class="highlight_gray4">1</span></p>
<p>Visual C++ converts this <em><strong>correctly</strong></em>, to</p>
<p>1.000000000000000000000000000000000000000000000000001 x 2<sup>-1</sup></p>
<h3>Example 7 (Incorrect)</h3>
<p>The decimal fraction 0.3932922657273 is non-terminating in binary (it&#8217;s not dyadic); here is the relevant portion of it &#8212; its first 82 places:</p>
<p>0.011001001010111011001101010010110001000110001000111001<span class="highlight_gray4">1</span>00000000000<br />
000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Significant bits 53 and 54 are both 1, as is significant bit 81; it rounds correctly to</p>
<p>1.100100101011101100110101001011000100011000100011101 x 2<sup>-2</sup></p>
<p>Visual C++ computes the value one ULP below:</p>
<p>1.1001001010111011001101010010110001000110001000111001 x 2<sup>-2</sup></p>
<h3>Discussion</h3>
<p>Visual C++ gets example 4 correct, even though it appears harder than example 3, which it converts incorrectly; example 4 has six more zeros beyond bit 54.</p>
<p>Examples 5 and 6 are dyadic fractions, which theoretically should be easy to convert. Example 5 is 66 digits long, and example 6 is 65 digits long. (Dyadic fractions have the same number of bits as decimal digits.)  Example 6 converts correctly, but example 5 does not. Is it because Visual C++ stops looking beyond 65 places?</p>
<p>Example 7 is a hard case, which Visual C++ does get wrong.</p>
<h2>My Bug Report</h2>
<p>In April 2009, I submitted a bug report to Microsoft, titled &ldquo;<a title="See Microsoft bug report &ldquo;Incorrectly Rounded Decimal to Binary Conversions in Visual C++&rdquo;" href="http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=434491">Incorrectly Rounded Decimal to Binary Conversions in Visual C++</a>.&rdquo; Microsoft closed the bug, but said they would &ldquo;re-consider [sic] for future versions of VC++&rdquo;. We&#8217;ll see.</p>
<p>You could argue that being only one ULP away from the correct result is inconsequential. Technically, I think it&#8217;s &lsquo;legal&rsquo; from the perspective of the IEEE floating-point specification. (This seems to be conventional wisdom, but I can&#8217;t find the exact wording in the spec &#8212; can someone enlighten me?) However, widely used open source code &#8212; <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function in dtoa.c</a> &#8212; gets these conversions right. Should Visual C++ do the same?</p>
<p> (I would imagine this &lsquo;problem&rsquo; also affects the rest of Visual Studio, but I haven&#8217;t investigated.)</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Incorrectly Rounded Conversions in Visual C++</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Decimal to Floating-Point Needs Arbitrary Precision</title>
		<link>http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/</link>
		<comments>http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/#comments</comments>
		<pubDate>Thu, 20 May 2010 18:42:28 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=282</guid>
		<description><![CDATA[In my article &#8220;Quick and Dirty Decimal to Floating-Point Conversion&#8221; I presented a small C program that converts a decimal string to a double-precision binary floating-point number. The number it produces, however, is not necessarily the closest &#8212; or so-called correctly rounded &#8212; double-precision binary floating-point number. This is not a failing of the algorithm; [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">Decimal to Floating-Point Needs Arbitrary Precision</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my article &ldquo;<a title="Read Rick Regan's article &ldquo;Quick and Dirty Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">Quick and Dirty Decimal to Floating-Point Conversion</a>&rdquo; I  presented a small C program that converts a decimal string to a double-precision binary floating-point number. The number it produces, however, is not necessarily the closest &#8212; or so-called <em>correctly rounded</em> &#8212; double-precision binary floating-point number. This is not a failing of the algorithm; mathematically speaking, the algorithm is correct. The flaw comes in its implementation in limited precision binary floating-point arithmetic. </p>
<p>The quick and dirty program is implemented in native C, so it&#8217;s limited to double-precision floating-point arithmetic (although on some systems, extended precision may be used). Higher precision arithmetic &#8212; in fact, arbitrary precision arithmetic &#8212; is needed to ensure that all decimal inputs are converted correctly. I will demonstrate the need for high precision by analyzing three examples, all taken from Vern Paxson&#8217;s paper &ldquo;<a title="Read Vern Paxson's paper &ldquo;A Program for Testing IEEE Decimal–Binary Conversion&rdquo;" href="http://www.icir.org/vern/papers/testbase-report.pdf">A Program for Testing IEEE Decimal–Binary Conversion</a>&rdquo;.</p>
<p><span id="more-282"></span></p>
<h2><a name="correct-rounding"></a>Correct Rounding</h2>
<p>A <a title="Read David M. Gay's paper &ldquo;Correctly Rounded Binary-Decimal and Decimal-Binary Conversions&rdquo;" href="http://www.ampl.com/REFS/rounding.pdf">correctly rounded conversion </a> from a decimal string to a double-precision floating-point number is the 53 significant bit value closest to the decimal input &#8212; and in the event of a tie, the value with significant bit number 53 equal to 0. (This is the prevailing definition, being based on the default IEEE 754 rounding mode. But to be truly correctly rounded &#8212; theoretically at least &#8212; <a title="Read Rick Regan's article &ldquo;Visual C++ and GLIBC strtod() Ignore Rounding Mode&rdquo;" href="http://www.exploringbinary.com/visual-c-plus-plus-and-glibc-strtod-ignore-rounding-mode/">all four IEEE 754 rounding modes must be honored</a>.) The closest 53-bit number is obtained by rounding what would be the full-precision binary number to 53 bits. This can be done correctly only if the correct value of bit 54 &#8212; <em>and sometimes the correct values of an arbitrary number of bits beyond</em> &#8212; is known. Here are the rules:</p>
<ul>
<li>If bit 54 is 0, round down</li>
<li>If bit 54 is 1
<ul>
<li>If any bit beyond bit 54 is 1, round up</li>
<li>If all bits beyond bit 54 are 0 (meaning the number is halfway between two floating-point numbers)
<ul>
<li>If bit 53 is 0, round down</li>
<li>If bit 53 is 1, round up</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>The rules seem simple enough, but there&#8217;s a problem: the calculation needs to be precise enough so that the true value of bit 54 and the appropriate number of bits beyond is known. Decimal numbers that are close to halfway between two floating-point numbers require the most precision to get right; the three examples that follow are such numbers.</p>
<h2>Example 1: 7.8459735791271921 x 10<sup>65</sup></h2>
<p>7.8459735791271921 x 10<sup>65</sup> is shorthand for this number:</p>
<p>784597357912719210000000000000000000000000000000000000000000000000</p>
<p>In binary, it equals</p>
<p>11101110011010000000010001001110000010011000101001110<span class="highlight_gray4">0</span>11111111111111<br />
11111111111111111111111111111111111111111111111111<span class="highlight_gray4">0</span>00111100101001110<br />
11111110111110111000110111111101010000000000000000000000000000000000<br />
000000000000000</p>
<p>Rounded correctly to 53 bits it equals</p>
<p>11101110011010000000010001001110000010011000101001110000000000000000<br />
00000000000000000000000000000000000000000000000000000000000000000000<br />
00000000000000000000000000000000000000000000000000000000000000000000<br />
000000000000000</p>
<p>which in <a title="Read Rick Regan's article &ldquo;Displaying IEEE Doubles in Binary Scientific Notation&rdquo;" href="http://www.exploringbinary.com/displaying-ieee-doubles-in-binary-scientific-notation/">binary scientific notation</a> equals</p>
<p>1.110111001101000000001000100111000001001100010100111 x 2<sup>218</sup></p>
<h3>Why This Conversion is Difficult in Floating-Point</h3>
<p>The full-precision binary value above was obtained using my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">decimal to binary converter</a>. The converter uses <a title="Read Rick Regan's Article &ldquo;Base Conversion in PHP Using BCMath&rdquo;" href="http://www.exploringbinary.com/base-conversion-in-php-using-bcmath/">arbitrary precision decimal arithmetic</a>, so every bit in the result is correct. Since bit 54 (the first highlighted bit above) is 0, it is rounded down to get the correctly rounded 53-bit result.</p>
<p>When the conversion is done in floating-point, things get a bit murky. With insufficient precision, it&#8217;s impossible to tell whether bit 54 is really 0, since it&#8217;s a tiny fraction away from being 1. The conversion algorithm cannot make the distinction unless the calculation is known to be precise through the 119th bit (the second highlighted bit above).</p>
<h3>Converting with Varying Floating-Point Precision</h3>
<p>To demonstrate the effect of floating-point precision, I modified my <a title="Read Rick Regan's article &ldquo;Quick and Dirty Decimal to Floating-Point Conversion&rdquo;" href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">quick and dirty conversion program</a> to use <a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">GMP floating-point arithmetic</a>. I increased the precision of the GMP conversions incrementally until I reached the intermediate result with sufficient precision; for each increment, I printed the result in binary and rounded correctly by hand.</p>
<p>I converted 7.8459735791271921 x 10<sup>65</sup> to double precision using the quick and dirty program  (53-bit precision) and GMP with 64, 96, and 128 bits of precision (on my system, GMP changes precision internally at 32-bit granularity, plus a few extra bits); here are the results, expressed in binary scientific notation:</p>
<ul>
<li>Quick and Dirty (53-bit precision):
<p>1.110111001101000000001000100111000001001100010101 x 2<sup>218</sup>
</p>
</li>
<li>GMP, with 64 and 96 bit precision:
<p>1.110111001101000000001000100111000001001100010100111 x 2<sup>218</sup>
</p>
</li>
<li>GMP, with 128-bit precision:
<p>1.110111001101000000001000100111000001001100010100111 x 2<sup>218</sup>
</p>
</li>
</ul>
<p>The quick and dirty result is <em>incorrect</em>, and is off by two units in the last place (ULPs). All three GMP values are correct, but the 64 and 96 bit values are correct by chance; before rounding, the intermediate result (not shown) has bit 53 equal to 0 and bit 54 equal to 1, with all bits beyond equal to 0. That makes it look like a halfway case, so it rounds down to the correct value &#8212; by sheer luck. The 128-bit GMP calculation is precise enough that the 119th bit is known to be 0, meaning bit 54 is 0 and that the result should be rounded down.</p>
<h2>Example 2: 3.571 x 10<sup>266</sup></h2>
<p>3.571 x 10<sup>266</sup> is an 886 bit binary number, so I&#8217;ll only show the relevant portion &#8212; its first 77 bits:</p>
<p>10110001001100100010011000110000111010100000110101001<span class="highlight_gray4">1</span>00000000000000<br />
00000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded correctly to 53 bits it equals</p>
<p>1.011000100110010001001100011000011101010000011010101 x 2<sup>885</sup></p>
<p>Here are the quick and dirty and GMP conversions:</p>
<ul>
<li>Quick and Dirty (53-bit precision):
<p>1.011000100110010001001100011000011101010000011011001 x 2<sup>885</sup>
</p>
</li>
<li>GMP, with 64-bit precision:
<p>1.0110001001100100010011000110000111010100000110101001 x 2<sup>885</sup>
</p>
</li>
<li>GMP, with 96-bit precision:
<p>1.011000100110010001001100011000011101010000011010101 x 2<sup>885</sup>
</p>
</li>
</ul>
<p>The quick and dirty and 64-bit GMP results are both <em>incorrect</em>; the quick and dirty result is off by eight ULPs, and the 64-bit GMP result is off by one ULP. The 64-bit conversion goes wrong because GMP computes this intermediate value (it&#8217;s 66 bits, since GMP can carry extra bits beyond the expected precision):</p>
<p>10110001001100100010011000110000111010100000110101001<span class="highlight_gray4">0</span>11111111111<span class="highlight_gray4">0</span></p>
<p>At 64-bit precision, bit 54 appears to be 0, and hence the result is rounded down.</p>
<p>As expected, the 96-bit GMP result is correct, since only 77 bits of precision are needed to decide the fate of bit 54.</p>
<h2>Example 3: 3.08984926168550152811 x 10<sup>-32</sup></h2>
<p>3.08984926168550152811 x 10<sup>-32</sup> is a decimal fraction which, when written in binary, is non-terminating; for our purposes, we only need to see its first 234 places (104 leading 0s, and significant bits 1-53 and 54-130):</p>
<p>0.000000000000000000000000000000000000000000000000000000000000000000<br />
00000000000000000000000000000000000000101000000110111100100100001100<br />
11101100110010100111010<span class="highlight_gray4">1</span>00000000000000000000000000000000000000000000<br />
0000000000000000000000000000000<span class="highlight_gray4">1</span>&#8230;</p>
<p>Rounded correctly to 53 bits it equals</p>
<p>1.0100000011011110010010000110011101100110010100111011 x 2<sup>-105</sup></p>
<p>Here are the quick and dirty and GMP conversions:</p>
<ul>
<li>Quick and Dirty (53-bit precision):
<p>1.010000001101111001001000011001110110011001010011101 x 2<sup>-105</sup>
</p>
</li>
<li>GMP, with 64 and 96 bit precision:
<p>1.010000001101111001001000011001110110011001010011101 x 2<sup>-105</sup>
</p>
</li>
<li>GMP, with 128-bit precision:
<p>1.0100000011011110010010000110011101100110010100111011 x 2<sup>-105</sup>
</p>
</li>
<li>GMP, with 160-bit precision:
<p>1.0100000011011110010010000110011101100110010100111011 x 2<sup>-105</sup>
</p>
</li>
</ul>
<p>The quick and dirty, 64-bit GMP, and 96-bit GMP results are all <em>incorrect</em>; they are all the same, and are off by 1 ULP. (The Visual C++ compiler gets this conversion wrong too, getting the same wrong answer!)</p>
<p>The 128-bit GMP result is correct by accident. The intermediate value used to compute it (not shown) has a 0 at bit 53, a 1 at bit 54, and then a string of 74 0s followed by a 1 (yes, 129 bits). The full-precision result has <em>75</em> 0s followed by a 1. The difference does not matter though, since both bit patterns indicate that the result should be rounded up.</p>
<p>The 160-bit GMP result is correct, which is expected since 130 bits of the full-precision binary result are needed to guarantee the right rounding decision is made.</p>
<h2>How Much Precision is Enough?</h2>
<p>In the examples above, I knew how much precision was needed for correct rounding because I already knew the full-precision binary equivalents of the decimal inputs &#8212; by external means. But how would a floating-point conversion routine know how much precision is enough? Naively, it would have to assume the worst case, computing with enough precision to match the input (for integer values, this would mean computing the full-precision binary equivalent). That would be overkill, since most of the time much less precision is needed to make a correct rounding decision.</p>
<p><a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function in dtoa.c</a> &#8212; the de facto standard conversion routine &#8212; avoids this problem by avoiding high precision floating-point. It uses standard double-precision floating-point for most conversions, and arbitrary precision integer arithmetic for &ldquo;tough&rdquo; conversions; it has the smarts to know when to use one or the other. (I will discuss that code in a future article.)</p>
<h2>Acknowledgement</h2>
<p>Thanks to <a title="Vern Paxson's page at Berkeley" href="http://www.eecs.berkeley.edu/Faculty/Homepages/paxson.html">Vern Paxson</a> for helping me understand the &ldquo;very close to but less than 1/2 ULP case.&rdquo;</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/">Decimal to Floating-Point Needs Arbitrary Precision</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/decimal-to-floating-point-needs-arbitrary-precision/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Quick and Dirty Decimal to Floating-Point Conversion</title>
		<link>http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/</link>
		<comments>http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 15:21:20 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=281</guid>
		<description><![CDATA[This little C program converts a decimal value &#8212; represented as a string &#8212; into a double-precision floating-point number: #include &#60;string.h&#62; int main (void) { double intPart = 0, fracPart = 0, conversion; unsigned int i; char decimal[] = &#34;3.14159&#34;; i = 0; /* Left to right */ while (decimal[i] != '.') { intPart = [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">Quick and Dirty Decimal to Floating-Point Conversion</a></p>
]]></description>
			<content:encoded><![CDATA[<p>This little C program converts a decimal value &#8212; represented as a string &#8212; into a double-precision floating-point number: </p>
<pre>
#include &lt;string.h&gt;

int main (void)
{
 double intPart = 0, fracPart = 0, conversion;
 unsigned int i;
 char decimal[] = &quot;3.14159&quot;;

 i = 0; /* Left to right */
 while (decimal[i] != '.') {
    <strong>intPart = intPart*10 + (decimal[i] - '0');</strong>
    i++;
   }

 i = strlen(decimal)-1; /* Right to left */
 while (decimal[i] != '.') {
    <strong>fracPart = (fracPart + (decimal[i] - '0'))/10;</strong>
    i--;
   }

 conversion = intPart + fracPart;
}
</pre>
<p>The conversion is done using the elegant Horner&#8217;s method, summing each digit according to its decimal place value. So why do I call it &ldquo;quick and <em>dirty</em>?&rdquo; Because the binary floating-point value it produces is not necessarily the closest approximation to the input decimal value &#8212; the so-called <em>correctly rounded</em> result. (Remember that most real numbers cannot be represented exactly in floating-point.) Most of the time it <em>will</em> produce the correctly rounded result, but sometimes it won&#8217;t &#8212; the result will be off in its least significant bit(s). There&#8217;s just not enough precision in floating-point to guarantee the result is correct every time.</p>
<p>I will demonstrate this algorithm with different input values, some of which convert correctly, and some of which don&#8217;t. In the end, you&#8217;ll appreciate one reason why library functions like strtod() exist &#8212; to perform efficient, correctly rounded conversion.</p>
<p><span id="more-281"></span></p>
<h2>Testing the Conversions</h2>
<p>How do you test whether a quick and dirty conversion is correct? First you need to know what the correctly rounded value is, and then you need to compare it to the quick and dirty value.</p>
<p>Here are two ways to find the correctly rounded value:</p>
<ul>
<li>Assign the decimal constant to a double and let your compiler compute it for you.</li>
<li>Enter the decimal constant in my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">arbitrary-precision decimal/binary converter</a> and then round it by hand to 53 significant bits (the number of significant bits in a double).</li>
</ul>
<p>I did both, because I know that sometimes even compilers do the conversions incorrectly (see my articles on <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in Visual C++&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-visual-c-plus-plus/">Visual C++</a> and <a title="Read Rick Regan's article &ldquo;Incorrectly Rounded Conversions in GCC and GLIBC&rdquo;" href="http://www.exploringbinary.com/incorrectly-rounded-conversions-in-gcc-and-glibc/">gcc/glibc</a>). In any case, I verified that all the compiler conversions in the examples below are correct.</p>
<p>Next we need to compare the quick and dirty value to the correctly rounded value &#8212; but how? <a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/" title="Read Rick Regan's Article &ldquo;When Doubles Don’t Behave Like Doubles&rdquo;">Comparing floating-point values for equality is not reliable</a>. That means we should display the values and compare them manually. But we can&#8217;t depend on printf(), because in many implementations it won&#8217;t print all the significant digits, either <a title="Read Rick Regan's article &ldquo;Print Precision of Floating-Point Integers Varies Too&rdquo;" href="http://www.exploringbinary.com/print-precision-of-floating-point-integers-varies-too/">before the decimal point</a> or <a title="Read Rick Regan's article &ldquo;Print Precision of Dyadic Fractions Varies by Language&rdquo;" href="http://www.exploringbinary.com/print-precision-of-dyadic-fractions-varies-by-language/">after the decimal point</a> (this is true of Visual C++, the compiler I am using). This leaves two options:</p>
<ul>
<li>Print the <a title="Read Rick Regan's article &ldquo;Displaying the Raw Fields of a Floating-Point Number&rdquo;" href="http://www.exploringbinary.com/displaying-the-raw-fields-of-a-floating-point-number/">raw IEEE representation</a> of the double.</li>
<li>Print the value of the double in binary, using my function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin()</a>.</li>
</ul>
<p>I chose to use fp2bin(), because seeing the binary representation is more enlightening.</p>
<h3>The Test Code</h3>
<p>I modified the code in the introduction of this article to allow for testing, as described; here it is</p>
<pre>
#include &lt;string.h&gt;
#include &lt;stdio.h&gt;
#include &quot;fp2bin.h&quot;

int main (void)
{
 char binString[FP2BIN_STRING_MAX];
 double intPart = 0, fracPart = 0, conversion;
 unsigned int i;
 char decimal[] = &quot;<strong>3.14159</strong>&quot;; /* Converted manually */
 double compiler_conversion = <strong>3.14159</strong>; /* Converted by compiler */

 i = 0; /* Left to right */
 while (decimal[i] != '.')
   {
    intPart = intPart*10 + (decimal[i] - '0');
    i++;
   }

 i = strlen(decimal)-1; /* Right to left */
 while (decimal[i] != '.')
   {
    fracPart = (fracPart + (decimal[i] - '0'))/10;
    i--;
   }

 conversion = intPart + fracPart;

 /* Print the converted values, in binary */
 fp2bin(compiler_conversion,binString);
 printf(&quot;Correct = %s\n&quot;,binString);
 fp2bin(conversion,binString);
 printf(&quot;Q and D = %s\n&quot;,binString);
}
</pre>
<p>When run (I ran on an Intel Core Duo processor), the output shows that the compiler and quick and dirty conversions of <strong>3.14159</strong> are both correct:</p>
<pre>
Correct = 11.00100100001111110011111000000011011100001100110111
Q and D = 11.00100100001111110011111000000011011100001100110111
</pre>
<p>Their value in decimal, obtained from my <a title="Rick Regan's Decimal/Binary Converter" href="http://www.exploringbinary.com/binary-converter/">binary/decimal converter</a>, is<br />
3.14158999999999988261834005243144929409027099609375.</p>
<h2>Incorrect Conversions</h2>
<p>Incorrect conversions are due to the limitations of floating-point arithmetic; specifically, these three things affect the conversion calculations:</p>
<ul>
<li><strong>Large integers require more bits than can fit in a floating-point variable</strong>.
<p>Multiplying a binary integer by ten is exact, but once the integer exceeds the largest integer that a double can represent exactly, rounding errors can occur.</p>
</li>
<li><strong>Division by ten in binary arithmetic is inexact</strong>.
<p>Each time the fractional value is divided by ten, there is a chance of roundoff error.</p>
</li>
<li><strong>Addition of integers and fractions can incur rounding error</strong>.
<p>When an integer and fraction are added, some of the bits of the fraction can be lost.</p>
</li>
</ul>
<p>The example values below &#8212; which I discovered by both random testing and trial and error &#8212; are all rounded incorrectly by my quick and dirty algorithm.</p>
<h3>The Integer 18014398509481993</h3>
<p>The number 18,014,398,509,481,993 (2<sup>54</sup> + 9) is not converted correctly by the quick and dirty algorithm; this is the output of the program (the lines I changed are <em>char decimal[] = &#8220;18014398509481993.0&#8243;;</em> and <em>double compiler_conversion = 18014398509481993.0;</em>): </p>
<pre>
Correct = 1000000000000000000000000000000000000000000000000001000.0
Q and D = 1000000000000000000000000000000000000000000000000001<span class="highlight_gray4">100</span>.0
</pre>
<p>Their values, in decimal, are</p>
<ul>
<li>Correct = 18014398509481992 = 2<sup>54</sup> + <span class="highlight_gray4">8</span></li>
<li>Q and D = 18014398509481996 = 2<sup>54</sup> + <span class="highlight_gray4">12</span></li>
</ul>
<p>These values differ by 4, which is one <a title="Wikipedia article &ldquo;Unit in the last place&rdquo;" href="http://en.wikipedia.org/wiki/Unit_in_the_last_place">unit in the last place</a> (the last place for a double is the 53rd significant bit).</p>
<h3>The Fraction 0.9199</h3>
<p>The number 0.9199 is not converted correctly by the quick and dirty algorithm; this is the output of the program (the lines I changed are <em>char decimal[] = &#8220;0.9199&#8243;;</em> and  <em>double compiler_conversion = 0.9199;</em>): </p>
<pre>
Correct = 0.11101011011111101001000011111111100101110010010001111
Q and D = 0.1110101101111110100100001111111110010111001001000111<span class="highlight_gray4"> </span>
</pre>
<p>Their values, in decimal, are</p>
<ul>
<li>Correct = 0.91990000000000005098144129078718833625316619873046875
<p> = 828572259443623<span class="highlight_gray4">9</span>/2<sup>53</sup></p>
</li>
<li>Q and D = 0.9198999999999999399591388282715342938899993896484375
<p> = 4142861297218119/2<sup>52</sup> = 828572259443623<span class="highlight_gray4">8</span>/2<sup>53</sup></p>
</li>
</ul>
<p>These values differ by 2<sup>-53</sup>, which is one unit in the last place.</p>
<h3>The Combined Integer and Fraction 1.89</h3>
<p>The number 1.89 is not converted correctly by the quick and dirty algorithm; this is the output of the program (the lines I changed are <em>char decimal[] = &#8220;1.89&#8243;;</em> and <em>double compiler_conversion = 1.89;</em>): </p>
<pre>
Correct = 1.1110001111010111000010100011110101110000101000111101
Q and D = 1.11100011110101110000101000111101011100001010001111<span class="highlight_gray4">1 </span>
</pre>
<p>Their values, in decimal, are</p>
<ul>
<li>Correct = 1.8899999999999999023003738329862244427204132080078125
<p> = 1 + 400820366835974<span class="highlight_gray4">1</span>/2<sup>52</sup></p>
</li>
<li>Q and D = 1.890000000000000124344978758017532527446746826171875
<p> = 1 + 2004101834179871/2<sup>51</sup> = 1 + 400820366835974<span class="highlight_gray4">2</span>/2<sup>52</sup></p>
</li>
</ul>
<p>These values differ by 2<sup>-52</sup>, which is <em>two</em> units in the last place.</p>
<p>The interesting thing is that 0.89 by itself converts correctly; it&#8217;s only when it&#8217;s added to 1 that the final result becomes incorrect.</p>
<h2>Discussion</h2>
<p>In practice, you&#8217;ll depend on your compiler or interpreter or a library call for correctly rounded conversion. The de facto standard code used by many implementations is <a title="David Gay's dtoa.c" href="http://www.netlib.org/fp/dtoa.c">David Gay&#8217;s strtod() function in dtoa.c</a>. Take a look at it to get a sense of how involved  correctly rounded conversion is. (dtoa.c handles scientific notation, and also includes code for correct conversion in the other direction &#8212; floating-point to decimal).</p>
<p>For a little more detailed discussion of Horner&#8217;s method, see my PHP functions <a title="Read Rick Regan's Article &ldquo;Base Conversion in PHP Using BCMath&rdquo;" href="http://www.exploringbinary.com/base-conversion-in-php-using-bcmath/">bin2dec_i() and bin2dec_f()</a>. Those two functions implement the same quick and dirty algorithm, only using decimal arithmetic and powers of two (there&#8217;s nothing &ldquo;dirty&rdquo; about them though &#8212; they&#8217;re binary to decimal and arbitrary precision).</p>
<h3>Conversion to Single-Precision</h3>
<p>Conversion to floats works similarly, also producing correctly and incorrectly rounded results. There&#8217;s one thing I observed, which you might find surprising: a given decimal value may convert incorrectly to a double but <em>correctly</em> to a float!</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/">Quick and Dirty Decimal to Floating-Point Conversion</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/quick-and-dirty-decimal-to-floating-point-conversion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>In Search of Decimal/Binary/Hexadecimal Palindromes</title>
		<link>http://www.exploringbinary.com/in-search-of-decimal-binary-hexadecimal-palindromes/</link>
		<comments>http://www.exploringbinary.com/in-search-of-decimal-binary-hexadecimal-palindromes/#comments</comments>
		<pubDate>Wed, 14 Apr 2010 15:33:17 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Binary numbers]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=279</guid>
		<description><![CDATA[Are there any multiple digit hexadecimal number palindromes that are also palindromic in binary and decimal? I have been searching but have not found any. I started my search with my program that finds multiple-base palindromes. I generated palindromes in binary, and then checked them to see if they were also palindromes in hexadecimal and [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/in-search-of-decimal-binary-hexadecimal-palindromes/">In Search of Decimal/Binary/Hexadecimal Palindromes</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Are there any multiple digit hexadecimal number palindromes that are also palindromic in binary and decimal? I have been searching but have not found any.</p>
<p>I started my search with my <a href="http://www.exploringbinary.com/finding-numbers-that-are-palindromic-in-multiple-bases/" title="Read Rick Regan's Article &ldquo;Finding Numbers That Are Palindromic In Multiple Bases&rdquo;">program that finds multiple-base palindromes</a>. I generated palindromes in binary, and then checked them to see if they were also palindromes in hexadecimal and decimal. I looked for decimal/binary/hexadecimal palindromes up to 16 hex digits long, but did not find any.</p>
<p>To continue my search into bigger numbers, I wrote a program that uses arbitrary-precision integer arithmetic and a more efficient algorithm. Despite being able to search much further, I still have not found any.</p>
<p>In this article, I&#8217;ll analyze the size of the palindrome &ldquo;search space&rdquo;, explain my improved algorithm, and discuss the state of my search.</p>
<p><span id="more-279"></span></p>
<h2>Search Space for Decimal/Binary/Hexadecimal Palindromes</h2>
<p>The biggest obstacle in the search for palindromes is the growth in the amount of numbers that must be checked &#8212; the <em>search space</em>. As the number of digits increases, the search space grows exponentially. While this exponential growth can&#8217;t be eliminated, it can be delayed &#8212; by pruning the search space.</p>
<p>Consider the search space for 32 hexadecimal digit decimal/binary/hexadecimal palindromes. With no pruning, 15&middot;16<sup>31</sup> integers would have to be checked &#8212; 319,014,718,988,379,809,496,913,694,467,282,698,240 numbers, an impossible task. If only hexadecimal palindromes are considered, the search space is reduced to <a href="http://www.exploringbinary.com/counting-binary-and-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;Counting Binary and Hexadecimal Palindromes&rdquo;"> 15&middot;16<sup>15</sup></a> = 17,293,822,569,102,704,640 numbers &#8212; much smaller but still intractable. If only <a href="http://www.exploringbinary.com/the-structure-of-binary-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;The Structure of Binary/Hexadecimal Palindromes&rdquo;">binary/hexadecimal palindromes</a> are considered, then there are <a href="http://www.exploringbinary.com/counting-binary-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;Counting Binary/Hexadecimal Palindromes&rdquo;">2<sup>16</sup> + 2<sup>32</sup></a> = 4,295,032,832 numbers to be checked &#8212; quite manageable!</p>
<p>The binary/hexadecimal palindrome space can be partitioned by starting digit so that subsets can be searched independently. Consider the 32 hexadecimal digit binary/hexadecimal palindromes. There are <a href="http://www.exploringbinary.com/counting-binary-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;Counting Binary/Hexadecimal Palindromes&rdquo;">2<sup>16</sup></a> = 65,536 that begin with the digits 1 and 3, and <a href="http://www.exploringbinary.com/counting-binary-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;Counting Binary/Hexadecimal Palindromes&rdquo;">2<sup>32</sup></a> = 4,294,967,296 that begin with the digits 5, 7, 9, and F. The number of palindromes that start with 1 and 3 is markedly lower, allowing for a deeper search within them.</p>
<p>The following diagrams illustrate how the search space grows, showing the search spaces for the 16, 32, and 64 digit hexadecimal palindromes:</p>
<div class="wp-caption aligncenter" style="width: 470px"><img src="http://www.exploringbinary.com/wp-content/uploads/bin-hex-search-space.16.png" alt="Number of 16-Digit Hexadecimal Palindromes, by Subset (not to scale)" width="460" height="273"/><p class="wp-caption-text">Number of 16-Digit Hexadecimal Palindromes, by Subset (<strong>not to scale</strong>)</p></div>
<div class="wp-caption aligncenter" style="width: 471px"><img src="http://www.exploringbinary.com/wp-content/uploads/bin-hex-search-space.32.png" alt="Number of 32-Digit Hexadecimal Palindromes, by Subset (not to scale)" width="461" height="271"/><p class="wp-caption-text">Number of 32-Digit Hexadecimal Palindromes, by Subset (<strong>not to scale</strong>)</p></div>
<div class="wp-caption aligncenter" style="width: 468px"><img src="http://www.exploringbinary.com/wp-content/uploads/bin-hex-search-space.64.png" alt="Number of 64-Digit Hexadecimal Palindromes, by Subset (not to scale)" width="458" height="274"/><p class="wp-caption-text">Number of 64-Digit Hexadecimal Palindromes, by Subset (<strong>not to scale</strong>)</p></div>
<h2>Program to Find Decimal/Binary/Hexadecimal Palindromes</h2>
<p>I wrote a C program to search for decimal/binary/hexadecimal palindromes &#8212; it is based on two things:</p>
<ul>
<li><strong>Arbitrary-precision integers</strong>.
<p>This allows for palindrome numbers that are too big to fit in an <em>unsigned long long</em> integer variable; that is, numbers bigger than 64 bits, or 16 hex digits. I used <a href="http://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/" title="Read Rick Regan's Article &ldquo;How to Install and Run GMP on Windows Using MPIR&rdquo;">GMP in Visual C++ on Windows</a>.</p>
</li>
<li><strong>Direct generation of binary/hexadecimal palindromes</strong>.
<p>This reduces the search space dramatically, so that only known binary/hexadecimal palindromes are checked &#8212; for being palindromic in decimal. I used a simple algorithm derived from my <a href="http://www.exploringbinary.com/the-structure-of-binary-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;The Structure of Binary/Hexadecimal Palindromes&rdquo;">analysis of binary/hexadecimal palindromes</a>.</p>
</li>
</ul>
<p>I will outline the algorithm instead of providing the source code.</p>
<h3>The Algorithm</h3>
<p>Binary/hexadecimal palindromes could be generated in two ways. One way would be incrementally, by generating the two-digit palindromes, then the three-digit palindromes, then the four-digit palindromes, etc. A set of m-digit palindromes would be built from a set of m-1 digit palindromes, by inserting a digit in the middle of each m-1 digit palindrome. This requires storage of palindromes, which will not scale &#8212; the number of palindromes grows exponentially.</p>
<p>Another way to generate binary/hexadecimal palindromes &#8212; the way I chose &#8212; is to construct them from permutations of hex digits. Each of the six subsets of binary/hexadecimal palindromes has its own rules for valid digits:</p>
<ul>
<li>Starts/ends with 1: each digit between can be 0 or 1</li>
<li>Starts/ends with 3: each digit between can be 0 or 3</li>
<li>Starts/ends with 5: each digit between can be 0, 2, 5, or 7</li>
<li>Starts/ends with 7: each digit between can be 0, 2, 5, or 7</li>
<li>Starts/ends with 9: each digit between can be 0, 6, 9, or F</li>
<li>Starts/ends with F: each digit between can be 0, 6, 9, or F</li>
</ul>
<p>You construct palindromes of n digits from permutations of floor(n/2) digits. For example, to generate a 32 digit palindrome that starts with 5, create a number that starts with 5 followed by a 15 digit permutation of the digits 0, 2, 5, and 7. Then, reverse that number and append it to itself to create a 32 digit palindrome. (The process for odd length palindromes is similar, except you insert an extra middle digit between the two halves.)</p>
<p>Permutations are easily generated, with a series of nested loops. You can nest loops statically, by coding a loop for each place of the permutation, or you can nest them dynamically, by calling a single loop recursively. The advantage of dynamic nesting is not having to add loops manually for increasing digit lengths; the advantage of static nesting is the ability to &ldquo;checkpoint&rdquo; the state of the search easily &#8212; an important feature for long running programs. I chose static nesting for that reason.</p>
<h2>Results</h2>
<p>I&#8217;ve searched the &ldquo;1/3 space&rdquo; through all 64 hex digit numbers, and the  &ldquo;5/7/9/F space&rdquo; through all 32 hex digit numbers. I have not found a single multi-digit decimal/binary/hexadecimal palindrome. </p>
<p>I have suspended my search, since the running time of my program is starting to explode. For example, it took almost 60 hours on my Intel Core Duo processor to test all 2<sup>32</sup> 64 hex digit binary/hexadecimal palindromes in the 1/3 space. Resuming the search would probably require a <a href="http://www.mersenne.org/" title="Home page of GIMPS, the Great Internet Mersenne Prime Search">GIMPS</a>-like distributive computing approach, which I have no plans to undertake.</p>
<h2>Discussion</h2>
<p>Are there any tricks to reduce the search space further, delaying the inevitable exponential explosion even more? <a title="Charlton's Binary/Decimal Palindrome Page" href="http://bach.dynet.com/palin/">Charlton&#8217;s binary/decimal palindrome search algorithm</a> eliminates swaths of numbers as it runs, recognizing conditions under which copalindromic numbers can&#8217;t exist. Can similar ideas be applied to the search for decimal/binary/hexadecimal palindromes?</p>
<h3>Another Way to Count Binary/Hexadecimal Palindromes</h3>
<p>In my article &ldquo;<a href="http://www.exploringbinary.com/counting-binary-hexadecimal-palindromes/" title="Read Rick Regan's Article &ldquo;Counting Binary/Hexadecimal Palindromes&rdquo;">Counting Binary/Hexadecimal Palindromes</a>&rdquo; I derived formulas to count binary/hexadecimal palindromes using geometric series. An alternate way would be based on counting permutations. For example, the number of 32 digit palindromes starting with 5 is the number of 15 digit permutations of the digits 0, 2, 5, and 7; that is, 4<sup>15</sup> =  2<sup>30</sup> = 1,073,741,824.</p>
<p>(You would still need geometric series to count the number of palindromes of <em>n-digits or less</em>.)</p>
<h2>Are There Any?</h2>
<p>Are there any decimal/binary/hexadecimal palindromes? Can it be proved either way?</p>
<p>I have no reason to believe they don&#8217;t exist. After all, <a href="http://www.exploringbinary.com/finding-numbers-that-are-palindromic-in-multiple-bases/" title="Read Rick Regan's Article &ldquo;Finding Numbers That Are Palindromic In Multiple Bases&rdquo;">there are at least two multi-digit decimal/binary/octal palindromes</a>. And it&#8217;s always possible I made an error in my search, although <a title="Charlton's Binary/Decimal Palindrome Page" href="http://bach.dynet.com/palin/">Charlton&#8217;s results</a> confirm my results through 32 hex digits plus.</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/in-search-of-decimal-binary-hexadecimal-palindromes/">In Search of Decimal/Binary/Hexadecimal Palindromes</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/in-search-of-decimal-binary-hexadecimal-palindromes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Barbie Goes Binary</title>
		<link>http://www.exploringbinary.com/barbie-goes-binary/</link>
		<comments>http://www.exploringbinary.com/barbie-goes-binary/#comments</comments>
		<pubDate>Sat, 10 Apr 2010 18:38:05 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Binary code]]></category>
		<category><![CDATA[Geekware]]></category>
		<category><![CDATA[Pop culture]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=277</guid>
		<description><![CDATA[In case you haven&#8217;t heard, Mattel&#174; has created Computer Engineer Barbie&#174;, based on popular vote. Here&#8217;s the laptop she is holding: It spells &#8220;Barbie&#8221; &#8212; repeatedly, in ASCII code. B-a-r-b-i-e The screen shows six full lines of code &#8212; three copies of these two lines: 01000010 01100001 01110010 01100010 01101001 01100101 Those happen to be [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/barbie-goes-binary/">Barbie Goes Binary</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In case you haven&#8217;t heard, Mattel&reg; has created <a href="http://shop.mattel.com/product/index.jsp?productId=4032107" title="Computer Engineer Barbie">Computer Engineer Barbie</a>&reg;, based on <a href="http://www.barbie.com/vote/" title="Voting for Computer Engineer Barbie">popular vote</a>. Here&#8217;s the laptop she is holding:</p>
<div class="wp-caption aligncenter" style="width: 396px"><img src="http://www.exploringbinary.com/wp-content/uploads/Barbie.laptop.binary.jpg" alt="Binary Code on Barbie's Laptop" width="386" height="306"/><p class="wp-caption-text">Binary Code on Barbie's Laptop</p></div>
<p>It <a href="http://online.wsj.com/article/SB20001424052702304198004575171791681002592.html" title="Wall Street Journal Article &ldquo;Revenge of the Nerds: How Barbie Got Her Geek On &rdquo;">spells &ldquo;Barbie&rdquo;</a> &#8212; repeatedly, in ASCII code.</p>
<p><span id="more-277"></span></p>
<h2>B-a-r-b-i-e</h2>
<p>The screen shows six full lines of code &#8212; three copies of these two lines:</p>
<p>01000010 01100001 01110010<br />
01100010 01101001 01100101</p>
<p>Those happen to be six <a href="http://en.wikipedia.org/wiki/ASCII" title="Wikipedia Article on ASCII code">ASCII </a> character codes, written in binary:</p>
<p>01000010 = B<br />
01100001 = a<br />
01110010 = r<br />
01100010 = b<br />
01101001 = i<br />
01100101 = e</p>
<p>I&#8217;ve overlaid the characters on the screen so you can see the correspondence better:</p>
<div class="wp-caption aligncenter" style="width: 396px"><img src="http://www.exploringbinary.com/wp-content/uploads/Barbie.laptop.ascii.jpg" alt="Barbie's Laptop Overlaid With Corresponding ASCII Characters" width="386" height="306"/><p class="wp-caption-text">Barbie's Laptop Overlaid with Corresponding ASCII Characters</p></div>
<p class="break">(Now if instead those were numbers encoded in IEEE floating-point I&#8217;d have been really impressed <img src='http://www.exploringbinary.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .)</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/barbie-goes-binary/">Barbie Goes Binary</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/barbie-goes-binary/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>When Doubles Don’t Behave Like Doubles</title>
		<link>http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/</link>
		<comments>http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/#comments</comments>
		<pubDate>Fri, 09 Apr 2010 17:15:35 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=276</guid>
		<description><![CDATA[In my article &#8220;When Floats Don’t Behave Like Floats&#8221; I explained how calculations involving single-precision floating-point variables may be done, under the covers, in double or extended precision. This leads to anomalies in expected results, which I demonstrated with two C programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel Core [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/">When Doubles Don’t Behave Like Doubles</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In my article &ldquo;<a href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/" title="Read Rick Regan's Article &ldquo;When Floats Don’t Behave Like Floats&rdquo;">When Floats Don’t Behave Like Floats</a>&rdquo; I explained how calculations involving single-precision floating-point variables may be done, under the covers, in double or extended precision. This leads to anomalies in expected results, which I demonstrated with two C programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor.</p>
<p>In this article, I&#8217;ll do a similar analysis for <em>double</em>-precision floating-point variables, showing how similar anomalies arise when extended precision calculations are done. I modified my two example programs to use doubles instead of floats. Interestingly, the doubles version of program 2 does not exhibit the anomaly. I&#8217;ll explain.</p>
<p><span id="more-276"></span></p>
<h2>The Environment in Which the Anomalies Occur</h2>
<p>To see the described anomalies, floating-point calculations must be done in precision greater than double-precision. For Intel processors, this means that x87 FPU instructions must be used (as opposed to <a title="Wikipedia article defining SSE" href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">SSE instructions</a>), and the processor must be calculating in <strong>extended-precision</strong> mode. </p>
<p>Your compiler determines which instructions to generate, although you may influence it with compiler options. On my system, Visual C++ generates x87 FPU instructions by default. As for the precision mode, it has a default &#8212; which is double-precision on my system &#8212; that can be changed by a call to the <em>controlfp</em> function. I call this function in the programs below.</p>
<h2>Program 1</h2>
<pre>
#include &quot;stdio.h&quot;
#include &quot;float.h&quot;
int main (void)
{
 double d1 = 0.3333333333333333, d2 = 3.0, d3;
 unsigned int cur_cw;

 _controlfp_s(&#038;cur_cw,_PC_64,MCW_PC); 

 d3 = d1 * d2;
 if (d3 != d1 * d2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Program 1 prints &ldquo;<strong>Not equal</strong>&rdquo;, for the same reason its float counterpart does: greater precision is being used under the covers.</p>
<h3>Computing d3</h3>
<p>Using my function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin</a>, I printed the binary values of the three floating point variables in the computation:</p>
<ul>
<li>d1 = 0.010101010101010101010101010101010101010101010101010101</li>
<li>d2 = 11</li>
<li>d3 = 1</li>
</ul>
<p>The value of d3 is only an approximation to d1 * d2. Here&#8217;s the true value of d1 * d2, computed by hand:</p>
<pre>
  0.010101010101010101010101010101010101010101010101010101
  x                                                     11
  --------------------------------------------------------
     10101010101010101010101010101010101010101010101010101
    10101010101010101010101010101010101010101010101010101
  --------------------------------------------------------
  0.11111111111111111111111111111111111111111111111111111<strong>1</strong>
</pre>
<p>The product is 54 bits, which can be represented exactly in extended precision, but only approximately in double-precision. Since bit 54 is a 1, the value is rounded up, to 53 bits. The result is d3 = 1.</p>
<p>Here&#8217;s the relevant assembler code generated by the compiler:</p>
<pre>
 <strong>d3 = d1 * d2;</strong>
0041131F  fld         qword ptr [d1]
00411322  fmul        qword ptr [d2]
00411325  fstp        qword ptr [d3]
</pre>
<h3>Comparing d3 and d1 * d2</h3>
<p>The assembler code shows what is happening:</p>
<pre>
 <strong>if (d3 != d1 * d2)</strong>
00411328  fld         qword ptr [d1]
0041132B  fmul        qword ptr [d2]
0041132E  fcomp       qword ptr [d3]
</pre>
<p>d1 * d2 is computed in extended precision and left on the stack. It is then compared to d3, which still has 53 bits of precision even though it’s “promoted” to extended precision. The two compared values will differ.</p>
<h2>Program 2</h2>
<pre>
#include &quot;stdio.h&quot;
#include &quot;float.h&quot;
int main (void)
{
 double d1 = 0.3333333333333333, d2 = 3.0, d3;
 long long i1, i2;
 unsigned int cur_cw;

 _controlfp_s(&#038;cur_cw,_PC_64,MCW_PC); 

 d3 = d1 * d2;
 i1 = (int)d3;
 i2 = (int)(d1 * d2);
 if (i1 != i2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Program 2 does not print anything, which is surprising knowing that its float counterpart does. According to our analysis of program 1 &#8212; program 2 uses the same values for d1 and d2 &#8212; d1 * d2 rounds to 1 in double precision, but is slightly less than 1 in extended precision. Therefore, we would expect i1 = 1 and i2 = 0. Instead, both are 1.</p>
<p>The explanation is simple: the extended precision calculation of d1 * d2 is &ldquo;spilled&rdquo; into a double before the conversion to integer, so it will be equal to d3:</p>
<h3>Computing i2</h3>
<pre>
 <strong>i2 = (int)(d1 * d2);</strong>
004114A5  fld         qword ptr [d1]
004114A8  fmul        qword ptr [d2]
004114AB  call        @ILT+245(__ftol2_sse) (4110FAh)
           ||
           ||
           \/
--- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm
00411B40  cmp         dword ptr [___sse2_available (417598h)],0
00411B47  je          _ftol2 (411B76h)
00411B49  push        ebp
00411B4A  mov         ebp,esp
00411B4C  sub         esp,8
00411B4F  and         esp,0FFFFFFF8h
00411B52  <span class="highlight_gray4">fstp        qword ptr [esp]</span>
00411B55  cvttsd2si   eax,mmword ptr [esp]
00411B5A  leave
00411B5B  ret
</pre>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/">When Doubles Don’t Behave Like Doubles</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>When Floats Don’t Behave Like Floats</title>
		<link>http://www.exploringbinary.com/when-floats-dont-behave-like-floats/</link>
		<comments>http://www.exploringbinary.com/when-floats-dont-behave-like-floats/#comments</comments>
		<pubDate>Tue, 30 Mar 2010 15:07:21 +0000</pubDate>
		<dc:creator>Rick Regan</dc:creator>
				<category><![CDATA[Numbers in computers]]></category>
		<category><![CDATA[Binary arithmetic]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Code]]></category>
		<category><![CDATA[Convert to binary]]></category>
		<category><![CDATA[Decimals]]></category>
		<category><![CDATA[Floating-point]]></category>

		<guid isPermaLink="false">http://www.exploringbinary.com/?p=275</guid>
		<description><![CDATA[These two programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel Core Duo processor &#8212; demonstrate an anomaly that occurs when using single-precision floating point variables: Program 1 #include &#34;stdio.h&#34; int main (void) { float f1 = 0.1f, f2 = 3.0f, f3; f3 = f1 * f2; if (f3 != f1 [...]<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/">When Floats Don&#8217;t Behave Like Floats</a></p>
]]></description>
			<content:encoded><![CDATA[<p>These two programs &#8212; compiled with Microsoft Visual C++ and run on a 32-bit Intel  Core Duo processor &#8212; demonstrate an anomaly that occurs when using single-precision floating point variables:</p>
<p><strong>Program 1</strong></p>
<pre>
#include &quot;stdio.h&quot;
int main (void)
{
 float f1 = 0.1f, f2 = 3.0f, f3;

 f3 = f1 * f2;
 if (f3 != f1 * f2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Prints &ldquo;<strong>Not equal</strong>&rdquo;.</p>
<p><strong>Program 2</strong></p>
<pre>
#include &quot;stdio.h&quot;
int main (void)
{
 float f1 = 0.7f, f2 = 10.0f, f3;
 int i1, i2;

 f3 = f1 * f2;
 i1 = (int)f3;
 i2 = (int)(f1 * f2);
 if (i1 != i2)
   printf(&quot;Not equal\n&quot;);
}
</pre>
<p>Prints &ldquo;<strong>Not equal</strong>&rdquo;.</p>
<p>In each case, f3 and f1 * f2 differ. But why? I&#8217;ll explain what&#8217;s going on.</p>
<p><span id="more-275"></span>(This article was inspired by <a title="Stack Overflow Question &ldquo;Why is my number being rounded incorrectly?&rdquo;" href="http://stackoverflow.com/questions/2509576/why-is-my-number-being-rounded-incorrectly">this question</a> and several related questions on stackoverflow.com.)</p>
<h2>Analyzing Program 1</h2>
<p>Program 1 compares two floating point values for equality &#8212; something conventional wisdom says not to do. But why should that rule apply in this case? We&#8217;re just checking that a variable holds the value we just gave it. How could it not? </p>
<p>The root of the &ldquo;problem&rdquo; is this: the compiler generates instructions that do floating-point calculations in <a title="Wikipedia article defining extended precision" href="http://en.wikipedia.org/wiki/Extended_precision">extended precision</a>. Whereas floats are 4 bytes long and have 24 significant bits of precision, extended precision values &#8212; which are stored in registers on the floating point stack &#8212; are 10 bytes long and have 64 bits of precision. </p>
<h3>Computing f3</h3>
<p>Using my function <a href="http://www.exploringbinary.com/converting-floating-point-numbers-to-binary-strings-in-c/" title="Read Rick Regan's Article &ldquo;Converting Floating-Point Numbers to Binary Strings in C&rdquo;">fp2bin</a>, I printed the binary values of the three floating point variables in the computation:</p>
<ul>
<li>f1 = 0.000110011001100110011001101</li>
<li>f2 = 11</li>
<li>f3 = 0.010011001100110011001101</li>
</ul>
<p>The value of f3 is only an approximation to f1 * f2. To understand why, let&#8217;s calculate the true value of f1 * f2; that is, let&#8217;s calculate it by hand, using binary multiplication:</p>
<pre>
  0.000110011001100110011001101
  x                          11
  -----------------------------
       110011001100110011001101
      110011001100110011001101
-------------------------------
  0.010011001100110011001100111
</pre>
<p>So f1 * f2 = 0.010011001100110011001100111. It fits comfortably within extended precision. But to assign it to f3 &#8212; a float &#8212; it must be rounded. f1 * f2 has 26 significant bits, but a float holds only 24. Rounding it to the nearest 24 bit value makes it 0.010011001100110011001101.</p>
<p>The assembler code generated by the compiler confirms what we&#8217;re seeing:</p>
<pre>
 <strong>f3 = f1 * f2;</strong>
0041130B  fld         dword ptr [f1]
0041130E  fmul        dword ptr [f2]
00411311  <span class="highlight_gray4">fstp        dword ptr [f3]</span>
</pre>
<p>f1 * f2 is computed in extended precision &#8212; enough bits to hold its true value &#8212; but that extra precision is lost when stored in f3.</p>
<h3>Comparing f3 and f1 * f2</h3>
<p>The answer is in the assembler code, so let&#8217;s get right to it:</p>
<pre>
 <strong>if (f3 != f1 * f2)</strong>
00411314  fld         dword ptr [f1]
00411317  fmul        dword ptr [f2]
0041131A  fld         dword ptr [f3]
0041131D  <span class="highlight_gray4">fucompp</span>
</pre>
<p>Again, f1 * f2 is computed in extended precision, but this time <strong>its true value is retained</strong> &#8212; it is left on the stack. f3 is then loaded onto the stack (of course it still has only 24 bits of precision, even though it&#8217;s been &ldquo;promoted&rdquo; to extended precision). The two values on top of the stack are then compared. Clearly, they differ.</p>
<h2>Analyzing Program 2</h2>
<p>Program 2 &ldquo;fails&rdquo; for the same reason as program 1, except that the &ldquo;error&rdquo; is magnified by the conversion of f3 and f1 * f2 to integers: the integer part of f3 is <strong>7</strong>, and  the integer part of f1 * f2 is <strong>6</strong>.</p>
<h3>Computing i1</h3>
<p>The binary values of the three floating point variables in the computation are:</p>
<ul>
<li>f1 = 0.101100110011001100110011</li>
<li>f2 = 1010</li>
<li>f3 = 111</li>
</ul>
<p>The value of f3 is an integer. To understand why, let&#8217;s calculate the true value of f1 * f2:</p>
<pre>
  0.101100110011001100110011
  x                     1010
  --------------------------
                           0
   101100110011001100110011
                         0
 101100110011001100110011
----------------------------
110.11111111111111111111111
</pre>
<p>So f1 * f2 = 110.11111111111111111111111. To assign it to f3 it must be rounded. f1 * f2 has 26 significant bits, and rounding it to the nearest 24 bit value makes it 111, or 7 decimal. Clearly, this means i1 will be 7 as well.</p>
<p>Here&#8217;s the assembler code:</p>
<pre>
 <strong>f3 = f1 * f2;</strong>
0041130B  fld         dword ptr [f1]
0041130E  fmul        dword ptr [f2]
00411311  fstp        dword ptr [f3]
 <strong>i1 = (int)f3;</strong>
00411314  fld         dword ptr [f3]
00411317  call        @ILT+155(__ftol2_sse) (4110A0h)
           ||
           ||
           \/
--- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm
00411600  cmp         dword ptr [___sse2_available (416554h)],0
00411607  je          _ftol2 (411636h)
00411609  push        ebp
0041160A  mov         ebp,esp
0041160C  sub         esp,8
0041160F  and         esp,0FFFFFFF8h
00411612  fstp        qword ptr [esp]
00411615  cvttsd2si   eax,mmword ptr [esp]
0041161A  leave
0041161B  ret
</pre>
<h3>Computing i2</h3>
<p>Here&#8217;s the assembler code for computing i2:</p>
<pre>
 <strong>i2 = (int)(f1 * f2);</strong>
0041131F  fld         dword ptr [f1]
00411322  fmul        dword ptr [f2]
00411325  call        @ILT+155(__ftol2_sse) (4110A0h)
           ||
           ||
           \/
--- f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\ftol2.asm
00411600  cmp         dword ptr [___sse2_available (416554h)],0
00411607  je          _ftol2 (411636h)
00411609  push        ebp
0041160A  mov         ebp,esp
0041160C  sub         esp,8
0041160F  and         esp,0FFFFFFF8h
00411612  <span class="highlight_gray4">fstp        qword ptr [esp]</span>
00411615  <span class="highlight_gray4">cvttsd2si   eax,mmword ptr [esp]</span>
0041161A  leave
0041161B  ret
</pre>
<p>The interesting thing in the calculation of i2 is that f1 * f2 is stored in a <em>double</em> before being cast to an integer (see the highlighted fstp instruction). f1 * f2 in double precision is 110.11111111111111111111111; that is, its true 26 bit value. Casting this to an integer results in 110, or decimal 6. If the compiler had stored f1 * f2 in a float and used the cvtts<strong>s</strong>2si instruction instead, the answer would have been 7 &#8212; just like in the first calculation.</p>
<h3>Discussion</h3>
<p>In program 2, it would appear that single-precision is more accurate than extended precision. After all, forcing the value into a float gives the expected answer. This is just a happy coincidence. Two losses of precision &#8212; the conversion of 0.7 to floating-point and the rounding up of the product &#8212; have effectively canceled each other out.</p>
<h2>Behavior Depends on the Processor and How It Is Used</h2>
<p>This anomaly may not occur on your machine. It depends on your processor, and in particular, the instructions used and the mode that it&#8217;s in. My example programs were compiled into Intel x87 FPU instructions, and ran with the x87 precision control field set to 53-bits. (I said above that my programs were using extended precision; technically they weren&#8217;t, but double precision is sufficient to make them &ldquo;fail&rdquo;. The true values of f1 * f2 are less than 53 bits.)</p>
<p>Intel processors can also do floating point using SSE (Streaming SIMD Extensions) instructions. I recompiled my programs using the Visual C++ compiler options /arch:SSE (single-precision floating-point) and /arch:SSE2  (double-precision floating-point). For the SSE option, there was no change; the compiler, at its discretion, still decided to generate x87 instructions. (See the <a title="A comment on this article" href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/#comment-4316">comment below</a> about the Mac&#8217;s use of SSE instructions making the anomaly disappear.)</p>
<p>Recompiling with /arch:SSE2 I got different assembler code, but the same output; here&#8217;s the assembler code for program 1:</p>
<pre>
<strong>f3 = f1 * f2;</strong>
00411313  cvtss2sd    xmm0,dword ptr [f1]
00411318  cvtss2sd    xmm1,dword ptr [f2]
0041131D  mulsd       xmm0,xmm1
00411321  cvtsd2ss    xmm0,xmm0
00411325  movss       dword ptr [f3],xmm0
<strong>if (f3 != f1 * f2)</strong>
0041132A  cvtss2sd    xmm0,dword ptr [f1]
0041132F  cvtss2sd    xmm1,dword ptr [f2]
00411334  mulsd       xmm0,xmm1
00411338  cvtss2sd    xmm1,dword ptr [f3]
0041133D  ucomisd     xmm1,xmm0
</pre>
<p>The SSE double-precision instructions are used. f3 &#8212; single precision &#8212; is compared to the double-precision intermediate result f1 * f2, resulting in a mismatch.</p>
<h3>Please Try it Out</h3>
<p>If you have access to a different compiler or processor, please try these programs out. Let me know what you find!</p>
<h2>The Message</h2>
<p>You can&#8217;t count on floats being handled as single precision values; they can be processed in double or extended precision, as dictated by your compiler and CPU.</p>
<p class="break">(I have written a companion article called &ldquo;<a href="http://www.exploringbinary.com/when-doubles-dont-behave-like-doubles/" title="Read Rick Regan's Article &ldquo;When Doubles Don’t Behave Like Doubles&rdquo;">When Doubles Don’t Behave Like Doubles</a>&rdquo;.)</p>
<p>By Rick Regan (Copyright &copy; 2008-2010  <a href="http://www.exploringbinary.com">Exploring Binary</a>)<br/><br/><a href="http://www.exploringbinary.com/when-floats-dont-behave-like-floats/">When Floats Don&#8217;t Behave Like Floats</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.exploringbinary.com/when-floats-dont-behave-like-floats/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
