From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounces@dev.tarantool.org>
Received: from [87.239.111.99] (localhost [127.0.0.1])
	by dev.tarantool.org (Postfix) with ESMTP id AB60F5C571B;
	Tue, 15 Aug 2023 14:40:27 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org AB60F5C571B
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev;
	t=1692099627; bh=55lfRDtA+hZm90UQDeTK5ix39dH54FuGUxRoUof6LvA=;
	h=Date:To:References:In-Reply-To:Subject:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc:
	 From;
	b=h0Km1ELEyNx4CVKazD3GiSEa4wUIHvbXVsJZZ/vR0R1kb8wbV4NAfWfo5AKfXDnTE
	 MWQFClK3y/A6KRvfSOJXF7YLFAWz4NSKwSK1jHaxzkIySrNu06qkWU+p+Nryz8g4rR
	 XcTuTnHnylRdaRsSN2/hB8rfJKJzS1JmnkWxL+4I=
Received: from smtp38.i.mail.ru (smtp38.i.mail.ru [95.163.41.79])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 71A145C3F0F
 for <tarantool-patches@dev.tarantool.org>;
 Tue, 15 Aug 2023 14:40:26 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 71A145C3F0F
Received: by smtp38.i.mail.ru with esmtpa (envelope-from
 <m.kokryashkin@tarantool.org>)
 id 1qVsPQ-00FzZh-2b; Tue, 15 Aug 2023 14:40:25 +0300
Date: Tue, 15 Aug 2023 14:40:24 +0300
To: Sergey Kaplun <skaplun@tarantool.org>
Message-ID: <rp5gu6r44yohnvc4tain5loxcooe2tuetakh3vvwzqp2ro7rey@p3lhokgnfpjz>
References: <cover.1691592488.git.skaplun@tarantool.org>
 <e0d5e1d22c2349991be8efff5dae3e5533696c97.1691592488.git.skaplun@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <e0d5e1d22c2349991be8efff5dae3e5533696c97.1691592488.git.skaplun@tarantool.org>
X-Mailru-Src: smtp
X-7564579A: EEAE043A70213CC8
X-77F55803: 4F1203BC0FB41BD94C9043905E7830E16779D9C599B337D77B38BEB1DB8D0DBE182A05F5380850404C228DA9ACA6FE2787AB39036EC102D205ADAB8EC0E925BC2687BBA1D3B71E6E2D2B3D8723A03DAD
X-C1DE0DAB: 0D63561A33F958A55579A279E3A411A7B2F3DEF5FC14535CA052159452E0E7F4F87CCE6106E1FC07E67D4AC08A07B9B0B6FBD635917D924DCB5012B2E24CD356
X-C8649E89: 1C3962B70DF3F0AD5177F0B940C8B66ECE892A7B2722663E91682638B966EB3F662256BEEFA9527F2A8DD4D9C8BCFC0C0AC187366C223CE2C8F441919B6D790E5DBF48453DF986F666DACC4934B28F5D37FD76D11AF80A0DAA5229C04A0AE798A3C3F0F8C431862DEA455F16B58544A21C197AAF4D2E4732965026E5D17F6739C77C69D99B9914278E50E1F0597A6FD5CD72808BE417F3B9E0E7457915DAA85F
X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojJ1ceUZTkowlZhIg2WBnU1A==
X-Mailru-Sender: 11C2EC085EDE56FA38FD4C59F7EFE40702317F2B32A88D7934AEDB7BDED1A8E2F267DDCB615D7F84D51284F0FE6F529ABC7555A253F5B200DF104D74F62EE79D27EC13EC74F6107F4198E0F3ECE9B5443453F38A29522196
X-Mras: OK
Subject: Re: [Tarantool-patches] [PATCH luajit 05/19] PPC: Add soft-float
 support to interpreter.
X-BeenThere: tarantool-patches@dev.tarantool.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
From: Maxim Kokryashkin via Tarantool-patches
 <tarantool-patches@dev.tarantool.org>
Reply-To: Maxim Kokryashkin <m.kokryashkin@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Errors-To: tarantool-patches-bounces@dev.tarantool.org
Sender: "Tarantool-patches" <tarantool-patches-bounces@dev.tarantool.org>

Hi, Sergey!
Thanks for the patch!
LGTM, except for a few comments below.
On Wed, Aug 09, 2023 at 06:35:54PM +0300, Sergey Kaplun via Tarantool-patches wrote:
> From: Mike Pall <mike>
> 
> Contributed by Djordje Kovacevic and Stefan Pejic from RT-RK.com.
> Sponsored by Cisco Systems, Inc.
> 
> (cherry-picked from commit 71b7bc88341945f13f3951e2bb5fd247b639ff7a)
> 
> The software floating point library is used on machines which do not
> have hardware support for floating point [1]. This patch enables
> support for such machines in the VM for powerpc.
Typo: s/powerpc/PowerPC/
> This includes:
> * Any loads/storages of double values use load/storage through 32-bit
Typo: s/storages/stores/ Feel free to ignore, though.
>   registers of `lo` and `hi` part of the TValue union.
> * Macro .FPU is added to skip instructions necessary only for
>   hard-float operations (load/store floating point registers from/on the
>   stack, when leave/enter VM, for example).
Typo: s/leave/enter/leaving/entering/
> * Now r25 named as `SAVE1` is used as saved temporary register (used in
>   different fast functions)
> * `sfi2d` macro is introduced to convert integer, that represents a
Typo: s/convert/convert an/
>   soft-float, to double. Receives destination and source registers, uses
Typo: s/to double/to a double/
>   `TMP0` and `TMP1`.
> * `sfpmod` macro is introduced for soft-float point `fmod` built-in.
> * `ins_arith` now receives the third parameter -- operation to use for
>   soft-float point.
> * `LJ_ARCH_HASFPU`, `LJ_ABI_SOFTFP` macros are introduced to mark that
>   there is defined `_SOFT_FLOAT` or `_SOFT_DOUBLE`. `LJ_ARCH_NUMMODE` is
>   set to the `LJ_NUMMODE_DUAL`, when `LJ_ABI_SOFTFP` is true.
> 
> Support of soft-float point for the JIT compiler will be added in the
> next patch.
> 
> [1]: https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
> 
> Sergey Kaplun:
> * added the description for the feature
> 
> Part of tarantool/tarantool#8825
> ---
>  src/host/buildvm_asm.c |    2 +-
>  src/lj_arch.h          |   29 +-
>  src/lj_ccall.c         |   38 +-
>  src/lj_ccall.h         |    4 +-
>  src/lj_ccallback.c     |   30 +-
>  src/lj_frame.h         |    2 +-
>  src/lj_ircall.h        |    2 +-
>  src/vm_ppc.dasc        | 1249 +++++++++++++++++++++++++++++++++-------
>  8 files changed, 1101 insertions(+), 255 deletions(-)
> 
> diff --git a/src/host/buildvm_asm.c b/src/host/buildvm_asm.c
> index ffd14903..43595b31 100644
> --- a/src/host/buildvm_asm.c
> +++ b/src/host/buildvm_asm.c
> @@ -338,7 +338,7 @@ void emit_asm(BuildCtx *ctx)
>  #if !(LJ_TARGET_PS3 || LJ_TARGET_PSVITA)
>      fprintf(ctx->fp, "\t.section .note.GNU-stack,\"\"," ELFASM_PX "progbits\n");
>  #endif
> -#if LJ_TARGET_PPC && !LJ_TARGET_PS3
> +#if LJ_TARGET_PPC && !LJ_TARGET_PS3 && !LJ_ABI_SOFTFP
>      /* Hard-float ABI. */
>      fprintf(ctx->fp, "\t.gnu_attribute 4, 1\n");
>  #endif
> diff --git a/src/lj_arch.h b/src/lj_arch.h
> index c39526ea..8bb8757d 100644
> --- a/src/lj_arch.h
> +++ b/src/lj_arch.h
> @@ -262,6 +262,29 @@
>  #else
>  #define LJ_ARCH_BITS		32
>  #define LJ_ARCH_NAME		"ppc"
> +
> +#if !defined(LJ_ARCH_HASFPU)
> +#if defined(_SOFT_FLOAT) || defined(_SOFT_DOUBLE)
> +#define LJ_ARCH_HASFPU		0
> +#else
> +#define LJ_ARCH_HASFPU		1
> +#endif
> +#endif
> +
> +#if !defined(LJ_ABI_SOFTFP)
> +#if defined(_SOFT_FLOAT) || defined(_SOFT_DOUBLE)
> +#define LJ_ABI_SOFTFP		1
> +#else
> +#define LJ_ABI_SOFTFP		0
> +#endif
> +#endif
> +#endif
> +
> +#if LJ_ABI_SOFTFP
> +#define LJ_ARCH_NOJIT		1  /* NYI */
> +#define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL
> +#else
> +#define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL_SINGLE
>  #endif
>  
>  #define LJ_TARGET_PPC		1
> @@ -271,7 +294,6 @@
>  #define LJ_TARGET_MASKSHIFT	0
>  #define LJ_TARGET_MASKROT	1
>  #define LJ_TARGET_UNIFYROT	1	/* Want only IR_BROL. */
> -#define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL_SINGLE
>  
>  #if LJ_TARGET_CONSOLE
>  #define LJ_ARCH_PPC32ON64	1
> @@ -431,16 +453,13 @@
>  #error "No support for ILP32 model on ARM64"
>  #endif
>  #elif LJ_TARGET_PPC
> -#if defined(_SOFT_FLOAT) || defined(_SOFT_DOUBLE)
> -#error "No support for PowerPC CPUs without double-precision FPU"
> -#endif
>  #if !LJ_ARCH_PPC64 && LJ_ARCH_ENDIAN == LUAJIT_LE
>  #error "No support for little-endian PPC32"
>  #endif
>  #if LJ_ARCH_PPC64
>  #error "No support for PowerPC 64 bit mode (yet)"
>  #endif
> -#ifdef __NO_FPRS__
> +#if defined(__NO_FPRS__) && !defined(_SOFT_FLOAT)
>  #error "No support for PPC/e500 anymore (use LuaJIT 2.0)"
>  #endif
>  #elif LJ_TARGET_MIPS32
> diff --git a/src/lj_ccall.c b/src/lj_ccall.c
> index d39ff861..c1e12f56 100644
> --- a/src/lj_ccall.c
> +++ b/src/lj_ccall.c
> @@ -388,6 +388,24 @@
>  #define CCALL_HANDLE_COMPLEXARG \
>    /* Pass complex by value in 2 or 4 GPRs. */
>  
> +#define CCALL_HANDLE_GPR \
> +  /* Try to pass argument in GPRs. */ \
> +  if (n > 1) { \
> +    lua_assert(n == 2 || n == 4);  /* int64_t or complex (float). */ \
> +    if (ctype_isinteger(d->info) || ctype_isfp(d->info)) \
> +      ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> +    else if (ngpr + n > maxgpr) \
> +      ngpr = maxgpr;  /* Prevent reordering. */ \
> +  } \
> +  if (ngpr + n <= maxgpr) { \
> +    dp = &cc->gpr[ngpr]; \
> +    ngpr += n; \
> +    goto done; \
> +  } \
> +
> +#if LJ_ABI_SOFTFP
> +#define CCALL_HANDLE_REGARG  CCALL_HANDLE_GPR
> +#else
>  #define CCALL_HANDLE_REGARG \
>    if (isfp) {  /* Try to pass argument in FPRs. */ \
>      if (nfpr + 1 <= CCALL_NARG_FPR) { \
> @@ -396,24 +414,16 @@
>        d = ctype_get(cts, CTID_DOUBLE);  /* FPRs always hold doubles. */ \
>        goto done; \
>      } \
> -  } else {  /* Try to pass argument in GPRs. */ \
> -    if (n > 1) { \
> -      lua_assert(n == 2 || n == 4);  /* int64_t or complex (float). */ \
> -      if (ctype_isinteger(d->info)) \
> -	ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> -      else if (ngpr + n > maxgpr) \
> -	ngpr = maxgpr;  /* Prevent reordering. */ \
> -    } \
> -    if (ngpr + n <= maxgpr) { \
> -      dp = &cc->gpr[ngpr]; \
> -      ngpr += n; \
> -      goto done; \
> -    } \
> +  } else { \
> +    CCALL_HANDLE_GPR \
>    }
> +#endif
>  
> +#if !LJ_ABI_SOFTFP
>  #define CCALL_HANDLE_RET \
>    if (ctype_isfp(ctr->info) && ctr->size == sizeof(float)) \
>      ctr = ctype_get(cts, CTID_DOUBLE);  /* FPRs always hold doubles. */
> +#endif
>  
>  #elif LJ_TARGET_MIPS32
>  /* -- MIPS o32 calling conventions ---------------------------------------- */
> @@ -1081,7 +1091,7 @@ static int ccall_set_args(lua_State *L, CTState *cts, CType *ct,
>    }
>    if (fid) lj_err_caller(L, LJ_ERR_FFI_NUMARG);  /* Too few arguments. */
>  
> -#if LJ_TARGET_X64 || LJ_TARGET_PPC
> +#if LJ_TARGET_X64 || (LJ_TARGET_PPC && !LJ_ABI_SOFTFP)
>    cc->nfpr = nfpr;  /* Required for vararg functions. */
>  #endif
>    cc->nsp = nsp;
> diff --git a/src/lj_ccall.h b/src/lj_ccall.h
> index 59f66481..6efa48c7 100644
> --- a/src/lj_ccall.h
> +++ b/src/lj_ccall.h
> @@ -86,9 +86,9 @@ typedef union FPRArg {
>  #elif LJ_TARGET_PPC
>  
>  #define CCALL_NARG_GPR		8
> -#define CCALL_NARG_FPR		8
> +#define CCALL_NARG_FPR		(LJ_ABI_SOFTFP ? 0 : 8)
>  #define CCALL_NRET_GPR		4	/* For complex double. */
> -#define CCALL_NRET_FPR		1
> +#define CCALL_NRET_FPR		(LJ_ABI_SOFTFP ? 0 : 1)
>  #define CCALL_SPS_EXTRA		4
>  #define CCALL_SPS_FREE		0
>  
> diff --git a/src/lj_ccallback.c b/src/lj_ccallback.c
> index 224b6b94..c33190d7 100644
> --- a/src/lj_ccallback.c
> +++ b/src/lj_ccallback.c
> @@ -419,6 +419,23 @@ void lj_ccallback_mcode_free(CTState *cts)
>  
>  #elif LJ_TARGET_PPC
>  
> +#define CALLBACK_HANDLE_GPR \
> +  if (n > 1) { \
> +    lua_assert(((LJ_ABI_SOFTFP && ctype_isnum(cta->info)) ||  /* double. */ \
> +		ctype_isinteger(cta->info)) && n == 2);  /* int64_t. */ \
> +    ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> +  } \
> +  if (ngpr + n <= maxgpr) { \
> +    sp = &cts->cb.gpr[ngpr]; \
> +    ngpr += n; \
> +    goto done; \
> +  }
> +
> +#if LJ_ABI_SOFTFP
> +#define CALLBACK_HANDLE_REGARG \
> +  CALLBACK_HANDLE_GPR \
> +  UNUSED(isfp);
> +#else
>  #define CALLBACK_HANDLE_REGARG \
>    if (isfp) { \
>      if (nfpr + 1 <= CCALL_NARG_FPR) { \
> @@ -427,20 +444,15 @@ void lj_ccallback_mcode_free(CTState *cts)
>        goto done; \
>      } \
>    } else {  /* Try to pass argument in GPRs. */ \
> -    if (n > 1) { \
> -      lua_assert(ctype_isinteger(cta->info) && n == 2);  /* int64_t. */ \
> -      ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> -    } \
> -    if (ngpr + n <= maxgpr) { \
> -      sp = &cts->cb.gpr[ngpr]; \
> -      ngpr += n; \
> -      goto done; \
> -    } \
> +    CALLBACK_HANDLE_GPR \
>    }
> +#endif
>  
> +#if !LJ_ABI_SOFTFP
>  #define CALLBACK_HANDLE_RET \
>    if (ctype_isfp(ctr->info) && ctr->size == sizeof(float)) \
>      *(double *)dp = *(float *)dp;  /* FPRs always hold doubles. */
> +#endif
>  
>  #elif LJ_TARGET_MIPS32
>  
> diff --git a/src/lj_frame.h b/src/lj_frame.h
> index 2bdf3c48..5cb3d639 100644
> --- a/src/lj_frame.h
> +++ b/src/lj_frame.h
> @@ -226,7 +226,7 @@ enum { LJ_CONT_TAILCALL, LJ_CONT_FFI_CALLBACK };  /* Special continuations. */
>  #define CFRAME_OFS_L		36
>  #define CFRAME_OFS_PC		32
>  #define CFRAME_OFS_MULTRES	28
> -#define CFRAME_SIZE		272
> +#define CFRAME_SIZE		(LJ_ARCH_HASFPU ? 272 : 128)
>  #define CFRAME_SHIFT_MULTRES	3
>  #endif
>  #elif LJ_TARGET_MIPS32
> diff --git a/src/lj_ircall.h b/src/lj_ircall.h
> index c1ac29d1..bbad35b1 100644
> --- a/src/lj_ircall.h
> +++ b/src/lj_ircall.h
> @@ -291,7 +291,7 @@ LJ_DATA const CCallInfo lj_ir_callinfo[IRCALL__MAX+1];
>  #define fp64_f2l __aeabi_f2lz
>  #define fp64_f2ul __aeabi_f2ulz
>  #endif
> -#elif LJ_TARGET_MIPS
> +#elif LJ_TARGET_MIPS || LJ_TARGET_PPC
>  #define softfp_add __adddf3
>  #define softfp_sub __subdf3
>  #define softfp_mul __muldf3
> diff --git a/src/vm_ppc.dasc b/src/vm_ppc.dasc
> index 7ad8df37..980ad897 100644
> --- a/src/vm_ppc.dasc
> +++ b/src/vm_ppc.dasc
> @@ -103,6 +103,18 @@
>  |// Fixed register assignments for the interpreter.
>  |// Don't use: r1 = sp, r2 and r13 = reserved (TOC, TLS or SDATA)
>  |
> +|.macro .FPU, a, b
> +|.if FPU
> +|  a, b
> +|.endif
> +|.endmacro
> +|
> +|.macro .FPU, a, b, c
> +|.if FPU
> +|  a, b, c
> +|.endif
> +|.endmacro
> +|
>  |// The following must be C callee-save (but BASE is often refetched).
>  |.define BASE,		r14	// Base of current Lua stack frame.
>  |.define KBASE,		r15	// Constants of current Lua function.
> @@ -116,8 +128,10 @@
>  |.define TISNUM,	r22
>  |.define TISNIL,	r23
>  |.define ZERO,		r24
> +|.if FPU
>  |.define TOBIT,		f30	// 2^52 + 2^51.
>  |.define TONUM,		f31	// 2^52 + 2^51 + 2^31.
> +|.endif
>  |
>  |// The following temporaries are not saved across C calls, except for RA.
>  |.define RA,		r20	// Callee-save.
> @@ -133,6 +147,7 @@
>  |
>  |// Saved temporaries.
>  |.define SAVE0,		r21
> +|.define SAVE1,		r25
>  |
>  |// Calling conventions.
>  |.define CARG1,		r3
> @@ -141,8 +156,10 @@
>  |.define CARG4,		r6	// Overlaps TMP3.
>  |.define CARG5,		r7	// Overlaps INS.
>  |
> +|.if FPU
>  |.define FARG1,		f1
>  |.define FARG2,		f2
> +|.endif
>  |
>  |.define CRET1,		r3
>  |.define CRET2,		r4
> @@ -213,10 +230,16 @@
>  |.endif
>  |.else
>  |
> +|.if FPU
>  |.define SAVE_LR,	276(sp)
>  |.define CFRAME_SPACE,	272     // Delta for sp.
>  |// Back chain for sp:	272(sp) <-- sp entering interpreter
>  |.define SAVE_FPR_,	128     // .. 128+18*8: 64 bit FPR saves.
> +|.else
> +|.define SAVE_LR,	132(sp)
> +|.define CFRAME_SPACE,	128     // Delta for sp.
> +|// Back chain for sp:	128(sp) <-- sp entering interpreter
> +|.endif
>  |.define SAVE_GPR_,	56      // .. 56+18*4: 32 bit GPR saves.
>  |.define SAVE_CR,	52(sp)  // 32 bit CR save.
>  |.define SAVE_ERRF,	48(sp)  // 32 bit C frame info.
> @@ -226,16 +249,25 @@
>  |.define SAVE_PC,	32(sp)
>  |.define SAVE_MULTRES,	28(sp)
>  |.define UNUSED1,	24(sp)
> +|.if FPU
>  |.define TMPD_LO,	20(sp)
>  |.define TMPD_HI,	16(sp)
>  |.define TONUM_LO,	12(sp)
>  |.define TONUM_HI,	8(sp)
> +|.else
> +|.define SFSAVE_4,	20(sp)
> +|.define SFSAVE_3,	16(sp)
> +|.define SFSAVE_2,	12(sp)
> +|.define SFSAVE_1,	8(sp)
> +|.endif
>  |// Next frame lr:	4(sp)
>  |// Back chain for sp:	0(sp)	<-- sp while in interpreter
>  |
> +|.if FPU
>  |.define TMPD_BLO,	23(sp)
>  |.define TMPD,		TMPD_HI
>  |.define TONUM_D,	TONUM_HI
> +|.endif
>  |
>  |.endif
>  |
> @@ -245,7 +277,7 @@
>  |.else
>  |  stw r..reg, SAVE_GPR_+(reg-14)*4(sp)
>  |.endif
> -|  stfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
> +|  .FPU stfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
>  |.endmacro
>  |.macro rest_, reg
>  |.if GPR64
> @@ -253,7 +285,7 @@
>  |.else
>  |  lwz r..reg, SAVE_GPR_+(reg-14)*4(sp)
>  |.endif
> -|  lfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
> +|  .FPU lfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
>  |.endmacro
>  |
>  |.macro saveregs
> @@ -323,6 +355,7 @@
>  |// Trap for not-yet-implemented parts.
>  |.macro NYI; tw 4, sp, sp; .endmacro
>  |
> +|.if FPU
>  |// int/FP conversions.
>  |.macro tonum_i, freg, reg
>  |  xoris reg, reg, 0x8000
> @@ -346,6 +379,7 @@
>  |.macro toint, reg, freg
>  |  toint reg, freg, freg
>  |.endmacro
> +|.endif
>  |
>  |//-----------------------------------------------------------------------
>  |
> @@ -533,9 +567,19 @@ static void build_subroutines(BuildCtx *ctx)
>    |  beq >2
>    |1:
>    |  addic. TMP1, TMP1, -8
> +  |.if FPU
>    |   lfd f0, 0(RA)
> +  |.else
> +  |   lwz CARG1, 0(RA)
> +  |   lwz CARG2, 4(RA)
> +  |.endif
>    |    addi RA, RA, 8
> +  |.if FPU
>    |   stfd f0, 0(BASE)
> +  |.else
> +  |   stw CARG1, 0(BASE)
> +  |   stw CARG2, 4(BASE)
> +  |.endif
>    |    addi BASE, BASE, 8
>    |  bney <1
>    |
> @@ -613,23 +657,23 @@ static void build_subroutines(BuildCtx *ctx)
>    |  .toc ld TOCREG, SAVE_TOC
>    |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>    |  lp BASE, L->base
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>    |   lwz DISPATCH, L->glref		// Setup pointer to dispatch table.
>    |     li ZERO, 0
> -  |     stw TMP3, TMPD
> +  |     .FPU stw TMP3, TMPD
>    |  li TMP1, LJ_TFALSE
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
>    |     li TISNIL, LJ_TNIL
>    |    li_vmstate INTERP
> -  |     lfs TOBIT, TMPD
> +  |     .FPU lfs TOBIT, TMPD
>    |  lwz PC, FRAME_PC(BASE)		// Fetch PC of previous frame.
>    |  la RA, -8(BASE)			// Results start at BASE-8.
> -  |     stw TMP3, TMPD
> +  |     .FPU stw TMP3, TMPD
>    |   addi DISPATCH, DISPATCH, GG_G2DISP
>    |  stw TMP1, 0(RA)			// Prepend false to error message.
>    |  li RD, 16				// 2 results: false + error message.
>    |    st_vmstate
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>    |  b ->vm_returnc
>    |
>    |//-----------------------------------------------------------------------
> @@ -690,22 +734,22 @@ static void build_subroutines(BuildCtx *ctx)
>    |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>    |   lp TMP1, L->top
>    |  lwz PC, FRAME_PC(BASE)
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>    |    stb CARG3, L->status
> -  |     stw TMP3, TMPD
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |     lfs TOBIT, TMPD
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU lfs TOBIT, TMPD
>    |   sub RD, TMP1, BASE
> -  |     stw TMP3, TMPD
> -  |     lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
>    |   addi RD, RD, 8
> -  |     stw TMP0, TONUM_HI
> +  |     .FPU stw TMP0, TONUM_HI
>    |    li_vmstate INTERP
>    |     li ZERO, 0
>    |    st_vmstate
>    |  andix. TMP0, PC, FRAME_TYPE
>    |   mr MULTRES, RD
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>    |     li TISNIL, LJ_TNIL
>    |  beq ->BC_RET_Z
>    |  b ->vm_return
> @@ -739,19 +783,19 @@ static void build_subroutines(BuildCtx *ctx)
>    |  lp TMP2, L->base			// TMP2 = old base (used in vmeta_call).
>    |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>    |   lp TMP1, L->top
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>    |  add PC, PC, BASE
> -  |     stw TMP3, TMPD
> +  |     .FPU stw TMP3, TMPD
>    |     li ZERO, 0
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |     lfs TOBIT, TMPD
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU lfs TOBIT, TMPD
>    |  sub PC, PC, TMP2			// PC = frame delta + frame type
> -  |     stw TMP3, TMPD
> -  |     lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
>    |   sub NARGS8:RC, TMP1, BASE
> -  |     stw TMP0, TONUM_HI
> +  |     .FPU stw TMP0, TONUM_HI
>    |    li_vmstate INTERP
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>    |     li TISNIL, LJ_TNIL
>    |    st_vmstate
>    |
> @@ -839,15 +883,30 @@ static void build_subroutines(BuildCtx *ctx)
>    |  lwz INS, -4(PC)
>    |   subi CARG2, RB, 16
>    |  decode_RB8 SAVE0, INS
> +  |.if FPU
>    |   lfd f0, 0(RA)
> +  |.else
> +  |   lwz TMP2, 0(RA)
> +  |   lwz TMP3, 4(RA)
> +  |.endif
>    |  add TMP1, BASE, SAVE0
>    |   stp BASE, L->base
>    |  cmplw TMP1, CARG2
>    |   sub CARG3, CARG2, TMP1
>    |  decode_RA8 RA, INS
> +  |.if FPU
>    |   stfd f0, 0(CARG2)
> +  |.else
> +  |   stw TMP2, 0(CARG2)
> +  |   stw TMP3, 4(CARG2)
> +  |.endif
>    |  bney ->BC_CAT_Z
> +  |.if FPU
>    |   stfdx f0, BASE, RA
> +  |.else
> +  |   stwux TMP2, RA, BASE
> +  |   stw TMP3, 4(RA)
> +  |.endif
>    |  b ->cont_nop
>    |
>    |//-- Table indexing metamethods -----------------------------------------
> @@ -900,9 +959,19 @@ static void build_subroutines(BuildCtx *ctx)
>    |  // Returns TValue * (finished) or NULL (metamethod).
>    |  cmplwi CRET1, 0
>    |  beq >3
> +  |.if FPU
>    |   lfd f0, 0(CRET1)
> +  |.else
> +  |   lwz TMP0, 0(CRET1)
> +  |   lwz TMP1, 4(CRET1)
> +  |.endif
>    |  ins_next1
> +  |.if FPU
>    |   stfdx f0, BASE, RA
> +  |.else
> +  |   stwux TMP0, RA, BASE
> +  |   stw TMP1, 4(RA)
> +  |.endif
>    |  ins_next2
>    |
>    |3:  // Call __index metamethod.
> @@ -920,7 +989,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |  // Returns cTValue * or NULL.
>    |  cmplwi CRET1, 0
>    |  beq >1
> +  |.if FPU
>    |  lfd f14, 0(CRET1)
> +  |.else
> +  |  lwz SAVE0, 0(CRET1)
> +  |  lwz SAVE1, 4(CRET1)
> +  |.endif
>    |  b ->BC_TGETR_Z
>    |1:
>    |  stwx TISNIL, BASE, RA
> @@ -975,11 +1049,21 @@ static void build_subroutines(BuildCtx *ctx)
>    |  bl extern lj_meta_tset		// (lua_State *L, TValue *o, TValue *k)
>    |  // Returns TValue * (finished) or NULL (metamethod).
>    |  cmplwi CRET1, 0
> +  |.if FPU
>    |   lfdx f0, BASE, RA
> +  |.else
> +  |   lwzux TMP2, RA, BASE
> +  |   lwz TMP3, 4(RA)
> +  |.endif
>    |  beq >3
>    |  // NOBARRIER: lj_meta_tset ensures the table is not black.
>    |  ins_next1
> +  |.if FPU
>    |   stfd f0, 0(CRET1)
> +  |.else
> +  |   stw TMP2, 0(CRET1)
> +  |   stw TMP3, 4(CRET1)
> +  |.endif
>    |  ins_next2
>    |
>    |3:  // Call __newindex metamethod.
> @@ -990,7 +1074,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |   add PC, TMP1, BASE
>    |  lwz LFUNC:RB, FRAME_FUNC(BASE)	// Guaranteed to be a function here.
>    |   li NARGS8:RC, 24			// 3 args for func(t, k, v)
> +  |.if FPU
>    |  stfd f0, 16(BASE)			// Copy value to third argument.
> +  |.else
> +  |  stw TMP2, 16(BASE)
> +  |  stw TMP3, 20(BASE)
> +  |.endif
>    |  b ->vm_call_dispatch_f
>    |
>    |->vmeta_tsetr:
> @@ -999,7 +1088,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |  stw PC, SAVE_PC
>    |  bl extern lj_tab_setinth  // (lua_State *L, GCtab *t, int32_t key)
>    |  // Returns TValue *.
> +  |.if FPU
>    |  stfd f14, 0(CRET1)
> +  |.else
> +  |  stw SAVE0, 0(CRET1)
> +  |  stw SAVE1, 4(CRET1)
> +  |.endif
>    |  b ->cont_nop
>    |
>    |//-- Comparison metamethods ---------------------------------------------
> @@ -1038,9 +1132,19 @@ static void build_subroutines(BuildCtx *ctx)
>    |
>    |->cont_ra:				// RA = resultptr
>    |  lwz INS, -4(PC)
> +  |.if FPU
>    |   lfd f0, 0(RA)
> +  |.else
> +  |   lwz CARG1, 0(RA)
> +  |   lwz CARG2, 4(RA)
> +  |.endif
>    |  decode_RA8 TMP1, INS
> +  |.if FPU
>    |   stfdx f0, BASE, TMP1
> +  |.else
> +  |   stwux CARG1, TMP1, BASE
> +  |   stw CARG2, 4(TMP1)
> +  |.endif
>    |  b ->cont_nop
>    |
>    |->cont_condt:			// RA = resultptr
> @@ -1246,22 +1350,32 @@ static void build_subroutines(BuildCtx *ctx)
>    |.macro .ffunc_n, name
>    |->ff_ .. name:
>    |  cmplwi NARGS8:RC, 8
> -  |   lwz CARG3, 0(BASE)
> +  |   lwz CARG1, 0(BASE)
> +  |.if FPU
>    |    lfd FARG1, 0(BASE)
> +  |.else
> +  |    lwz CARG2, 4(BASE)
> +  |.endif
>    |  blt ->fff_fallback
> -  |  checknum CARG3; bge ->fff_fallback
> +  |  checknum CARG1; bge ->fff_fallback
>    |.endmacro
>    |
>    |.macro .ffunc_nn, name
>    |->ff_ .. name:
>    |  cmplwi NARGS8:RC, 16
> -  |   lwz CARG3, 0(BASE)
> +  |   lwz CARG1, 0(BASE)
> +  |.if FPU
>    |    lfd FARG1, 0(BASE)
> -  |   lwz CARG4, 8(BASE)
> +  |   lwz CARG3, 8(BASE)
>    |    lfd FARG2, 8(BASE)
> +  |.else
> +  |    lwz CARG2, 4(BASE)
> +  |   lwz CARG3, 8(BASE)
> +  |    lwz CARG4, 12(BASE)
> +  |.endif
>    |  blt ->fff_fallback
> +  |  checknum CARG1; bge ->fff_fallback
>    |  checknum CARG3; bge ->fff_fallback
> -  |  checknum CARG4; bge ->fff_fallback
>    |.endmacro
>    |
>    |// Inlined GC threshold check. Caveat: uses TMP0 and TMP1.
> @@ -1282,14 +1396,21 @@ static void build_subroutines(BuildCtx *ctx)
>    |  bge cr1, ->fff_fallback
>    |   stw CARG3, 0(RA)
>    |  addi RD, NARGS8:RC, 8		// Compute (nresults+1)*8.
> +  |  addi TMP1, BASE, 8
> +  |  add TMP2, RA, NARGS8:RC
>    |   stw CARG1, 4(RA)
>    |  beq ->fff_res			// Done if exactly 1 argument.
> -  |  li TMP1, 8
> -  |  subi RC, RC, 8
>    |1:
> -  |  cmplw TMP1, RC
> -  |   lfdx f0, BASE, TMP1
> -  |   stfdx f0, RA, TMP1
> +  |  cmplw TMP1, TMP2
> +  |.if FPU
> +  |   lfd f0, 0(TMP1)
> +  |   stfd f0, 0(TMP1)
> +  |.else
> +  |   lwz CARG1, 0(TMP1)
> +  |   lwz CARG2, 4(TMP1)
> +  |   stw CARG1, -8(TMP1)
> +  |   stw CARG2, -4(TMP1)
> +  |.endif
>    |    addi TMP1, TMP1, 8
>    |  bney <1
>    |  b ->fff_res
> @@ -1304,8 +1425,14 @@ static void build_subroutines(BuildCtx *ctx)
>    |  orc TMP1, TMP2, TMP0
>    |  addi TMP1, TMP1, ~LJ_TISNUM+1
>    |  slwi TMP1, TMP1, 3
> +  |.if FPU
>    |   la TMP2, CFUNC:RB->upvalue
>    |  lfdx FARG1, TMP2, TMP1
> +  |.else
> +  |  add TMP1, CFUNC:RB, TMP1
> +  |  lwz CARG1, CFUNC:TMP1->upvalue[0].u32.hi
> +  |  lwz CARG2, CFUNC:TMP1->upvalue[0].u32.lo
> +  |.endif
>    |  b ->fff_resn
>    |
>    |//-- Base library: getters and setters ---------------------------------
> @@ -1383,7 +1510,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |   mr CARG1, L
>    |  bl extern lj_tab_get  // (lua_State *L, GCtab *t, cTValue *key)
>    |  // Returns cTValue *.
> +  |.if FPU
>    |  lfd FARG1, 0(CRET1)
> +  |.else
> +  |  lwz CARG2, 4(CRET1)
> +  |  lwz CARG1, 0(CRET1)	// Caveat: CARG1 == CRET1.
> +  |.endif
>    |  b ->fff_resn
>    |
>    |//-- Base library: conversions ------------------------------------------
> @@ -1392,7 +1524,11 @@ static void build_subroutines(BuildCtx *ctx)
>    |  // Only handles the number case inline (without a base argument).
>    |  cmplwi NARGS8:RC, 8
>    |   lwz CARG1, 0(BASE)
> +  |.if FPU
>    |    lfd FARG1, 0(BASE)
> +  |.else
> +  |    lwz CARG2, 4(BASE)
> +  |.endif
>    |  bne ->fff_fallback			// Exactly one argument.
>    |   checknum CARG1; bgt ->fff_fallback
>    |  b ->fff_resn
> @@ -1443,12 +1579,23 @@ static void build_subroutines(BuildCtx *ctx)
>    |  cmplwi CRET1, 0
>    |   li CARG3, LJ_TNIL
>    |  beq ->fff_restv			// End of traversal: return nil.
> -  |  lfd f0, 8(BASE)			// Copy key and value to results.
>    |   la RA, -8(BASE)
> +  |.if FPU
> +  |  lfd f0, 8(BASE)			// Copy key and value to results.
>    |  lfd f1, 16(BASE)
>    |  stfd f0, 0(RA)
> -  |   li RD, (2+1)*8
>    |  stfd f1, 8(RA)
> +  |.else
> +  |  lwz CARG1, 8(BASE)
> +  |  lwz CARG2, 12(BASE)
> +  |  lwz CARG3, 16(BASE)
> +  |  lwz CARG4, 20(BASE)
> +  |  stw CARG1, 0(RA)
> +  |  stw CARG2, 4(RA)
> +  |  stw CARG3, 8(RA)
> +  |  stw CARG4, 12(RA)
> +  |.endif
> +  |   li RD, (2+1)*8
>    |  b ->fff_res
>    |
>    |.ffunc_1 pairs
> @@ -1457,17 +1604,32 @@ static void build_subroutines(BuildCtx *ctx)
>    |  bne ->fff_fallback
>  #if LJ_52
>    |   lwz TAB:TMP2, TAB:CARG1->metatable
> +  |.if FPU
>    |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>    |   cmplwi TAB:TMP2, 0
>    |  la RA, -8(BASE)
>    |   bne ->fff_fallback
>  #else
> +  |.if FPU
>    |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>    |  la RA, -8(BASE)
>  #endif
>    |   stw TISNIL, 8(BASE)
>    |  li RD, (3+1)*8
> +  |.if FPU
>    |  stfd f0, 0(RA)
> +  |.else
> +  |  stw TMP0, 0(RA)
> +  |  stw TMP1, 4(RA)
> +  |.endif
>    |  b ->fff_res
>    |
>    |.ffunc ipairs_aux
> @@ -1513,14 +1675,24 @@ static void build_subroutines(BuildCtx *ctx)
>    |  stfd FARG2, 0(RA)
>    |.endif
>    |  ble >2				// Not in array part?
> +  |.if FPU
>    |  lwzx TMP2, TMP1, TMP3
>    |  lfdx f0, TMP1, TMP3
> +  |.else
> +  |  lwzux TMP2, TMP1, TMP3
> +  |  lwz TMP3, 4(TMP1)
> +  |.endif
>    |1:
>    |  checknil TMP2
>    |   li RD, (0+1)*8
>    |  beq ->fff_res			// End of iteration, return 0 results.
>    |   li RD, (2+1)*8
> +  |.if FPU
>    |  stfd f0, 8(RA)
> +  |.else
> +  |  stw TMP2, 8(RA)
> +  |  stw TMP3, 12(RA)
> +  |.endif
>    |  b ->fff_res
>    |2:  // Check for empty hash part first. Otherwise call C function.
>    |  lwz TMP0, TAB:CARG1->hmask
> @@ -1534,7 +1706,11 @@ static void build_subroutines(BuildCtx *ctx)
>    |   li RD, (0+1)*8
>    |  beq ->fff_res
>    |  lwz TMP2, 0(CRET1)
> +  |.if FPU
>    |  lfd f0, 0(CRET1)
> +  |.else
> +  |  lwz TMP3, 4(CRET1)
> +  |.endif
>    |  b <1
>    |
>    |.ffunc_1 ipairs
> @@ -1543,12 +1719,22 @@ static void build_subroutines(BuildCtx *ctx)
>    |  bne ->fff_fallback
>  #if LJ_52
>    |   lwz TAB:TMP2, TAB:CARG1->metatable
> +  |.if FPU
>    |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>    |   cmplwi TAB:TMP2, 0
>    |  la RA, -8(BASE)
>    |   bne ->fff_fallback
>  #else
> +  |.if FPU
>    |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>    |  la RA, -8(BASE)
>  #endif
>    |.if DUALNUM
> @@ -1558,7 +1744,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |.endif
>    |   stw ZERO, 12(BASE)
>    |  li RD, (3+1)*8
> +  |.if FPU
>    |  stfd f0, 0(RA)
> +  |.else
> +  |  stw TMP0, 0(RA)
> +  |  stw TMP1, 4(RA)
> +  |.endif
>    |  b ->fff_res
>    |
>    |//-- Base library: catch errors ----------------------------------------
> @@ -1577,19 +1768,32 @@ static void build_subroutines(BuildCtx *ctx)
>    |
>    |.ffunc xpcall
>    |  cmplwi NARGS8:RC, 16
> -  |   lwz CARG4, 8(BASE)
> +  |   lwz CARG3, 8(BASE)
> +  |.if FPU
>    |    lfd FARG2, 8(BASE)
>    |    lfd FARG1, 0(BASE)
> +  |.else
> +  |    lwz CARG1, 0(BASE)
> +  |    lwz CARG2, 4(BASE)
> +  |    lwz CARG4, 12(BASE)
> +  |.endif
>    |  blt ->fff_fallback
>    |  lbz TMP1, DISPATCH_GL(hookmask)(DISPATCH)
>    |   mr TMP2, BASE
> -  |  checkfunc CARG4; bne ->fff_fallback  // Traceback must be a function.
> +  |  checkfunc CARG3; bne ->fff_fallback  // Traceback must be a function.
>    |   la BASE, 16(BASE)
>    |  // Remember active hook before pcall.
>    |  rlwinm TMP1, TMP1, 32-HOOK_ACTIVE_SHIFT, 31, 31
> +  |.if FPU
>    |    stfd FARG2, 0(TMP2)		// Swap function and traceback.
> -  |  subi NARGS8:RC, NARGS8:RC, 16
>    |    stfd FARG1, 8(TMP2)
> +  |.else
> +  |    stw CARG3, 0(TMP2)
> +  |    stw CARG4, 4(TMP2)
> +  |    stw CARG1, 8(TMP2)
> +  |    stw CARG2, 12(TMP2)
> +  |.endif
> +  |  subi NARGS8:RC, NARGS8:RC, 16
>    |  addi PC, TMP1, 16+FRAME_PCALL
>    |  b ->vm_call_dispatch
>    |
> @@ -1632,9 +1836,21 @@ static void build_subroutines(BuildCtx *ctx)
>    |  stp BASE, L->top
>    |2:  // Move args to coroutine.
>    |  cmpw TMP1, NARGS8:RC
> +  |.if FPU
>    |   lfdx f0, BASE, TMP1
> +  |.else
> +  |   add CARG3, BASE, TMP1
> +  |   lwz TMP2, 0(CARG3)
> +  |   lwz TMP3, 4(CARG3)
> +  |.endif
>    |  beq >3
> +  |.if FPU
>    |   stfdx f0, CARG2, TMP1
> +  |.else
> +  |   add CARG3, CARG2, TMP1
> +  |   stw TMP2, 0(CARG3)
> +  |   stw TMP3, 4(CARG3)
> +  |.endif
>    |  addi TMP1, TMP1, 8
>    |  b <2
>    |3:
> @@ -1665,8 +1881,17 @@ static void build_subroutines(BuildCtx *ctx)
>    |   stp TMP2, L:SAVE0->top		// Clear coroutine stack.
>    |5:  // Move results from coroutine.
>    |  cmplw TMP1, TMP3
> +  |.if FPU
>    |   lfdx f0, TMP2, TMP1
>    |   stfdx f0, BASE, TMP1
> +  |.else
> +  |   add CARG3, TMP2, TMP1
> +  |   lwz CARG1, 0(CARG3)
> +  |   lwz CARG2, 4(CARG3)
> +  |   add CARG3, BASE, TMP1
> +  |   stw CARG1, 0(CARG3)
> +  |   stw CARG2, 4(CARG3)
> +  |.endif
>    |    addi TMP1, TMP1, 8
>    |  bne <5
>    |6:
> @@ -1691,12 +1916,22 @@ static void build_subroutines(BuildCtx *ctx)
>    |  andix. TMP0, PC, FRAME_TYPE
>    |  la TMP3, -8(TMP3)
>    |   li TMP1, LJ_TFALSE
> +  |.if FPU
>    |  lfd f0, 0(TMP3)
> +  |.else
> +  |  lwz CARG1, 0(TMP3)
> +  |  lwz CARG2, 4(TMP3)
> +  |.endif
>    |   stp TMP3, L:SAVE0->top		// Remove error from coroutine stack.
>    |    li RD, (2+1)*8
>    |   stw TMP1, -8(BASE)		// Prepend false to results.
>    |    la RA, -8(BASE)
> +  |.if FPU
>    |  stfd f0, 0(BASE)			// Copy error message.
> +  |.else
> +  |  stw CARG1, 0(BASE)			// Copy error message.
> +  |  stw CARG2, 4(BASE)
> +  |.endif
>    |  b <7
>    |.else
>    |  mr CARG1, L
> @@ -1875,7 +2110,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |  lus CARG1, 0x8000			// -(2^31).
>    |  beqy ->fff_resi
>    |5:
> +  |.if FPU
>    |  lfd FARG1, 0(BASE)
> +  |.else
> +  |  lwz CARG1, 0(BASE)
> +  |  lwz CARG2, 4(BASE)
> +  |.endif
>    |  blex func
>    |  b ->fff_resn
>    |.endmacro
> @@ -1899,10 +2139,14 @@ static void build_subroutines(BuildCtx *ctx)
>    |
>    |.ffunc math_log
>    |  cmplwi NARGS8:RC, 8
> -  |   lwz CARG3, 0(BASE)
> -  |    lfd FARG1, 0(BASE)
> +  |   lwz CARG1, 0(BASE)
>    |  bne ->fff_fallback			// Need exactly 1 argument.
> -  |  checknum CARG3; bge ->fff_fallback
> +  |  checknum CARG1; bge ->fff_fallback
> +  |.if FPU
> +  |  lfd FARG1, 0(BASE)
> +  |.else
> +  |  lwz CARG2, 4(BASE)
> +  |.endif
>    |  blex log
>    |  b ->fff_resn
>    |
> @@ -1924,17 +2168,24 @@ static void build_subroutines(BuildCtx *ctx)
>    |.if DUALNUM
>    |.ffunc math_ldexp
>    |  cmplwi NARGS8:RC, 16
> -  |   lwz CARG3, 0(BASE)
> +  |   lwz TMP0, 0(BASE)
> +  |.if FPU
>    |    lfd FARG1, 0(BASE)
> -  |   lwz CARG4, 8(BASE)
> +  |.else
> +  |    lwz CARG1, 0(BASE)
> +  |    lwz CARG2, 4(BASE)
> +  |.endif
> +  |   lwz TMP1, 8(BASE)
>    |.if GPR64
>    |    lwz CARG2, 12(BASE)
> -  |.else
> +  |.elif FPU
>    |    lwz CARG1, 12(BASE)
> +  |.else
> +  |    lwz CARG3, 12(BASE)
>    |.endif
>    |  blt ->fff_fallback
> -  |  checknum CARG3; bge ->fff_fallback
> -  |  checknum CARG4; bne ->fff_fallback
> +  |  checknum TMP0; bge ->fff_fallback
> +  |  checknum TMP1; bne ->fff_fallback
>    |.else
>    |.ffunc_nn math_ldexp
>    |.if GPR64
> @@ -1949,8 +2200,10 @@ static void build_subroutines(BuildCtx *ctx)
>    |.ffunc_n math_frexp
>    |.if GPR64
>    |  la CARG2, DISPATCH_GL(tmptv)(DISPATCH)
> -  |.else
> +  |.elif FPU
>    |  la CARG1, DISPATCH_GL(tmptv)(DISPATCH)
> +  |.else
> +  |  la CARG3, DISPATCH_GL(tmptv)(DISPATCH)
>    |.endif
>    |   lwz PC, FRAME_PC(BASE)
>    |  blex frexp
> @@ -1959,7 +2212,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |.if not DUALNUM
>    |   tonum_i FARG2, TMP1
>    |.endif
> +  |.if FPU
>    |  stfd FARG1, 0(RA)
> +  |.else
> +  |  stw CRET1, 0(RA)
> +  |  stw CRET2, 4(RA)
> +  |.endif
>    |  li RD, (2+1)*8
>    |.if DUALNUM
>    |   stw TISNUM, 8(RA)
> @@ -1972,13 +2230,20 @@ static void build_subroutines(BuildCtx *ctx)
>    |.ffunc_n math_modf
>    |.if GPR64
>    |  la CARG2, -8(BASE)
> -  |.else
> +  |.elif FPU
>    |  la CARG1, -8(BASE)
> +  |.else
> +  |  la CARG3, -8(BASE)
>    |.endif
>    |   lwz PC, FRAME_PC(BASE)
>    |  blex modf
>    |   la RA, -8(BASE)
> +  |.if FPU
>    |  stfd FARG1, 0(BASE)
> +  |.else
> +  |  stw CRET1, 0(BASE)
> +  |  stw CRET2, 4(BASE)
> +  |.endif
>    |  li RD, (2+1)*8
>    |  b ->fff_res
>    |
> @@ -1986,13 +2251,13 @@ static void build_subroutines(BuildCtx *ctx)
>    |.if DUALNUM
>    |  .ffunc_1 name
>    |  checknum CARG3
> -  |   addi TMP1, BASE, 8
> -  |   add TMP2, BASE, NARGS8:RC
> +  |   addi SAVE0, BASE, 8
> +  |   add SAVE1, BASE, NARGS8:RC
>    |  bne >4
>    |1:  // Handle integers.
> -  |  lwz CARG4, 0(TMP1)
> -  |   cmplw cr1, TMP1, TMP2
> -  |  lwz CARG2, 4(TMP1)
> +  |  lwz CARG4, 0(SAVE0)
> +  |   cmplw cr1, SAVE0, SAVE1
> +  |  lwz CARG2, 4(SAVE0)
>    |   bge cr1, ->fff_resi
>    |  checknum CARG4
>    |   xoris TMP0, CARG1, 0x8000
> @@ -2009,36 +2274,76 @@ static void build_subroutines(BuildCtx *ctx)
>    |.if GPR64
>    |  rldicl CARG1, CARG1, 0, 32
>    |.endif
> -  |   addi TMP1, TMP1, 8
> +  |   addi SAVE0, SAVE0, 8
>    |  b <1
>    |3:
>    |  bge ->fff_fallback
>    |  // Convert intermediate result to number and continue below.
> +  |.if FPU
>    |  tonum_i FARG1, CARG1
> -  |  lfd FARG2, 0(TMP1)
> +  |  lfd FARG2, 0(SAVE0)
> +  |.else
> +  |  mr CARG2, CARG1
> +  |  bl ->vm_sfi2d_1
> +  |  lwz CARG3, 0(SAVE0)
> +  |  lwz CARG4, 4(SAVE0)
> +  |.endif
>    |  b >6
>    |4:
> +  |.if FPU
>    |   lfd FARG1, 0(BASE)
> +  |.else
> +  |   lwz CARG1, 0(BASE)
> +  |   lwz CARG2, 4(BASE)
> +  |.endif
>    |  bge ->fff_fallback
>    |5:  // Handle numbers.
> -  |  lwz CARG4, 0(TMP1)
> -  |   cmplw cr1, TMP1, TMP2
> -  |  lfd FARG2, 0(TMP1)
> +  |  lwz CARG3, 0(SAVE0)
> +  |   cmplw cr1, SAVE0, SAVE1
> +  |.if FPU
> +  |  lfd FARG2, 0(SAVE0)
> +  |.else
> +  |  lwz CARG4, 4(SAVE0)
> +  |.endif
>    |   bge cr1, ->fff_resn
> -  |  checknum CARG4; bge >7
> +  |  checknum CARG3; bge >7
>    |6:
> +  |   addi SAVE0, SAVE0, 8
> +  |.if FPU
>    |  fsub f0, FARG1, FARG2
> -  |   addi TMP1, TMP1, 8
>    |.if ismax
>    |  fsel FARG1, f0, FARG1, FARG2
>    |.else
>    |  fsel FARG1, f0, FARG2, FARG1
>    |.endif
> +  |.else
> +  |  stw CARG1, SFSAVE_1
> +  |  stw CARG2, SFSAVE_2
> +  |  stw CARG3, SFSAVE_3
> +  |  stw CARG4, SFSAVE_4
> +  |  blex __ledf2
> +  |  cmpwi CRET1, 0
> +  |.if ismax
> +  |  blt >8
> +  |.else
> +  |  bge >8
> +  |.endif
> +  |  lwz CARG1, SFSAVE_1
> +  |  lwz CARG2, SFSAVE_2
> +  |  b <5
> +  |8:
> +  |  lwz CARG1, SFSAVE_3
> +  |  lwz CARG2, SFSAVE_4
> +  |.endif
>    |  b <5
>    |7:  // Convert integer to number and continue above.
> -  |   lwz CARG2, 4(TMP1)
> +  |   lwz CARG3, 4(SAVE0)
>    |  bne ->fff_fallback
> -  |  tonum_i FARG2, CARG2
> +  |.if FPU
> +  |  tonum_i FARG2, CARG3
> +  |.else
> +  |  bl ->vm_sfi2d_2
> +  |.endif
>    |  b <6
>    |.else
>    |  .ffunc_n name
> @@ -2238,28 +2543,37 @@ static void build_subroutines(BuildCtx *ctx)
>    |
>    |.macro .ffunc_bit_op, name, ins
>    |  .ffunc_bit name
> -  |  addi TMP1, BASE, 8
> -  |  add TMP2, BASE, NARGS8:RC
> +  |  addi SAVE0, BASE, 8
> +  |  add SAVE1, BASE, NARGS8:RC
>    |1:
> -  |  lwz CARG4, 0(TMP1)
> -  |   cmplw cr1, TMP1, TMP2
> +  |  lwz CARG4, 0(SAVE0)
> +  |   cmplw cr1, SAVE0, SAVE1
>    |.if DUALNUM
> -  |  lwz CARG2, 4(TMP1)
> +  |  lwz CARG2, 4(SAVE0)
>    |.else
> -  |  lfd FARG1, 0(TMP1)
> +  |  lfd FARG1, 0(SAVE0)
>    |.endif
>    |   bgey cr1, ->fff_resi
>    |  checknum CARG4
>    |.if DUALNUM
> +  |.if FPU
>    |  bnel ->fff_bitop_fb
>    |.else
> +  |  beq >3
> +  |  stw CARG1, SFSAVE_1
> +  |  bl ->fff_bitop_fb
> +  |  mr CARG2, CARG1
> +  |  lwz CARG1, SFSAVE_1
> +  |3:
> +  |.endif
> +  |.else
>    |  fadd FARG1, FARG1, TOBIT
>    |  bge ->fff_fallback
>    |  stfd FARG1, TMPD
>    |  lwz CARG2, TMPD_LO
>    |.endif
>    |  ins CARG1, CARG1, CARG2
> -  |   addi TMP1, TMP1, 8
> +  |   addi SAVE0, SAVE0, 8
>    |  b <1
>    |.endmacro
>    |
> @@ -2281,7 +2595,14 @@ static void build_subroutines(BuildCtx *ctx)
>    |.macro .ffunc_bit_sh, name, ins, shmod
>    |.if DUALNUM
>    |  .ffunc_2 bit_..name
> +  |.if FPU
>    |  checknum CARG3; bnel ->fff_tobit_fb
> +  |.else
> +  |  checknum CARG3; beq >1
> +  |  bl ->fff_tobit_fb
> +  |  lwz CARG2, 12(BASE)	// Conversion polluted CARG2.
> +  |1:
> +  |.endif
>    |  // Note: no inline conversion from number for 2nd argument!
>    |  checknum CARG4; bne ->fff_fallback
>    |.else
> @@ -2318,27 +2639,77 @@ static void build_subroutines(BuildCtx *ctx)
>    |->fff_resn:
>    |  lwz PC, FRAME_PC(BASE)
>    |  la RA, -8(BASE)
> +  |.if FPU
>    |  stfd FARG1, -8(BASE)
> +  |.else
> +  |  stw CARG1, -8(BASE)
> +  |  stw CARG2, -4(BASE)
> +  |.endif
>    |  b ->fff_res1
>    |
>    |// Fallback FP number to bit conversion.
>    |->fff_tobit_fb:
>    |.if DUALNUM
> +  |.if FPU
>    |  lfd FARG1, 0(BASE)
>    |  bgt ->fff_fallback
>    |  fadd FARG1, FARG1, TOBIT
>    |  stfd FARG1, TMPD
>    |  lwz CARG1, TMPD_LO
>    |  blr
> +  |.else
> +  |  bgt ->fff_fallback
> +  |  mr CARG2, CARG1
> +  |  mr CARG1, CARG3
> +  |// Modifies: CARG1, CARG2, TMP0, TMP1, TMP2.
> +  |->vm_tobit:
> +  |  slwi TMP2, CARG1, 1
> +  |  addis TMP2, TMP2, 0x0020
> +  |  cmpwi TMP2, 0
> +  |  bge >2
> +  |   li TMP1, 0x3e0
> +  |  srawi TMP2, TMP2, 21
> +  |   not TMP1, TMP1
> +  |  sub. TMP2, TMP1, TMP2
> +  |    cmpwi cr7, CARG1, 0
> +  |  blt >1
> +  |   slwi TMP1, CARG1, 11
> +  |    srwi TMP0, CARG2, 21
> +  |   oris TMP1, TMP1, 0x8000
> +  |   or TMP1, TMP1, TMP0
> +  |   srw CARG1, TMP1, TMP2
> +  |  bclr 4, 28			// Return if cr7[lt] == 0, no hint.
> +  |   neg CARG1, CARG1
> +  |  blr
> +  |1:
> +  |  addi TMP2, TMP2, 21
> +  |  srw TMP1, CARG2, TMP2
> +  |   slwi CARG2, CARG1, 12
> +  |  subfic TMP2, TMP2, 20
> +  |   slw TMP0, CARG2, TMP2
> +  |   or CARG1, TMP1, TMP0
> +  |  bclr 4, 28			// Return if cr7[lt] == 0, no hint.
> +  |   neg CARG1, CARG1
> +  |  blr
> +  |2:
> +  |  li CARG1, 0
> +  |  blr
> +  |.endif
>    |.endif
>    |->fff_bitop_fb:
>    |.if DUALNUM
> -  |  lfd FARG1, 0(TMP1)
> +  |.if FPU
> +  |  lfd FARG1, 0(SAVE0)
>    |  bgt ->fff_fallback
>    |  fadd FARG1, FARG1, TOBIT
>    |  stfd FARG1, TMPD
>    |  lwz CARG2, TMPD_LO
>    |  blr
> +  |.else
> +  |  bgt ->fff_fallback
> +  |  mr CARG1, CARG4
> +  |  b ->vm_tobit
> +  |.endif
>    |.endif
>    |
>    |//-----------------------------------------------------------------------
> @@ -2531,10 +2902,21 @@ static void build_subroutines(BuildCtx *ctx)
>    |  decode_RA8 RC, INS			// Call base.
>    |   beq >2
>    |1:  // Move results down.
> +  |.if FPU
>    |  lfd f0, 0(RA)
> +  |.else
> +  |  lwz CARG1, 0(RA)
> +  |  lwz CARG2, 4(RA)
> +  |.endif
>    |   addic. TMP1, TMP1, -8
>    |    addi RA, RA, 8
> +  |.if FPU
>    |  stfdx f0, BASE, RC
> +  |.else
> +  |  add CARG3, BASE, RC
> +  |  stw CARG1, 0(CARG3)
> +  |  stw CARG2, 4(CARG3)
> +  |.endif
>    |    addi RC, RC, 8
>    |   bne <1
>    |2:
> @@ -2587,10 +2969,12 @@ static void build_subroutines(BuildCtx *ctx)
>    |//-----------------------------------------------------------------------
>    |
>    |.macro savex_, a, b, c, d
> +  |.if FPU
>    |  stfd f..a, 16+a*8(sp)
>    |  stfd f..b, 16+b*8(sp)
>    |  stfd f..c, 16+c*8(sp)
>    |  stfd f..d, 16+d*8(sp)
> +  |.endif
>    |.endmacro
>    |
>    |->vm_exit_handler:
> @@ -2662,16 +3046,16 @@ static void build_subroutines(BuildCtx *ctx)
>    |  lwz KBASE, PC2PROTO(k)(TMP1)
>    |  // Setup type comparison constants.
>    |  li TISNUM, LJ_TISNUM
> -  |  lus TMP3, 0x59c0			// TOBIT = 2^52 + 2^51 (float).
> -  |  stw TMP3, TMPD
> +  |  .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |  .FPU stw TMP3, TMPD
>    |  li ZERO, 0
> -  |  ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |  lfs TOBIT, TMPD
> -  |  stw TMP3, TMPD
> -  |  lus TMP0, 0x4338			// Hiword of 2^52 + 2^51 (double)
> +  |  .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |  .FPU lfs TOBIT, TMPD
> +  |  .FPU stw TMP3, TMPD
> +  |  .FPU lus TMP0, 0x4338			// Hiword of 2^52 + 2^51 (double)
>    |    li TISNIL, LJ_TNIL
> -  |  stw TMP0, TONUM_HI
> -  |  lfs TONUM, TMPD
> +  |  .FPU stw TMP0, TONUM_HI
> +  |  .FPU lfs TONUM, TMPD
>    |  // Modified copy of ins_next which handles function header dispatch, too.
>    |  lwz INS, 0(PC)
>    |   addi PC, PC, 4
> @@ -2716,7 +3100,35 @@ static void build_subroutines(BuildCtx *ctx)
>    |//-- Math helper functions ----------------------------------------------
>    |//-----------------------------------------------------------------------
>    |
> -  |// NYI: Use internal implementations of floor, ceil, trunc.
> +  |// NYI: Use internal implementations of floor, ceil, trunc, sfcmp.
> +  |
> +  |.macro sfi2d, AHI, ALO
> +  |.if not FPU
> +  |  mr. AHI, ALO
> +  |  bclr 12, 2				// Handle zero first.
> +  |  srawi TMP0, ALO, 31
> +  |  xor TMP1, ALO, TMP0
> +  |  sub TMP1, TMP1, TMP0		// Absolute value in TMP1.
> +  |  cntlzw AHI, TMP1
> +  |  andix. TMP0, TMP0, 0x800		// Mask sign bit.
> +  |  slw TMP1, TMP1, AHI		// Align mantissa left with leading 1.
> +  |  subfic AHI, AHI, 0x3ff+31-1	// Exponent -1 in AHI.
> +  |  slwi ALO, TMP1, 21
> +  |  or AHI, AHI, TMP0			// Sign | Exponent.
> +  |  srwi TMP1, TMP1, 11
> +  |  slwi AHI, AHI, 20			// Align left.
> +  |  add AHI, AHI, TMP1			// Add mantissa, increment exponent.
> +  |  blr
> +  |.endif
> +  |.endmacro
> +  |
> +  |// Input: CARG2. Output: CARG1, CARG2. Temporaries: TMP0, TMP1.
> +  |->vm_sfi2d_1:
> +  |  sfi2d CARG1, CARG2
> +  |
> +  |// Input: CARG4. Output: CARG3, CARG4. Temporaries: TMP0, TMP1.
> +  |->vm_sfi2d_2:
> +  |  sfi2d CARG3, CARG4
>    |
>    |->vm_modi:
>    |  divwo. TMP0, CARG1, CARG2
> @@ -2784,21 +3196,21 @@ static void build_subroutines(BuildCtx *ctx)
>    |   addi DISPATCH, r12, GG_G2DISP
>    |  stw r11, CTSTATE->cb.slot
>    |  stw r3, CTSTATE->cb.gpr[0]
> -  |   stfd f1, CTSTATE->cb.fpr[0]
> +  |   .FPU stfd f1, CTSTATE->cb.fpr[0]
>    |  stw r4, CTSTATE->cb.gpr[1]
> -  |   stfd f2, CTSTATE->cb.fpr[1]
> +  |   .FPU stfd f2, CTSTATE->cb.fpr[1]
>    |  stw r5, CTSTATE->cb.gpr[2]
> -  |   stfd f3, CTSTATE->cb.fpr[2]
> +  |   .FPU stfd f3, CTSTATE->cb.fpr[2]
>    |  stw r6, CTSTATE->cb.gpr[3]
> -  |   stfd f4, CTSTATE->cb.fpr[3]
> +  |   .FPU stfd f4, CTSTATE->cb.fpr[3]
>    |  stw r7, CTSTATE->cb.gpr[4]
> -  |   stfd f5, CTSTATE->cb.fpr[4]
> +  |   .FPU stfd f5, CTSTATE->cb.fpr[4]
>    |  stw r8, CTSTATE->cb.gpr[5]
> -  |   stfd f6, CTSTATE->cb.fpr[5]
> +  |   .FPU stfd f6, CTSTATE->cb.fpr[5]
>    |  stw r9, CTSTATE->cb.gpr[6]
> -  |   stfd f7, CTSTATE->cb.fpr[6]
> +  |   .FPU stfd f7, CTSTATE->cb.fpr[6]
>    |  stw r10, CTSTATE->cb.gpr[7]
> -  |   stfd f8, CTSTATE->cb.fpr[7]
> +  |   .FPU stfd f8, CTSTATE->cb.fpr[7]
>    |  addi TMP0, sp, CFRAME_SPACE+8
>    |  stw TMP0, CTSTATE->cb.stack
>    |   mr CARG1, CTSTATE
> @@ -2809,21 +3221,21 @@ static void build_subroutines(BuildCtx *ctx)
>    |  lp BASE, L:CRET1->base
>    |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>    |  lp RC, L:CRET1->top
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>    |     li ZERO, 0
>    |   mr L, CRET1
> -  |     stw TMP3, TMPD
> -  |     lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
>    |  lwz LFUNC:RB, FRAME_FUNC(BASE)
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |     stw TMP0, TONUM_HI
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU stw TMP0, TONUM_HI
>    |     li TISNIL, LJ_TNIL
>    |    li_vmstate INTERP
> -  |     lfs TOBIT, TMPD
> -  |     stw TMP3, TMPD
> +  |     .FPU lfs TOBIT, TMPD
> +  |     .FPU stw TMP3, TMPD
>    |  sub RC, RC, BASE
>    |    st_vmstate
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>    |  ins_callt
>    |.endif
>    |
> @@ -2837,7 +3249,7 @@ static void build_subroutines(BuildCtx *ctx)
>    |  mr CARG2, RA
>    |  bl extern lj_ccallback_leave	// (CTState *cts, TValue *o)
>    |  lwz CRET1, CTSTATE->cb.gpr[0]
> -  |  lfd FARG1, CTSTATE->cb.fpr[0]
> +  |  .FPU lfd FARG1, CTSTATE->cb.fpr[0]
>    |  lwz CRET2, CTSTATE->cb.gpr[1]
>    |  b ->vm_leave_unw
>    |.endif
> @@ -2871,14 +3283,14 @@ static void build_subroutines(BuildCtx *ctx)
>    |  bge <1
>    |2:
>    |  bney cr1, >3
> -  |  lfd f1, CCSTATE->fpr[0]
> -  |  lfd f2, CCSTATE->fpr[1]
> -  |  lfd f3, CCSTATE->fpr[2]
> -  |  lfd f4, CCSTATE->fpr[3]
> -  |  lfd f5, CCSTATE->fpr[4]
> -  |  lfd f6, CCSTATE->fpr[5]
> -  |  lfd f7, CCSTATE->fpr[6]
> -  |  lfd f8, CCSTATE->fpr[7]
> +  |  .FPU lfd f1, CCSTATE->fpr[0]
> +  |  .FPU lfd f2, CCSTATE->fpr[1]
> +  |  .FPU lfd f3, CCSTATE->fpr[2]
> +  |  .FPU lfd f4, CCSTATE->fpr[3]
> +  |  .FPU lfd f5, CCSTATE->fpr[4]
> +  |  .FPU lfd f6, CCSTATE->fpr[5]
> +  |  .FPU lfd f7, CCSTATE->fpr[6]
> +  |  .FPU lfd f8, CCSTATE->fpr[7]
>    |3:
>    |   lp TMP0, CCSTATE->func
>    |  lwz CARG2, CCSTATE->gpr[1]
> @@ -2895,7 +3307,7 @@ static void build_subroutines(BuildCtx *ctx)
>    |  lwz TMP2, -4(r14)
>    |   lwz TMP0, 4(r14)
>    |  stw CARG1, CCSTATE:TMP1->gpr[0]
> -  |  stfd FARG1, CCSTATE:TMP1->fpr[0]
> +  |  .FPU stfd FARG1, CCSTATE:TMP1->fpr[0]
>    |  stw CARG2, CCSTATE:TMP1->gpr[1]
>    |   mtlr TMP0
>    |  stw CARG3, CCSTATE:TMP1->gpr[2]
> @@ -2924,19 +3336,19 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>    case BC_ISLT: case BC_ISGE: case BC_ISLE: case BC_ISGT:
>      |  // RA = src1*8, RD = src2*8, JMP with RD = target
>      |.if DUALNUM
> -    |  lwzux TMP0, RA, BASE
> +    |  lwzux CARG1, RA, BASE
>      |    addi PC, PC, 4
>      |   lwz CARG2, 4(RA)
> -    |  lwzux TMP1, RD, BASE
> +    |  lwzux CARG3, RD, BASE
>      |    lwz TMP2, -4(PC)
> -    |  checknum cr0, TMP0
> -    |   lwz CARG3, 4(RD)
> +    |  checknum cr0, CARG1
> +    |   lwz CARG4, 4(RD)
>      |    decode_RD4 TMP2, TMP2
> -    |  checknum cr1, TMP1
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |  checknum cr1, CARG3
> +    |    addis SAVE0, TMP2, -(BCBIAS_J*4 >> 16)
>      |  bne cr0, >7
>      |  bne cr1, >8
> -    |   cmpw CARG2, CARG3
> +    |   cmpw CARG2, CARG4
>      if (op == BC_ISLT) {
>        |  bge >2
>      } else if (op == BC_ISGE) {
> @@ -2947,28 +3359,41 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>        |  ble >2
>      }
>      |1:
> -    |  add PC, PC, TMP2
> +    |  add PC, PC, SAVE0
>      |2:
>      |  ins_next
>      |
>      |7:  // RA is not an integer.
>      |  bgt cr0, ->vmeta_comp
>      |  // RA is a number.
> -    |   lfd f0, 0(RA)
> +    |   .FPU lfd f0, 0(RA)
>      |  bgt cr1, ->vmeta_comp
>      |  blt cr1, >4
>      |  // RA is a number, RD is an integer.
> -    |  tonum_i f1, CARG3
> +    |.if FPU
> +    |  tonum_i f1, CARG4
> +    |.else
> +    |  bl ->vm_sfi2d_2
> +    |.endif
>      |  b >5
>      |
>      |8: // RA is an integer, RD is not an integer.
>      |  bgt cr1, ->vmeta_comp
>      |  // RA is an integer, RD is a number.
> +    |.if FPU
>      |  tonum_i f0, CARG2
> +    |.else
> +    |  bl ->vm_sfi2d_1
> +    |.endif
>      |4:
> -    |  lfd f1, 0(RD)
> +    |  .FPU lfd f1, 0(RD)
>      |5:
> +    |.if FPU
>      |  fcmpu cr0, f0, f1
> +    |.else
> +    |  blex __ledf2
> +    |  cmpwi CRET1, 0
> +    |.endif
>      if (op == BC_ISLT) {
>        |  bge <2
>      } else if (op == BC_ISGE) {
> @@ -3016,42 +3441,42 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      vk = op == BC_ISEQV;
>      |  // RA = src1*8, RD = src2*8, JMP with RD = target
>      |.if DUALNUM
> -    |  lwzux TMP0, RA, BASE
> +    |  lwzux CARG1, RA, BASE
>      |    addi PC, PC, 4
>      |   lwz CARG2, 4(RA)
> -    |  lwzux TMP1, RD, BASE
> -    |  checknum cr0, TMP0
> -    |    lwz TMP2, -4(PC)
> -    |  checknum cr1, TMP1
> -    |    decode_RD4 TMP2, TMP2
> -    |   lwz CARG3, 4(RD)
> +    |  lwzux CARG3, RD, BASE
> +    |  checknum cr0, CARG1
> +    |    lwz SAVE0, -4(PC)
> +    |  checknum cr1, CARG3
> +    |    decode_RD4 SAVE0, SAVE0
> +    |   lwz CARG4, 4(RD)
>      |  cror 4*cr7+gt, 4*cr0+gt, 4*cr1+gt
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |    addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>      if (vk) {
>        |  ble cr7, ->BC_ISEQN_Z
>      } else {
>        |  ble cr7, ->BC_ISNEN_Z
>      }
>      |.else
> -    |  lwzux TMP0, RA, BASE
> -    |   lwz TMP2, 0(PC)
> +    |  lwzux CARG1, RA, BASE
> +    |   lwz SAVE0, 0(PC)
>      |    lfd f0, 0(RA)
>      |   addi PC, PC, 4
> -    |  lwzux TMP1, RD, BASE
> -    |  checknum cr0, TMP0
> -    |   decode_RD4 TMP2, TMP2
> +    |  lwzux CARG3, RD, BASE
> +    |  checknum cr0, CARG1
> +    |   decode_RD4 SAVE0, SAVE0
>      |    lfd f1, 0(RD)
> -    |  checknum cr1, TMP1
> -    |   addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |  checknum cr1, CARG3
> +    |   addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>      |  bge cr0, >5
>      |  bge cr1, >5
>      |  fcmpu cr0, f0, f1
>      if (vk) {
>        |  bne >1
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>      } else {
>        |  beq >1
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>      }
>      |1:
>      |  ins_next
> @@ -3059,36 +3484,36 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |5:  // Either or both types are not numbers.
>      |.if not DUALNUM
>      |    lwz CARG2, 4(RA)
> -    |    lwz CARG3, 4(RD)
> +    |    lwz CARG4, 4(RD)
>      |.endif
>      |.if FFI
> -    |  cmpwi cr7, TMP0, LJ_TCDATA
> -    |  cmpwi cr5, TMP1, LJ_TCDATA
> +    |  cmpwi cr7, CARG1, LJ_TCDATA
> +    |  cmpwi cr5, CARG3, LJ_TCDATA
>      |.endif
> -    |   not TMP3, TMP0
> -    |  cmplw TMP0, TMP1
> -    |   cmplwi cr1, TMP3, ~LJ_TISPRI		// Primitive?
> +    |   not TMP2, CARG1
> +    |  cmplw CARG1, CARG3
> +    |   cmplwi cr1, TMP2, ~LJ_TISPRI		// Primitive?
>      |.if FFI
>      |  cror 4*cr7+eq, 4*cr7+eq, 4*cr5+eq
>      |.endif
> -    |   cmplwi cr6, TMP3, ~LJ_TISTABUD		// Table or userdata?
> +    |   cmplwi cr6, TMP2, ~LJ_TISTABUD		// Table or userdata?
>      |.if FFI
>      |  beq cr7, ->vmeta_equal_cd
>      |.endif
> -    |    cmplw cr5, CARG2, CARG3
> +    |    cmplw cr5, CARG2, CARG4
>      |  crandc 4*cr0+gt, 4*cr0+eq, 4*cr1+gt	// 2: Same type and primitive.
>      |  crorc 4*cr0+lt, 4*cr5+eq, 4*cr0+eq	// 1: Same tv or different type.
>      |  crand 4*cr0+eq, 4*cr0+eq, 4*cr5+eq	// 0: Same type and same tv.
> -    |   mr SAVE0, PC
> +    |   mr SAVE1, PC
>      |  cror 4*cr0+eq, 4*cr0+eq, 4*cr0+gt	// 0 or 2.
>      |  cror 4*cr0+lt, 4*cr0+lt, 4*cr0+gt	// 1 or 2.
>      if (vk) {
>        |  bne cr0, >6
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>        |6:
>      } else {
>        |  beq cr0, >6
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>        |6:
>      }
>      |.if DUALNUM
> @@ -3103,6 +3528,7 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |
>      |  // Different tables or userdatas. Need to check __eq metamethod.
>      |  // Field metatable must be at same offset for GCtab and GCudata!
> +    |   mr CARG3, CARG4
>      |  lwz TAB:TMP2, TAB:CARG2->metatable
>      |   li CARG4, 1-vk			// ne = 0 or 1.
>      |  cmplwi TAB:TMP2, 0
> @@ -3110,7 +3536,7 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  lbz TMP2, TAB:TMP2->nomm
>      |  andix. TMP2, TMP2, 1<<MM_eq
>      |  bne <1				// Or 'no __eq' flag set?
> -    |  mr PC, SAVE0			// Restore old PC.
> +    |  mr PC, SAVE1			// Restore old PC.
>      |  b ->vmeta_equal			// Handle __eq metamethod.
>      break;
>  
> @@ -3151,16 +3577,16 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      vk = op == BC_ISEQN;
>      |  // RA = src*8, RD = num_const*8, JMP with RD = target
>      |.if DUALNUM
> -    |  lwzux TMP0, RA, BASE
> +    |  lwzux CARG1, RA, BASE
>      |    addi PC, PC, 4
>      |   lwz CARG2, 4(RA)
> -    |  lwzux TMP1, RD, KBASE
> -    |  checknum cr0, TMP0
> -    |    lwz TMP2, -4(PC)
> -    |  checknum cr1, TMP1
> -    |    decode_RD4 TMP2, TMP2
> -    |   lwz CARG3, 4(RD)
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |  lwzux CARG3, RD, KBASE
> +    |  checknum cr0, CARG1
> +    |    lwz SAVE0, -4(PC)
> +    |  checknum cr1, CARG3
> +    |    decode_RD4 SAVE0, SAVE0
> +    |   lwz CARG4, 4(RD)
> +    |    addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>      if (vk) {
>        |->BC_ISEQN_Z:
>      } else {
> @@ -3168,7 +3594,7 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      }
>      |  bne cr0, >7
>      |  bne cr1, >8
> -    |   cmpw CARG2, CARG3
> +    |   cmpw CARG2, CARG4
>      |4:
>      |.else
>      if (vk) {
> @@ -3176,20 +3602,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      } else {
>        |->BC_ISNEN_Z:  // Dummy label.
>      }
> -    |  lwzx TMP0, BASE, RA
> +    |  lwzx CARG1, BASE, RA
>      |    addi PC, PC, 4
>      |   lfdx f0, BASE, RA
> -    |    lwz TMP2, -4(PC)
> +    |    lwz SAVE0, -4(PC)
>      |  lfdx f1, KBASE, RD
> -    |    decode_RD4 TMP2, TMP2
> -    |  checknum TMP0
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |    decode_RD4 SAVE0, SAVE0
> +    |  checknum CARG1
> +    |    addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>      |  bge >3
>      |  fcmpu cr0, f0, f1
>      |.endif
>      if (vk) {
>        |  bne >1
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>        |1:
>        |.if not FFI
>        |3:
> @@ -3200,13 +3626,13 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>        |.if not FFI
>        |3:
>        |.endif
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>        |2:
>      }
>      |  ins_next
>      |.if FFI
>      |3:
> -    |  cmpwi TMP0, LJ_TCDATA
> +    |  cmpwi CARG1, LJ_TCDATA
>      |  beq ->vmeta_equal_cd
>      |  b <1
>      |.endif
> @@ -3214,18 +3640,31 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |7:  // RA is not an integer.
>      |  bge cr0, <3
>      |  // RA is a number.
> -    |   lfd f0, 0(RA)
> +    |   .FPU lfd f0, 0(RA)
>      |  blt cr1, >1
>      |  // RA is a number, RD is an integer.
> -    |  tonum_i f1, CARG3
> +    |.if FPU
> +    |  tonum_i f1, CARG4
> +    |.else
> +    |  bl ->vm_sfi2d_2
> +    |.endif
>      |  b >2
>      |
>      |8: // RA is an integer, RD is a number.
> +    |.if FPU
>      |  tonum_i f0, CARG2
> +    |.else
> +    |  bl ->vm_sfi2d_1
> +    |.endif
>      |1:
> -    |  lfd f1, 0(RD)
> +    |  .FPU lfd f1, 0(RD)
>      |2:
> +    |.if FPU
>      |  fcmpu cr0, f0, f1
> +    |.else
> +    |  blex __ledf2
> +    |  cmpwi CRET1, 0
> +    |.endif
>      |  b <4
>      |.endif
>      break;
> @@ -3280,7 +3719,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>        |  add PC, PC, TMP2
>      } else {
>        |  li TMP1, LJ_TFALSE
> +      |.if FPU
>        |   lfdx f0, BASE, RD
> +      |.else
> +      |   lwzux CARG1, RD, BASE
> +      |   lwz CARG2, 4(RD)
> +      |.endif
>        |  cmplw TMP0, TMP1
>        if (op == BC_ISTC) {
>  	|  bge >1
> @@ -3289,7 +3733,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>        }
>        |  addis PC, PC, -(BCBIAS_J*4 >> 16)
>        |  decode_RD4 TMP2, INS
> +      |.if FPU
>        |   stfdx f0, BASE, RA
> +      |.else
> +      |   stwux CARG1, RA, BASE
> +      |   stw CARG2, 4(RA)
> +      |.endif
>        |  add PC, PC, TMP2
>        |1:
>      }
> @@ -3324,8 +3773,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>    case BC_MOV:
>      |  // RA = dst*8, RD = src*8
>      |  ins_next1
> +    |.if FPU
>      |  lfdx f0, BASE, RD
>      |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwzux TMP0, RD, BASE
> +    |  lwz TMP1, 4(RD)
> +    |  stwux TMP0, RA, BASE
> +    |  stw TMP1, 4(RA)
> +    |.endif
>      |  ins_next2
>      break;
>    case BC_NOT:
> @@ -3427,44 +3883,65 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      ||vk = ((int)op - BC_ADDVN) / (BC_ADDNV-BC_ADDVN);
>      ||switch (vk) {
>      ||case 0:
> -    |   lwzx TMP1, BASE, RB
> +    |   lwzx CARG1, BASE, RB
>      |   .if DUALNUM
> -    |     lwzx TMP2, KBASE, RC
> +    |     lwzx CARG3, KBASE, RC
>      |   .endif
> +    |   .if FPU
>      |    lfdx f14, BASE, RB
>      |    lfdx f15, KBASE, RC
> +    |   .else
> +    |    add TMP1, BASE, RB
> +    |    add TMP2, KBASE, RC
> +    |    lwz CARG2, 4(TMP1)
> +    |    lwz CARG4, 4(TMP2)
> +    |   .endif
>      |   .if DUALNUM
> -    |     checknum cr0, TMP1
> -    |     checknum cr1, TMP2
> +    |     checknum cr0, CARG1
> +    |     checknum cr1, CARG3
>      |     crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |     bge ->vmeta_arith_vn
>      |   .else
> -    |     checknum TMP1; bge ->vmeta_arith_vn
> +    |     checknum CARG1; bge ->vmeta_arith_vn
>      |   .endif
>      ||  break;
>      ||case 1:
> -    |   lwzx TMP1, BASE, RB
> +    |   lwzx CARG1, BASE, RB
>      |   .if DUALNUM
> -    |     lwzx TMP2, KBASE, RC
> +    |     lwzx CARG3, KBASE, RC
>      |   .endif
> +    |   .if FPU
>      |    lfdx f15, BASE, RB
>      |    lfdx f14, KBASE, RC
> +    |   .else
> +    |    add TMP1, BASE, RB
> +    |    add TMP2, KBASE, RC
> +    |    lwz CARG2, 4(TMP1)
> +    |    lwz CARG4, 4(TMP2)
> +    |   .endif
>      |   .if DUALNUM
> -    |     checknum cr0, TMP1
> -    |     checknum cr1, TMP2
> +    |     checknum cr0, CARG1
> +    |     checknum cr1, CARG3
>      |     crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |     bge ->vmeta_arith_nv
>      |   .else
> -    |     checknum TMP1; bge ->vmeta_arith_nv
> +    |     checknum CARG1; bge ->vmeta_arith_nv
>      |   .endif
>      ||  break;
>      ||default:
> -    |   lwzx TMP1, BASE, RB
> -    |   lwzx TMP2, BASE, RC
> +    |   lwzx CARG1, BASE, RB
> +    |   lwzx CARG3, BASE, RC
> +    |   .if FPU
>      |    lfdx f14, BASE, RB
>      |    lfdx f15, BASE, RC
> -    |   checknum cr0, TMP1
> -    |   checknum cr1, TMP2
> +    |   .else
> +    |    add TMP1, BASE, RB
> +    |    add TMP2, BASE, RC
> +    |    lwz CARG2, 4(TMP1)
> +    |    lwz CARG4, 4(TMP2)
> +    |   .endif
> +    |   checknum cr0, CARG1
> +    |   checknum cr1, CARG3
>      |   crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |   bge ->vmeta_arith_vv
>      ||  break;
> @@ -3498,48 +3975,78 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  fsub a, b, a			// b - floor(b/c)*c
>      |.endmacro
>      |
> +    |.macro sfpmod
> +    |->BC_MODVN_Z:
> +    |  stw CARG1, SFSAVE_1
> +    |  stw CARG2, SFSAVE_2
> +    |  mr SAVE0, CARG3
> +    |  mr SAVE1, CARG4
> +    |  blex __divdf3
> +    |  blex floor
> +    |  mr CARG3, SAVE0
> +    |  mr CARG4, SAVE1
> +    |  blex __muldf3
> +    |  mr CARG3, CRET1
> +    |  mr CARG4, CRET2
> +    |  lwz CARG1, SFSAVE_1
> +    |  lwz CARG2, SFSAVE_2
> +    |  blex __subdf3
> +    |.endmacro
> +    |
>      |.macro ins_arithfp, fpins
>      |  ins_arithpre
>      |.if "fpins" == "fpmod_"
>      |  b ->BC_MODVN_Z			// Avoid 3 copies. It's slow anyway.
> -    |.else
> +    |.elif FPU
>      |  fpins f0, f14, f15
>      |  ins_next1
>      |  stfdx f0, BASE, RA
>      |  ins_next2
> +    |.else
> +    |  blex __divdf3			// Only soft-float div uses this macro.
> +    |  ins_next1
> +    |  stwux CRET1, RA, BASE
> +    |  stw CRET2, 4(RA)
> +    |  ins_next2
>      |.endif
>      |.endmacro
>      |
> -    |.macro ins_arithdn, intins, fpins
> +    |.macro ins_arithdn, intins, fpins, fpcall
>      |  // RA = dst*8, RB = src1*8, RC = src2*8 | num_const*8
>      ||vk = ((int)op - BC_ADDVN) / (BC_ADDNV-BC_ADDVN);
>      ||switch (vk) {
>      ||case 0:
> -    |   lwzux TMP1, RB, BASE
> -    |   lwzux TMP2, RC, KBASE
> -    |    lwz CARG1, 4(RB)
> -    |   checknum cr0, TMP1
> -    |    lwz CARG2, 4(RC)
> +    |   lwzux CARG1, RB, BASE
> +    |   lwzux CARG3, RC, KBASE
> +    |    lwz CARG2, 4(RB)
> +    |   checknum cr0, CARG1
> +    |    lwz CARG4, 4(RC)
> +    |   checknum cr1, CARG3
>      ||  break;
>      ||case 1:
> -    |   lwzux TMP1, RB, BASE
> -    |   lwzux TMP2, RC, KBASE
> -    |    lwz CARG2, 4(RB)
> -    |   checknum cr0, TMP1
> -    |    lwz CARG1, 4(RC)
> +    |   lwzux CARG3, RB, BASE
> +    |   lwzux CARG1, RC, KBASE
> +    |    lwz CARG4, 4(RB)
> +    |   checknum cr0, CARG3
> +    |    lwz CARG2, 4(RC)
> +    |   checknum cr1, CARG1
>      ||  break;
>      ||default:
> -    |   lwzux TMP1, RB, BASE
> -    |   lwzux TMP2, RC, BASE
> -    |    lwz CARG1, 4(RB)
> -    |   checknum cr0, TMP1
> -    |    lwz CARG2, 4(RC)
> +    |   lwzux CARG1, RB, BASE
> +    |   lwzux CARG3, RC, BASE
> +    |    lwz CARG2, 4(RB)
> +    |   checknum cr0, CARG1
> +    |    lwz CARG4, 4(RC)
> +    |   checknum cr1, CARG3
>      ||  break;
>      ||}
> -    |  checknum cr1, TMP2
>      |  bne >5
>      |  bne cr1, >5
> -    |  intins CARG1, CARG1, CARG2
> +    |.if "intins" == "intmod"
> +    |  mr CARG1, CARG2
> +    |  mr CARG2, CARG4
> +    |.endif
> +    |  intins CARG1, CARG2, CARG4
>      |  bso >4
>      |1:
>      |  ins_next1
> @@ -3551,29 +4058,40 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  checkov TMP0, <1			// Ignore unrelated overflow.
>      |  ins_arithfallback b
>      |5:  // FP variant.
> +    |.if FPU
>      ||if (vk == 1) {
>      |  lfd f15, 0(RB)
> -    |   crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |  lfd f14, 0(RC)
>      ||} else {
>      |  lfd f14, 0(RB)
> -    |   crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |  lfd f15, 0(RC)
>      ||}
> +    |.endif
> +    |  crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |   ins_arithfallback bge
>      |.if "fpins" == "fpmod_"
>      |  b ->BC_MODVN_Z			// Avoid 3 copies. It's slow anyway.
>      |.else
> +    |.if FPU
>      |  fpins f0, f14, f15
> -    |  ins_next1
>      |  stfdx f0, BASE, RA
> +    |.else
> +    |.if "fpcall" == "sfpmod"
> +    |  sfpmod
> +    |.else
> +    |  blex fpcall
> +    |.endif
> +    |  stwux CRET1, RA, BASE
> +    |  stw CRET2, 4(RA)
> +    |.endif
> +    |  ins_next1
>      |  b <2
>      |.endif
>      |.endmacro
>      |
> -    |.macro ins_arith, intins, fpins
> +    |.macro ins_arith, intins, fpins, fpcall
>      |.if DUALNUM
> -    |  ins_arithdn intins, fpins
> +    |  ins_arithdn intins, fpins, fpcall
>      |.else
>      |  ins_arithfp fpins
>      |.endif
> @@ -3588,9 +4106,9 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  addo. TMP0, TMP0, TMP3
>      |  add y, a, b
>      |.endmacro
> -    |  ins_arith addo32., fadd
> +    |  ins_arith addo32., fadd, __adddf3
>      |.else
> -    |  ins_arith addo., fadd
> +    |  ins_arith addo., fadd, __adddf3
>      |.endif
>      break;
>    case BC_SUBVN: case BC_SUBNV: case BC_SUBVV:
> @@ -3602,36 +4120,48 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  subo. TMP0, TMP0, TMP3
>      |  sub y, a, b
>      |.endmacro
> -    |  ins_arith subo32., fsub
> +    |  ins_arith subo32., fsub, __subdf3
>      |.else
> -    |  ins_arith subo., fsub
> +    |  ins_arith subo., fsub, __subdf3
>      |.endif
>      break;
>    case BC_MULVN: case BC_MULNV: case BC_MULVV:
> -    |  ins_arith mullwo., fmul
> +    |  ins_arith mullwo., fmul, __muldf3
>      break;
>    case BC_DIVVN: case BC_DIVNV: case BC_DIVVV:
>      |  ins_arithfp fdiv
>      break;
>    case BC_MODVN:
> -    |  ins_arith intmod, fpmod
> +    |  ins_arith intmod, fpmod, sfpmod
>      break;
>    case BC_MODNV: case BC_MODVV:
> -    |  ins_arith intmod, fpmod_
> +    |  ins_arith intmod, fpmod_, sfpmod
>      break;
>    case BC_POW:
>      |  // NYI: (partial) integer arithmetic.
> -    |  lwzx TMP1, BASE, RB
> +    |  lwzx CARG1, BASE, RB
> +    |  lwzx CARG3, BASE, RC
> +    |.if FPU
>      |   lfdx FARG1, BASE, RB
> -    |  lwzx TMP2, BASE, RC
>      |   lfdx FARG2, BASE, RC
> -    |  checknum cr0, TMP1
> -    |  checknum cr1, TMP2
> +    |.else
> +    |   add TMP1, BASE, RB
> +    |   add TMP2, BASE, RC
> +    |   lwz CARG2, 4(TMP1)
> +    |   lwz CARG4, 4(TMP2)
> +    |.endif
> +    |  checknum cr0, CARG1
> +    |  checknum cr1, CARG3
>      |  crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>      |  bge ->vmeta_arith_vv
>      |  blex pow
>      |  ins_next1
> +    |.if FPU
>      |  stfdx FARG1, BASE, RA
> +    |.else
> +    |  stwux CARG1, RA, BASE
> +    |  stw CARG2, 4(RA)
> +    |.endif
>      |  ins_next2
>      break;
>  
> @@ -3651,8 +4181,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   lp BASE, L->base
>      |  bne ->vmeta_binop
>      |  ins_next1
> +    |.if FPU
>      |  lfdx f0, BASE, SAVE0		// Copy result from RB to RA.
>      |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwzux TMP0, SAVE0, BASE
> +    |  lwz TMP1, 4(SAVE0)
> +    |  stwux TMP0, RA, BASE
> +    |  stw TMP1, 4(RA)
> +    |.endif
>      |  ins_next2
>      break;
>  
> @@ -3715,8 +4252,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>    case BC_KNUM:
>      |  // RA = dst*8, RD = num_const*8
>      |  ins_next1
> +    |.if FPU
>      |  lfdx f0, KBASE, RD
>      |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwzux TMP0, RD, KBASE
> +    |  lwz TMP1, 4(RD)
> +    |  stwux TMP0, RA, BASE
> +    |  stw TMP1, 4(RA)
> +    |.endif
>      |  ins_next2
>      break;
>    case BC_KPRI:
> @@ -3749,8 +4293,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  lwzx UPVAL:RB, LFUNC:RB, RD
>      |  ins_next1
>      |  lwz TMP1, UPVAL:RB->v
> +    |.if FPU
>      |  lfd f0, 0(TMP1)
>      |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwz TMP2, 0(TMP1)
> +    |  lwz TMP3, 4(TMP1)
> +    |  stwux TMP2, RA, BASE
> +    |  stw TMP3, 4(RA)
> +    |.endif
>      |  ins_next2
>      break;
>    case BC_USETV:
> @@ -3758,14 +4309,24 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  lwz LFUNC:RB, FRAME_FUNC(BASE)
>      |    srwi RA, RA, 1
>      |    addi RA, RA, offsetof(GCfuncL, uvptr)
> +    |.if FPU
>      |   lfdux f0, RD, BASE
> +    |.else
> +    |   lwzux CARG1, RD, BASE
> +    |   lwz CARG3, 4(RD)
> +    |.endif
>      |  lwzx UPVAL:RB, LFUNC:RB, RA
>      |  lbz TMP3, UPVAL:RB->marked
>      |   lwz CARG2, UPVAL:RB->v
>      |  andix. TMP3, TMP3, LJ_GC_BLACK	// isblack(uv)
>      |    lbz TMP0, UPVAL:RB->closed
>      |   lwz TMP2, 0(RD)
> +    |.if FPU
>      |   stfd f0, 0(CARG2)
> +    |.else
> +    |   stw CARG1, 0(CARG2)
> +    |   stw CARG3, 4(CARG2)
> +    |.endif
>      |    cmplwi cr1, TMP0, 0
>      |   lwz TMP1, 4(RD)
>      |  cror 4*cr0+eq, 4*cr0+eq, 4*cr1+eq
> @@ -3821,11 +4382,21 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  lwz LFUNC:RB, FRAME_FUNC(BASE)
>      |   srwi RA, RA, 1
>      |   addi RA, RA, offsetof(GCfuncL, uvptr)
> +    |.if FPU
>      |    lfdx f0, KBASE, RD
> +    |.else
> +    |    lwzux TMP2, RD, KBASE
> +    |    lwz TMP3, 4(RD)
> +    |.endif
>      |  lwzx UPVAL:RB, LFUNC:RB, RA
>      |  ins_next1
>      |  lwz TMP1, UPVAL:RB->v
> +    |.if FPU
>      |  stfd f0, 0(TMP1)
> +    |.else
> +    |  stw TMP2, 0(TMP1)
> +    |  stw TMP3, 4(TMP1)
> +    |.endif
>      |  ins_next2
>      break;
>    case BC_USETP:
> @@ -3973,11 +4544,21 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |.endif
>      |  ble ->vmeta_tgetv		// Integer key and in array part?
>      |  lwzx TMP0, TMP1, TMP2
> +    |.if FPU
>      |   lfdx f14, TMP1, TMP2
> +    |.else
> +    |   lwzux SAVE0, TMP1, TMP2
> +    |   lwz SAVE1, 4(TMP1)
> +    |.endif
>      |  checknil TMP0; beq >2
>      |1:
>      |  ins_next1
> +    |.if FPU
>      |   stfdx f14, BASE, RA
> +    |.else
> +    |   stwux SAVE0, RA, BASE
> +    |   stw SAVE1, 4(RA)
> +    |.endif
>      |  ins_next2
>      |
>      |2:  // Check for __index if table value is nil.
> @@ -4053,12 +4634,22 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  lwz TMP1, TAB:RB->asize
>      |   lwz TMP2, TAB:RB->array
>      |  cmplw TMP0, TMP1; bge ->vmeta_tgetb
> +    |.if FPU
>      |  lwzx TMP1, TMP2, RC
>      |   lfdx f0, TMP2, RC
> +    |.else
> +    |  lwzux TMP1, TMP2, RC
> +    |   lwz TMP3, 4(TMP2)
> +    |.endif
>      |  checknil TMP1; beq >5
>      |1:
>      |  ins_next1
> +    |.if FPU
>      |   stfdx f0, BASE, RA
> +    |.else
> +    |   stwux TMP1, RA, BASE
> +    |   stw TMP3, 4(RA)
> +    |.endif
>      |  ins_next2
>      |
>      |5:  // Check for __index if table value is nil.
> @@ -4088,10 +4679,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  cmplw TMP0, CARG2
>      |   slwi TMP2, CARG2, 3
>      |  ble ->vmeta_tgetr		// In array part?
> +    |.if FPU
>      |   lfdx f14, TMP1, TMP2
> +    |.else
> +    |   lwzux SAVE0, TMP2, TMP1
> +    |   lwz SAVE1, 4(TMP2)
> +    |.endif
>      |->BC_TGETR_Z:
>      |  ins_next1
> +    |.if FPU
>      |   stfdx f14, BASE, RA
> +    |.else
> +    |   stwux SAVE0, RA, BASE
> +    |   stw SAVE1, 4(RA)
> +    |.endif
>      |  ins_next2
>      break;
>  
> @@ -4132,11 +4733,22 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  ble ->vmeta_tsetv		// Integer key and in array part?
>      |   lwzx TMP2, TMP1, TMP0
>      |  lbz TMP3, TAB:RB->marked
> +    |.if FPU
>      |    lfdx f14, BASE, RA
> +    |.else
> +    |    add SAVE1, BASE, RA
> +    |    lwz SAVE0, 0(SAVE1)
> +    |    lwz SAVE1, 4(SAVE1)
> +    |.endif
>      |   checknil TMP2; beq >3
>      |1:
>      |  andix. TMP2, TMP3, LJ_GC_BLACK	// isblack(table)
> +    |.if FPU
>      |    stfdx f14, TMP1, TMP0
> +    |.else
> +    |    stwux SAVE0, TMP1, TMP0
> +    |    stw SAVE1, 4(TMP1)
> +    |.endif
>      |  bne >7
>      |2:
>      |  ins_next
> @@ -4177,7 +4789,13 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  lwz NODE:TMP2, TAB:RB->node
>      |    stb ZERO, TAB:RB->nomm		// Clear metamethod cache.
>      |  and TMP1, TMP1, TMP0		// idx = str->hash & tab->hmask
> +    |.if FPU
>      |    lfdx f14, BASE, RA
> +    |.else
> +    |    add CARG2, BASE, RA
> +    |    lwz SAVE0, 0(CARG2)
> +    |    lwz SAVE1, 4(CARG2)
> +    |.endif
>      |  slwi TMP0, TMP1, 5
>      |  slwi TMP1, TMP1, 3
>      |  sub TMP1, TMP0, TMP1
> @@ -4193,7 +4811,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |    checknil CARG2; beq >4		// Key found, but nil value?
>      |2:
>      |  andix. TMP0, TMP3, LJ_GC_BLACK	// isblack(table)
> +    |.if FPU
>      |    stfd f14, NODE:TMP2->val
> +    |.else
> +    |    stw SAVE0, NODE:TMP2->val.u32.hi
> +    |    stw SAVE1, NODE:TMP2->val.u32.lo
> +    |.endif
>      |  bne >7
>      |3:
>      |  ins_next
> @@ -4232,7 +4855,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  bl extern lj_tab_newkey		// (lua_State *L, GCtab *t, TValue *k)
>      |  // Returns TValue *.
>      |  lp BASE, L->base
> +    |.if FPU
>      |  stfd f14, 0(CRET1)
> +    |.else
> +    |  stw SAVE0, 0(CRET1)
> +    |  stw SAVE1, 4(CRET1)
> +    |.endif
>      |  b <3				// No 2nd write barrier needed.
>      |
>      |7:  // Possible table write barrier for the value. Skip valiswhite check.
> @@ -4249,13 +4877,24 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   lwz TMP2, TAB:RB->array
>      |    lbz TMP3, TAB:RB->marked
>      |  cmplw TMP0, TMP1
> +    |.if FPU
>      |   lfdx f14, BASE, RA
> +    |.else
> +    |   add CARG2, BASE, RA
> +    |   lwz SAVE0, 0(CARG2)
> +    |   lwz SAVE1, 4(CARG2)
> +    |.endif
>      |  bge ->vmeta_tsetb
>      |  lwzx TMP1, TMP2, RC
>      |  checknil TMP1; beq >5
>      |1:
>      |  andix. TMP0, TMP3, LJ_GC_BLACK	// isblack(table)
> +    |.if FPU
>      |   stfdx f14, TMP2, RC
> +    |.else
> +    |   stwux SAVE0, RC, TMP2
> +    |   stw SAVE1, 4(RC)
> +    |.endif
>      |  bne >7
>      |2:
>      |  ins_next
> @@ -4295,10 +4934,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |2:
>      |  cmplw TMP0, CARG3
>      |   slwi TMP2, CARG3, 3
> +    |.if FPU
>      |   lfdx f14, BASE, RA
> +    |.else
> +    |  lwzux SAVE0, RA, BASE
> +    |  lwz SAVE1, 4(RA)
> +    |.endif
>      |  ble ->vmeta_tsetr		// In array part?
>      |  ins_next1
> +    |.if FPU
>      |   stfdx f14, TMP1, TMP2
> +    |.else
> +    |   stwux SAVE0, TMP1, TMP2
> +    |   stw SAVE1, 4(TMP1)
> +    |.endif
>      |  ins_next2
>      |
>      |7:  // Possible table write barrier for the value. Skip valiswhite check.
> @@ -4328,10 +4977,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   add TMP1, TMP1, TMP0
>      |    andix. TMP0, TMP3, LJ_GC_BLACK	// isblack(table)
>      |3:  // Copy result slots to table.
> +    |.if FPU
>      |   lfd f0, 0(RA)
> +    |.else
> +    |   lwz SAVE0, 0(RA)
> +    |   lwz SAVE1, 4(RA)
> +    |.endif
>      |  addi RA, RA, 8
>      |  cmpw cr1, RA, TMP2
> +    |.if FPU
>      |   stfd f0, 0(TMP1)
> +    |.else
> +    |   stw SAVE0, 0(TMP1)
> +    |   stw SAVE1, 4(TMP1)
> +    |.endif
>      |    addi TMP1, TMP1, 8
>      |  blt cr1, <3
>      |  bne >7
> @@ -4398,9 +5057,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |    beq cr1, >3
>      |2:
>      |  addi TMP3, TMP2, 8
> +    |.if FPU
>      |   lfdx f0, RA, TMP2
> +    |.else
> +    |   add CARG3, RA, TMP2
> +    |   lwz CARG1, 0(CARG3)
> +    |   lwz CARG2, 4(CARG3)
> +    |.endif
>      |  cmplw cr1, TMP3, NARGS8:RC
> +    |.if FPU
>      |   stfdx f0, BASE, TMP2
> +    |.else
> +    |   stwux CARG1, TMP2, BASE
> +    |   stw CARG2, 4(TMP2)
> +    |.endif
>      |  mr TMP2, TMP3
>      |  bne cr1, <2
>      |3:
> @@ -4433,14 +5103,28 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |  add BASE, BASE, RA
>      |  lwz TMP1, -24(BASE)
>      |   lwz LFUNC:RB, -20(BASE)
> +    |.if FPU
>      |    lfd f1, -8(BASE)
>      |    lfd f0, -16(BASE)
> +    |.else
> +    |    lwz CARG1, -8(BASE)
> +    |    lwz CARG2, -4(BASE)
> +    |    lwz CARG3, -16(BASE)
> +    |    lwz CARG4, -12(BASE)
> +    |.endif
>      |  stw TMP1, 0(BASE)		// Copy callable.
>      |   stw LFUNC:RB, 4(BASE)
>      |  checkfunc TMP1
> -    |    stfd f1, 16(BASE)		// Copy control var.
>      |     li NARGS8:RC, 16		// Iterators get 2 arguments.
> +    |.if FPU
> +    |    stfd f1, 16(BASE)		// Copy control var.
>      |    stfdu f0, 8(BASE)		// Copy state.
> +    |.else
> +    |    stw CARG1, 16(BASE)		// Copy control var.
> +    |    stw CARG2, 20(BASE)
> +    |    stwu CARG3, 8(BASE)		// Copy state.
> +    |    stw CARG4, 4(BASE)
> +    |.endif
>      |  bne ->vmeta_call
>      |  ins_call
>      break;
> @@ -4461,7 +5145,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   slwi TMP3, RC, 3
>      |  bge >5				// Index points after array part?
>      |  lwzx TMP2, TMP1, TMP3
> +    |.if FPU
>      |   lfdx f0, TMP1, TMP3
> +    |.else
> +    |   lwzux CARG1, TMP3, TMP1
> +    |   lwz CARG2, 4(TMP3)
> +    |.endif
>      |  checknil TMP2
>      |     lwz INS, -4(PC)
>      |  beq >4
> @@ -4473,7 +5162,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |.endif
>      |    addi RC, RC, 1
>      |     addis TMP3, PC, -(BCBIAS_J*4 >> 16)
> +    |.if FPU
>      |  stfd f0, 8(RA)
> +    |.else
> +    |  stw CARG1, 8(RA)
> +    |  stw CARG2, 12(RA)
> +    |.endif
>      |     decode_RD4 TMP1, INS
>      |    stw RC, -4(RA)			// Update control var.
>      |     add PC, TMP1, TMP3
> @@ -4498,17 +5192,38 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   slwi RB, RC, 3
>      |   sub TMP3, TMP3, RB
>      |  lwzx RB, TMP2, TMP3
> +    |.if FPU
>      |  lfdx f0, TMP2, TMP3
> +    |.else
> +    |  add CARG3, TMP2, TMP3
> +    |  lwz CARG1, 0(CARG3)
> +    |  lwz CARG2, 4(CARG3)
> +    |.endif
>      |   add NODE:TMP3, TMP2, TMP3
>      |  checknil RB
>      |     lwz INS, -4(PC)
>      |  beq >7
> +    |.if FPU
>      |   lfd f1, NODE:TMP3->key
> +    |.else
> +    |   lwz CARG3, NODE:TMP3->key.u32.hi
> +    |   lwz CARG4, NODE:TMP3->key.u32.lo
> +    |.endif
>      |     addis TMP2, PC, -(BCBIAS_J*4 >> 16)
> +    |.if FPU
>      |  stfd f0, 8(RA)
> +    |.else
> +    |  stw CARG1, 8(RA)
> +    |  stw CARG2, 12(RA)
> +    |.endif
>      |    add RC, RC, TMP0
>      |     decode_RD4 TMP1, INS
> +    |.if FPU
>      |   stfd f1, 0(RA)
> +    |.else
> +    |   stw CARG3, 0(RA)
> +    |   stw CARG4, 4(RA)
> +    |.endif
>      |    addi RC, RC, 1
>      |     add PC, TMP1, TMP2
>      |    stw RC, -4(RA)			// Update control var.
> @@ -4574,9 +5289,19 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   subi TMP2, TMP2, 16
>      |   ble >2				// No vararg slots?
>      |1:  // Copy vararg slots to destination slots.
> +    |.if FPU
>      |  lfd f0, 0(RC)
> +    |.else
> +    |  lwz CARG1, 0(RC)
> +    |  lwz CARG2, 4(RC)
> +    |.endif
>      |   addi RC, RC, 8
> +    |.if FPU
>      |  stfd f0, 0(RA)
> +    |.else
> +    |  stw CARG1, 0(RA)
> +    |  stw CARG2, 4(RA)
> +    |.endif
>      |  cmplw RA, TMP2
>      |   cmplw cr1, RC, TMP3
>      |  bge >3				// All destination slots filled?
> @@ -4599,9 +5324,19 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   addi MULTRES, TMP1, 8
>      |  bgt >7
>      |6:
> +    |.if FPU
>      |  lfd f0, 0(RC)
> +    |.else
> +    |  lwz CARG1, 0(RC)
> +    |  lwz CARG2, 4(RC)
> +    |.endif
>      |   addi RC, RC, 8
> +    |.if FPU
>      |  stfd f0, 0(RA)
> +    |.else
> +    |  stw CARG1, 0(RA)
> +    |  stw CARG2, 4(RA)
> +    |.endif
>      |  cmplw RC, TMP3
>      |   addi RA, RA, 8
>      |  blt <6				// More vararg slots?
> @@ -4652,14 +5387,38 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   li TMP1, 0
>      |2:
>      |  addi TMP3, TMP1, 8
> +    |.if FPU
>      |   lfdx f0, RA, TMP1
> +    |.else
> +    |   add CARG3, RA, TMP1
> +    |   lwz CARG1, 0(CARG3)
> +    |   lwz CARG2, 4(CARG3)
> +    |.endif
>      |  cmpw TMP3, RC
> +    |.if FPU
>      |   stfdx f0, TMP2, TMP1
> +    |.else
> +    |   add CARG3, TMP2, TMP1
> +    |   stw CARG1, 0(CARG3)
> +    |   stw CARG2, 4(CARG3)
> +    |.endif
>      |  beq >3
>      |  addi TMP1, TMP3, 8
> +    |.if FPU
>      |   lfdx f1, RA, TMP3
> +    |.else
> +    |   add CARG3, RA, TMP3
> +    |   lwz CARG1, 0(CARG3)
> +    |   lwz CARG2, 4(CARG3)
> +    |.endif
>      |  cmpw TMP1, RC
> +    |.if FPU
>      |   stfdx f1, TMP2, TMP3
> +    |.else
> +    |   add CARG3, TMP2, TMP3
> +    |   stw CARG1, 0(CARG3)
> +    |   stw CARG2, 4(CARG3)
> +    |.endif
>      |  bne <2
>      |3:
>      |5:
> @@ -4701,8 +5460,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      |   subi TMP2, BASE, 8
>      |  decode_RB8 RB, INS
>      if (op == BC_RET1) {
> +      |.if FPU
>        |  lfd f0, 0(RA)
>        |  stfd f0, 0(TMP2)
> +      |.else
> +      |  lwz CARG1, 0(RA)
> +      |  lwz CARG2, 4(RA)
> +      |  stw CARG1, 0(TMP2)
> +      |  stw CARG2, 4(TMP2)
> +      |.endif
>      }
>      |5:
>      |  cmplw RB, RD
> @@ -4763,11 +5529,11 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>        |4:
>        |  stw CARG1, FORL_IDX*8+4(RA)
>      } else {
> -      |  lwz TMP3, FORL_STEP*8(RA)
> +      |  lwz SAVE0, FORL_STEP*8(RA)
>        |   lwz CARG3, FORL_STEP*8+4(RA)
>        |  lwz TMP2, FORL_STOP*8(RA)
>        |   lwz CARG2, FORL_STOP*8+4(RA)
> -      |  cmplw cr7, TMP3, TISNUM
> +      |  cmplw cr7, SAVE0, TISNUM
>        |  cmplw cr1, TMP2, TISNUM
>        |  crand 4*cr0+eq, 4*cr0+eq, 4*cr7+eq
>        |  crand 4*cr0+eq, 4*cr0+eq, 4*cr1+eq
> @@ -4810,41 +5576,80 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>      if (vk) {
>        |.if DUALNUM
>        |9:  // FP loop.
> +      |.if FPU
>        |  lfd f1, FORL_IDX*8(RA)
>        |.else
> +      |  lwz CARG1, FORL_IDX*8(RA)
> +      |  lwz CARG2, FORL_IDX*8+4(RA)
> +      |.endif
> +      |.else
>        |  lfdux f1, RA, BASE
>        |.endif
> +      |.if FPU
>        |  lfd f3, FORL_STEP*8(RA)
>        |  lfd f2, FORL_STOP*8(RA)
> -      |   lwz TMP3, FORL_STEP*8(RA)
>        |  fadd f1, f1, f3
>        |  stfd f1, FORL_IDX*8(RA)
> +      |.else
> +      |  lwz CARG3, FORL_STEP*8(RA)
> +      |  lwz CARG4, FORL_STEP*8+4(RA)
> +      |  mr SAVE1, RD
> +      |  blex __adddf3
> +      |  mr RD, SAVE1
> +      |  stw CRET1, FORL_IDX*8(RA)
> +      |  stw CRET2, FORL_IDX*8+4(RA)
> +      |  lwz CARG3, FORL_STOP*8(RA)
> +      |  lwz CARG4, FORL_STOP*8+4(RA)
> +      |.endif
> +      |   lwz SAVE0, FORL_STEP*8(RA)
>      } else {
>        |.if DUALNUM
>        |9:  // FP loop.
>        |.else
>        |  lwzux TMP1, RA, BASE
> -      |  lwz TMP3, FORL_STEP*8(RA)
> +      |  lwz SAVE0, FORL_STEP*8(RA)
>        |  lwz TMP2, FORL_STOP*8(RA)
>        |  cmplw cr0, TMP1, TISNUM
> -      |  cmplw cr7, TMP3, TISNUM
> +      |  cmplw cr7, SAVE0, TISNUM
>        |  cmplw cr1, TMP2, TISNUM
>        |.endif
> +      |.if FPU
>        |   lfd f1, FORL_IDX*8(RA)
> +      |.else
> +      |   lwz CARG1, FORL_IDX*8(RA)
> +      |   lwz CARG2, FORL_IDX*8+4(RA)
> +      |.endif
>        |  crand 4*cr0+lt, 4*cr0+lt, 4*cr7+lt
>        |  crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
> +      |.if FPU
>        |   lfd f2, FORL_STOP*8(RA)
> +      |.else
> +      |   lwz CARG3, FORL_STOP*8(RA)
> +      |   lwz CARG4, FORL_STOP*8+4(RA)
> +      |.endif
>        |  bge ->vmeta_for
>      }
> -    |  cmpwi cr6, TMP3, 0
> +    |  cmpwi cr6, SAVE0, 0
>      if (op != BC_JFORL) {
>        |  srwi RD, RD, 1
>      }
> +    |.if FPU
>      |   stfd f1, FORL_EXT*8(RA)
> +    |.else
> +    |   stw CARG1, FORL_EXT*8(RA)
> +    |   stw CARG2, FORL_EXT*8+4(RA)
> +    |.endif
>      if (op != BC_JFORL) {
>        |  add RD, PC, RD
>      }
> +    |.if FPU
>      |  fcmpu cr0, f1, f2
> +    |.else
> +    |  mr SAVE1, RD
> +    |  blex __ledf2
> +    |  cmpwi CRET1, 0
> +    |  mr RD, SAVE1
> +    |.endif
>      if (op == BC_JFORI) {
>        |  addis PC, RD, -(BCBIAS_J*4 >> 16)
>      }
> -- 
> 2.41.0
>