From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounces@dev.tarantool.org>
Received: from [87.239.111.99] (localhost [127.0.0.1])
	by dev.tarantool.org (Postfix) with ESMTP id C2E455C320C;
	Thu, 17 Aug 2023 17:53:25 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org C2E455C320C
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev;
	t=1692284005; bh=R2fmUe4KZMW6Fb9Q9+KqVVyap2uo6wYiKbSdAXbZcMQ=;
	h=Date:To:Cc:References:In-Reply-To:Subject:List-Id:
	 List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe:
	 From:Reply-To:From;
	b=JPFJdsM8oFbAe+tmKj5C2EYyXMa/iZ4QL53WPmqo6qCMXqMcoG9zamBiyW77YYFhX
	 OJZupIMbIVXXdaPrN6p1tzf/mO+GvvHwFiSbiBsXwieIc0MJgjPfGbFDH/juRYctH/
	 8z5pCKKwsYjOFMQufrmqba1KpASBhxHicf4JQ4ds=
Received: from smtp35.i.mail.ru (smtp35.i.mail.ru [95.163.41.76])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id C49915C320C
 for <tarantool-patches@dev.tarantool.org>;
 Thu, 17 Aug 2023 17:53:24 +0300 (MSK)
DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org C49915C320C
Received: by smtp35.i.mail.ru with esmtpa (envelope-from
 <sergeyb@tarantool.org>)
 id 1qWeNH-007Dko-1h; Thu, 17 Aug 2023 17:53:24 +0300
Message-ID: <6eb33fda-1c3b-7ce2-f7f5-955229429873@tarantool.org>
Date: Thu, 17 Aug 2023 17:53:23 +0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.13.0
Content-Language: en-US
To: Sergey Kaplun <skaplun@tarantool.org>, Igor Munkin <imun@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
References: <cover.1691592488.git.skaplun@tarantool.org>
 <e0d5e1d22c2349991be8efff5dae3e5533696c97.1691592488.git.skaplun@tarantool.org>
In-Reply-To: <e0d5e1d22c2349991be8efff5dae3e5533696c97.1691592488.git.skaplun@tarantool.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Mailru-Src: smtp
X-4EC0790: 10
X-7564579A: 646B95376F6C166E
X-77F55803: 4F1203BC0FB41BD9700E0DCE2907754D1399757346C038D9B2FE68E137CB1837182A05F538085040C166940AD2354C3C3209D0D68FDDE11C0E6F343DCF10A4A38200ED41C4AA01B7
X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE780D115B306136E0AEA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F790063788758EA7442DD2858638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D87D4401612F3AFC419DEC67508F7BFD8C117882F4460429724CE54428C33FAD305F5C1EE8F4F765FCDCBA8CBAA3833548A471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F446042972877693876707352033AC447995A7AD18BDFBBEFFF4125B51D2E47CDBA5A96583BA9C0B312567BB2376E601842F6C81A19E625A9149C048EE26055571C92BF10F287C8E22D4AE2A51D8FC6C240DEA76429C9F4D5AE37F343AA9539A8B242431040A6AB1C7CE11FEE32D01283D1ACF37BA302FCEF25BFAB345C4224003CC836476E2F48590F00D11D6E2021AF6380DFAD1A18204E546F3947C2FFDA4F57982C5F42E808ACE2090B5E1725E5C173C3A84C34964A708C60C975A089D37D7C0E48F6C8AA50765F79006373BA04B6A498D0BA4731C566533BA786AA5CC5B56E945C8DA
X-B7AD71C0: 4965CFDFE0519134C1FE400A9E48C5401DD40DE57556AFB266D16FC5F53507A1816E0A2A8F779BBED8D40077074E805C66D16FC5F53507A117535B0CF9F6D0C3EE9D5CB6078CC77CED87DEF3E013AF28EFFFE7C7C1A70394
X-C1DE0DAB: 0D63561A33F958A5C59FCEC3F48A39AF3EF1F0C60D6113E166D32F0EC6D2A0F4F87CCE6106E1FC07E67D4AC08A07B9B04E7D9683544204AFCB5012B2E24CD356
X-C8649E89: 1C3962B70DF3F0AD75DCE07D45A749953FED46C3ACD6F73ED3581295AF09D3DF87807E0823442EA2ED31085941D9CD0AF7F820E7B07EA4CFC2CED21642DB732680B972B0CC9B070641DED7D6619D935B7770ED11B6ED1804F6452569441AF36424F7434E9608BF9FD24239AACBFE4720E5CCB0490AD0A615E48CAC7CA610320002C26D483E81D6BE0DBAE6F56676BC7117BB6831D7356A2DEC5B5AD62611EEC62B5AFB4261A09AF0
X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojFRrmMqSMPxr6IG1/5e0/Bw==
X-Mailru-Sender: 11C2EC085EDE56FAC07928AF2646A76911A8B28D3081CB003209D0D68FDDE11C14D23FB96A9F340BEBA65886582A37BD66FEC6BF5C9C28D98A98C1125256619760D574B6FC815AB872D6B4FCE48DF648AE208404248635DF
X-Mras: Ok
Subject: Re: [Tarantool-patches] [PATCH luajit 05/19] PPC: Add soft-float
 support to interpreter.
X-BeenThere: tarantool-patches@dev.tarantool.org
X-Mailman-Version: 2.1.34
Precedence: list
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
From: Sergey Bronnikov via Tarantool-patches
 <tarantool-patches@dev.tarantool.org>
Reply-To: Sergey Bronnikov <sergeyb@tarantool.org>
Errors-To: tarantool-patches-bounces@dev.tarantool.org
Sender: "Tarantool-patches" <tarantool-patches-bounces@dev.tarantool.org>

Hi, Sergey


thanks for the patch! LGTM

On 8/9/23 18:35, Sergey Kaplun wrote:
> From: Mike Pall <mike>
>
> Contributed by Djordje Kovacevic and Stefan Pejic from RT-RK.com.
> Sponsored by Cisco Systems, Inc.
>
> (cherry-picked from commit 71b7bc88341945f13f3951e2bb5fd247b639ff7a)
>
> The software floating point library is used on machines which do not
> have hardware support for floating point [1]. This patch enables
> support for such machines in the VM for powerpc.
> This includes:
> * Any loads/storages of double values use load/storage through 32-bit
>    registers of `lo` and `hi` part of the TValue union.
> * Macro .FPU is added to skip instructions necessary only for
>    hard-float operations (load/store floating point registers from/on the
>    stack, when leave/enter VM, for example).
> * Now r25 named as `SAVE1` is used as saved temporary register (used in
>    different fast functions)
> * `sfi2d` macro is introduced to convert integer, that represents a
>    soft-float, to double. Receives destination and source registers, uses
>    `TMP0` and `TMP1`.
> * `sfpmod` macro is introduced for soft-float point `fmod` built-in.
> * `ins_arith` now receives the third parameter -- operation to use for
>    soft-float point.
> * `LJ_ARCH_HASFPU`, `LJ_ABI_SOFTFP` macros are introduced to mark that
>    there is defined `_SOFT_FLOAT` or `_SOFT_DOUBLE`. `LJ_ARCH_NUMMODE` is
>    set to the `LJ_NUMMODE_DUAL`, when `LJ_ABI_SOFTFP` is true.
>
> Support of soft-float point for the JIT compiler will be added in the
> next patch.
>
> [1]: https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
>
> Sergey Kaplun:
> * added the description for the feature
>
> Part of tarantool/tarantool#8825
> ---
>   src/host/buildvm_asm.c |    2 +-
>   src/lj_arch.h          |   29 +-
>   src/lj_ccall.c         |   38 +-
>   src/lj_ccall.h         |    4 +-
>   src/lj_ccallback.c     |   30 +-
>   src/lj_frame.h         |    2 +-
>   src/lj_ircall.h        |    2 +-
>   src/vm_ppc.dasc        | 1249 +++++++++++++++++++++++++++++++++-------
>   8 files changed, 1101 insertions(+), 255 deletions(-)
>
> diff --git a/src/host/buildvm_asm.c b/src/host/buildvm_asm.c
> index ffd14903..43595b31 100644
> --- a/src/host/buildvm_asm.c
> +++ b/src/host/buildvm_asm.c
> @@ -338,7 +338,7 @@ void emit_asm(BuildCtx *ctx)
>   #if !(LJ_TARGET_PS3 || LJ_TARGET_PSVITA)
>       fprintf(ctx->fp, "\t.section .note.GNU-stack,\"\"," ELFASM_PX "progbits\n");
>   #endif
> -#if LJ_TARGET_PPC && !LJ_TARGET_PS3
> +#if LJ_TARGET_PPC && !LJ_TARGET_PS3 && !LJ_ABI_SOFTFP
>       /* Hard-float ABI. */
>       fprintf(ctx->fp, "\t.gnu_attribute 4, 1\n");
>   #endif
> diff --git a/src/lj_arch.h b/src/lj_arch.h
> index c39526ea..8bb8757d 100644
> --- a/src/lj_arch.h
> +++ b/src/lj_arch.h
> @@ -262,6 +262,29 @@
>   #else
>   #define LJ_ARCH_BITS		32
>   #define LJ_ARCH_NAME		"ppc"
> +
> +#if !defined(LJ_ARCH_HASFPU)
> +#if defined(_SOFT_FLOAT) || defined(_SOFT_DOUBLE)
> +#define LJ_ARCH_HASFPU		0
> +#else
> +#define LJ_ARCH_HASFPU		1
> +#endif
> +#endif
> +
> +#if !defined(LJ_ABI_SOFTFP)
> +#if defined(_SOFT_FLOAT) || defined(_SOFT_DOUBLE)
> +#define LJ_ABI_SOFTFP		1
> +#else
> +#define LJ_ABI_SOFTFP		0
> +#endif
> +#endif
> +#endif
> +
> +#if LJ_ABI_SOFTFP
> +#define LJ_ARCH_NOJIT		1  /* NYI */
> +#define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL
> +#else
> +#define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL_SINGLE
>   #endif
>   
>   #define LJ_TARGET_PPC		1
> @@ -271,7 +294,6 @@
>   #define LJ_TARGET_MASKSHIFT	0
>   #define LJ_TARGET_MASKROT	1
>   #define LJ_TARGET_UNIFYROT	1	/* Want only IR_BROL. */
> -#define LJ_ARCH_NUMMODE		LJ_NUMMODE_DUAL_SINGLE
>   
>   #if LJ_TARGET_CONSOLE
>   #define LJ_ARCH_PPC32ON64	1
> @@ -431,16 +453,13 @@
>   #error "No support for ILP32 model on ARM64"
>   #endif
>   #elif LJ_TARGET_PPC
> -#if defined(_SOFT_FLOAT) || defined(_SOFT_DOUBLE)
> -#error "No support for PowerPC CPUs without double-precision FPU"
> -#endif
>   #if !LJ_ARCH_PPC64 && LJ_ARCH_ENDIAN == LUAJIT_LE
>   #error "No support for little-endian PPC32"
>   #endif
>   #if LJ_ARCH_PPC64
>   #error "No support for PowerPC 64 bit mode (yet)"
>   #endif
> -#ifdef __NO_FPRS__
> +#if defined(__NO_FPRS__) && !defined(_SOFT_FLOAT)
>   #error "No support for PPC/e500 anymore (use LuaJIT 2.0)"
>   #endif
>   #elif LJ_TARGET_MIPS32
> diff --git a/src/lj_ccall.c b/src/lj_ccall.c
> index d39ff861..c1e12f56 100644
> --- a/src/lj_ccall.c
> +++ b/src/lj_ccall.c
> @@ -388,6 +388,24 @@
>   #define CCALL_HANDLE_COMPLEXARG \
>     /* Pass complex by value in 2 or 4 GPRs. */
>   
> +#define CCALL_HANDLE_GPR \
> +  /* Try to pass argument in GPRs. */ \
> +  if (n > 1) { \
> +    lua_assert(n == 2 || n == 4);  /* int64_t or complex (float). */ \
> +    if (ctype_isinteger(d->info) || ctype_isfp(d->info)) \
> +      ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> +    else if (ngpr + n > maxgpr) \
> +      ngpr = maxgpr;  /* Prevent reordering. */ \
> +  } \
> +  if (ngpr + n <= maxgpr) { \
> +    dp = &cc->gpr[ngpr]; \
> +    ngpr += n; \
> +    goto done; \
> +  } \
> +
> +#if LJ_ABI_SOFTFP
> +#define CCALL_HANDLE_REGARG  CCALL_HANDLE_GPR
> +#else
>   #define CCALL_HANDLE_REGARG \
>     if (isfp) {  /* Try to pass argument in FPRs. */ \
>       if (nfpr + 1 <= CCALL_NARG_FPR) { \
> @@ -396,24 +414,16 @@
>         d = ctype_get(cts, CTID_DOUBLE);  /* FPRs always hold doubles. */ \
>         goto done; \
>       } \
> -  } else {  /* Try to pass argument in GPRs. */ \
> -    if (n > 1) { \
> -      lua_assert(n == 2 || n == 4);  /* int64_t or complex (float). */ \
> -      if (ctype_isinteger(d->info)) \
> -	ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> -      else if (ngpr + n > maxgpr) \
> -	ngpr = maxgpr;  /* Prevent reordering. */ \
> -    } \
> -    if (ngpr + n <= maxgpr) { \
> -      dp = &cc->gpr[ngpr]; \
> -      ngpr += n; \
> -      goto done; \
> -    } \
> +  } else { \
> +    CCALL_HANDLE_GPR \
>     }
> +#endif
>   
> +#if !LJ_ABI_SOFTFP
>   #define CCALL_HANDLE_RET \
>     if (ctype_isfp(ctr->info) && ctr->size == sizeof(float)) \
>       ctr = ctype_get(cts, CTID_DOUBLE);  /* FPRs always hold doubles. */
> +#endif
>   
>   #elif LJ_TARGET_MIPS32
>   /* -- MIPS o32 calling conventions ---------------------------------------- */
> @@ -1081,7 +1091,7 @@ static int ccall_set_args(lua_State *L, CTState *cts, CType *ct,
>     }
>     if (fid) lj_err_caller(L, LJ_ERR_FFI_NUMARG);  /* Too few arguments. */
>   
> -#if LJ_TARGET_X64 || LJ_TARGET_PPC
> +#if LJ_TARGET_X64 || (LJ_TARGET_PPC && !LJ_ABI_SOFTFP)
>     cc->nfpr = nfpr;  /* Required for vararg functions. */
>   #endif
>     cc->nsp = nsp;
> diff --git a/src/lj_ccall.h b/src/lj_ccall.h
> index 59f66481..6efa48c7 100644
> --- a/src/lj_ccall.h
> +++ b/src/lj_ccall.h
> @@ -86,9 +86,9 @@ typedef union FPRArg {
>   #elif LJ_TARGET_PPC
>   
>   #define CCALL_NARG_GPR		8
> -#define CCALL_NARG_FPR		8
> +#define CCALL_NARG_FPR		(LJ_ABI_SOFTFP ? 0 : 8)
>   #define CCALL_NRET_GPR		4	/* For complex double. */
> -#define CCALL_NRET_FPR		1
> +#define CCALL_NRET_FPR		(LJ_ABI_SOFTFP ? 0 : 1)
>   #define CCALL_SPS_EXTRA		4
>   #define CCALL_SPS_FREE		0
>   
> diff --git a/src/lj_ccallback.c b/src/lj_ccallback.c
> index 224b6b94..c33190d7 100644
> --- a/src/lj_ccallback.c
> +++ b/src/lj_ccallback.c
> @@ -419,6 +419,23 @@ void lj_ccallback_mcode_free(CTState *cts)
>   
>   #elif LJ_TARGET_PPC
>   
> +#define CALLBACK_HANDLE_GPR \
> +  if (n > 1) { \
> +    lua_assert(((LJ_ABI_SOFTFP && ctype_isnum(cta->info)) ||  /* double. */ \
> +		ctype_isinteger(cta->info)) && n == 2);  /* int64_t. */ \
> +    ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> +  } \
> +  if (ngpr + n <= maxgpr) { \
> +    sp = &cts->cb.gpr[ngpr]; \
> +    ngpr += n; \
> +    goto done; \
> +  }
> +
> +#if LJ_ABI_SOFTFP
> +#define CALLBACK_HANDLE_REGARG \
> +  CALLBACK_HANDLE_GPR \
> +  UNUSED(isfp);
> +#else
>   #define CALLBACK_HANDLE_REGARG \
>     if (isfp) { \
>       if (nfpr + 1 <= CCALL_NARG_FPR) { \
> @@ -427,20 +444,15 @@ void lj_ccallback_mcode_free(CTState *cts)
>         goto done; \
>       } \
>     } else {  /* Try to pass argument in GPRs. */ \
> -    if (n > 1) { \
> -      lua_assert(ctype_isinteger(cta->info) && n == 2);  /* int64_t. */ \
> -      ngpr = (ngpr + 1u) & ~1u;  /* Align int64_t to regpair. */ \
> -    } \
> -    if (ngpr + n <= maxgpr) { \
> -      sp = &cts->cb.gpr[ngpr]; \
> -      ngpr += n; \
> -      goto done; \
> -    } \
> +    CALLBACK_HANDLE_GPR \
>     }
> +#endif
>   
> +#if !LJ_ABI_SOFTFP
>   #define CALLBACK_HANDLE_RET \
>     if (ctype_isfp(ctr->info) && ctr->size == sizeof(float)) \
>       *(double *)dp = *(float *)dp;  /* FPRs always hold doubles. */
> +#endif
>   
>   #elif LJ_TARGET_MIPS32
>   
> diff --git a/src/lj_frame.h b/src/lj_frame.h
> index 2bdf3c48..5cb3d639 100644
> --- a/src/lj_frame.h
> +++ b/src/lj_frame.h
> @@ -226,7 +226,7 @@ enum { LJ_CONT_TAILCALL, LJ_CONT_FFI_CALLBACK };  /* Special continuations. */
>   #define CFRAME_OFS_L		36
>   #define CFRAME_OFS_PC		32
>   #define CFRAME_OFS_MULTRES	28
> -#define CFRAME_SIZE		272
> +#define CFRAME_SIZE		(LJ_ARCH_HASFPU ? 272 : 128)
>   #define CFRAME_SHIFT_MULTRES	3
>   #endif
>   #elif LJ_TARGET_MIPS32
> diff --git a/src/lj_ircall.h b/src/lj_ircall.h
> index c1ac29d1..bbad35b1 100644
> --- a/src/lj_ircall.h
> +++ b/src/lj_ircall.h
> @@ -291,7 +291,7 @@ LJ_DATA const CCallInfo lj_ir_callinfo[IRCALL__MAX+1];
>   #define fp64_f2l __aeabi_f2lz
>   #define fp64_f2ul __aeabi_f2ulz
>   #endif
> -#elif LJ_TARGET_MIPS
> +#elif LJ_TARGET_MIPS || LJ_TARGET_PPC
>   #define softfp_add __adddf3
>   #define softfp_sub __subdf3
>   #define softfp_mul __muldf3
> diff --git a/src/vm_ppc.dasc b/src/vm_ppc.dasc
> index 7ad8df37..980ad897 100644
> --- a/src/vm_ppc.dasc
> +++ b/src/vm_ppc.dasc
> @@ -103,6 +103,18 @@
>   |// Fixed register assignments for the interpreter.
>   |// Don't use: r1 = sp, r2 and r13 = reserved (TOC, TLS or SDATA)
>   |
> +|.macro .FPU, a, b
> +|.if FPU
> +|  a, b
> +|.endif
> +|.endmacro
> +|
> +|.macro .FPU, a, b, c
> +|.if FPU
> +|  a, b, c
> +|.endif
> +|.endmacro
> +|
>   |// The following must be C callee-save (but BASE is often refetched).
>   |.define BASE,		r14	// Base of current Lua stack frame.
>   |.define KBASE,		r15	// Constants of current Lua function.
> @@ -116,8 +128,10 @@
>   |.define TISNUM,	r22
>   |.define TISNIL,	r23
>   |.define ZERO,		r24
> +|.if FPU
>   |.define TOBIT,		f30	// 2^52 + 2^51.
>   |.define TONUM,		f31	// 2^52 + 2^51 + 2^31.
> +|.endif
>   |
>   |// The following temporaries are not saved across C calls, except for RA.
>   |.define RA,		r20	// Callee-save.
> @@ -133,6 +147,7 @@
>   |
>   |// Saved temporaries.
>   |.define SAVE0,		r21
> +|.define SAVE1,		r25
>   |
>   |// Calling conventions.
>   |.define CARG1,		r3
> @@ -141,8 +156,10 @@
>   |.define CARG4,		r6	// Overlaps TMP3.
>   |.define CARG5,		r7	// Overlaps INS.
>   |
> +|.if FPU
>   |.define FARG1,		f1
>   |.define FARG2,		f2
> +|.endif
>   |
>   |.define CRET1,		r3
>   |.define CRET2,		r4
> @@ -213,10 +230,16 @@
>   |.endif
>   |.else
>   |
> +|.if FPU
>   |.define SAVE_LR,	276(sp)
>   |.define CFRAME_SPACE,	272     // Delta for sp.
>   |// Back chain for sp:	272(sp) <-- sp entering interpreter
>   |.define SAVE_FPR_,	128     // .. 128+18*8: 64 bit FPR saves.
> +|.else
> +|.define SAVE_LR,	132(sp)
> +|.define CFRAME_SPACE,	128     // Delta for sp.
> +|// Back chain for sp:	128(sp) <-- sp entering interpreter
> +|.endif
>   |.define SAVE_GPR_,	56      // .. 56+18*4: 32 bit GPR saves.
>   |.define SAVE_CR,	52(sp)  // 32 bit CR save.
>   |.define SAVE_ERRF,	48(sp)  // 32 bit C frame info.
> @@ -226,16 +249,25 @@
>   |.define SAVE_PC,	32(sp)
>   |.define SAVE_MULTRES,	28(sp)
>   |.define UNUSED1,	24(sp)
> +|.if FPU
>   |.define TMPD_LO,	20(sp)
>   |.define TMPD_HI,	16(sp)
>   |.define TONUM_LO,	12(sp)
>   |.define TONUM_HI,	8(sp)
> +|.else
> +|.define SFSAVE_4,	20(sp)
> +|.define SFSAVE_3,	16(sp)
> +|.define SFSAVE_2,	12(sp)
> +|.define SFSAVE_1,	8(sp)
> +|.endif
>   |// Next frame lr:	4(sp)
>   |// Back chain for sp:	0(sp)	<-- sp while in interpreter
>   |
> +|.if FPU
>   |.define TMPD_BLO,	23(sp)
>   |.define TMPD,		TMPD_HI
>   |.define TONUM_D,	TONUM_HI
> +|.endif
>   |
>   |.endif
>   |
> @@ -245,7 +277,7 @@
>   |.else
>   |  stw r..reg, SAVE_GPR_+(reg-14)*4(sp)
>   |.endif
> -|  stfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
> +|  .FPU stfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
>   |.endmacro
>   |.macro rest_, reg
>   |.if GPR64
> @@ -253,7 +285,7 @@
>   |.else
>   |  lwz r..reg, SAVE_GPR_+(reg-14)*4(sp)
>   |.endif
> -|  lfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
> +|  .FPU lfd f..reg, SAVE_FPR_+(reg-14)*8(sp)
>   |.endmacro
>   |
>   |.macro saveregs
> @@ -323,6 +355,7 @@
>   |// Trap for not-yet-implemented parts.
>   |.macro NYI; tw 4, sp, sp; .endmacro
>   |
> +|.if FPU
>   |// int/FP conversions.
>   |.macro tonum_i, freg, reg
>   |  xoris reg, reg, 0x8000
> @@ -346,6 +379,7 @@
>   |.macro toint, reg, freg
>   |  toint reg, freg, freg
>   |.endmacro
> +|.endif
>   |
>   |//-----------------------------------------------------------------------
>   |
> @@ -533,9 +567,19 @@ static void build_subroutines(BuildCtx *ctx)
>     |  beq >2
>     |1:
>     |  addic. TMP1, TMP1, -8
> +  |.if FPU
>     |   lfd f0, 0(RA)
> +  |.else
> +  |   lwz CARG1, 0(RA)
> +  |   lwz CARG2, 4(RA)
> +  |.endif
>     |    addi RA, RA, 8
> +  |.if FPU
>     |   stfd f0, 0(BASE)
> +  |.else
> +  |   stw CARG1, 0(BASE)
> +  |   stw CARG2, 4(BASE)
> +  |.endif
>     |    addi BASE, BASE, 8
>     |  bney <1
>     |
> @@ -613,23 +657,23 @@ static void build_subroutines(BuildCtx *ctx)
>     |  .toc ld TOCREG, SAVE_TOC
>     |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>     |  lp BASE, L->base
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>     |   lwz DISPATCH, L->glref		// Setup pointer to dispatch table.
>     |     li ZERO, 0
> -  |     stw TMP3, TMPD
> +  |     .FPU stw TMP3, TMPD
>     |  li TMP1, LJ_TFALSE
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
>     |     li TISNIL, LJ_TNIL
>     |    li_vmstate INTERP
> -  |     lfs TOBIT, TMPD
> +  |     .FPU lfs TOBIT, TMPD
>     |  lwz PC, FRAME_PC(BASE)		// Fetch PC of previous frame.
>     |  la RA, -8(BASE)			// Results start at BASE-8.
> -  |     stw TMP3, TMPD
> +  |     .FPU stw TMP3, TMPD
>     |   addi DISPATCH, DISPATCH, GG_G2DISP
>     |  stw TMP1, 0(RA)			// Prepend false to error message.
>     |  li RD, 16				// 2 results: false + error message.
>     |    st_vmstate
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>     |  b ->vm_returnc
>     |
>     |//-----------------------------------------------------------------------
> @@ -690,22 +734,22 @@ static void build_subroutines(BuildCtx *ctx)
>     |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>     |   lp TMP1, L->top
>     |  lwz PC, FRAME_PC(BASE)
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>     |    stb CARG3, L->status
> -  |     stw TMP3, TMPD
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |     lfs TOBIT, TMPD
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU lfs TOBIT, TMPD
>     |   sub RD, TMP1, BASE
> -  |     stw TMP3, TMPD
> -  |     lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
>     |   addi RD, RD, 8
> -  |     stw TMP0, TONUM_HI
> +  |     .FPU stw TMP0, TONUM_HI
>     |    li_vmstate INTERP
>     |     li ZERO, 0
>     |    st_vmstate
>     |  andix. TMP0, PC, FRAME_TYPE
>     |   mr MULTRES, RD
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>     |     li TISNIL, LJ_TNIL
>     |  beq ->BC_RET_Z
>     |  b ->vm_return
> @@ -739,19 +783,19 @@ static void build_subroutines(BuildCtx *ctx)
>     |  lp TMP2, L->base			// TMP2 = old base (used in vmeta_call).
>     |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>     |   lp TMP1, L->top
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>     |  add PC, PC, BASE
> -  |     stw TMP3, TMPD
> +  |     .FPU stw TMP3, TMPD
>     |     li ZERO, 0
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |     lfs TOBIT, TMPD
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU lfs TOBIT, TMPD
>     |  sub PC, PC, TMP2			// PC = frame delta + frame type
> -  |     stw TMP3, TMPD
> -  |     lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
>     |   sub NARGS8:RC, TMP1, BASE
> -  |     stw TMP0, TONUM_HI
> +  |     .FPU stw TMP0, TONUM_HI
>     |    li_vmstate INTERP
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>     |     li TISNIL, LJ_TNIL
>     |    st_vmstate
>     |
> @@ -839,15 +883,30 @@ static void build_subroutines(BuildCtx *ctx)
>     |  lwz INS, -4(PC)
>     |   subi CARG2, RB, 16
>     |  decode_RB8 SAVE0, INS
> +  |.if FPU
>     |   lfd f0, 0(RA)
> +  |.else
> +  |   lwz TMP2, 0(RA)
> +  |   lwz TMP3, 4(RA)
> +  |.endif
>     |  add TMP1, BASE, SAVE0
>     |   stp BASE, L->base
>     |  cmplw TMP1, CARG2
>     |   sub CARG3, CARG2, TMP1
>     |  decode_RA8 RA, INS
> +  |.if FPU
>     |   stfd f0, 0(CARG2)
> +  |.else
> +  |   stw TMP2, 0(CARG2)
> +  |   stw TMP3, 4(CARG2)
> +  |.endif
>     |  bney ->BC_CAT_Z
> +  |.if FPU
>     |   stfdx f0, BASE, RA
> +  |.else
> +  |   stwux TMP2, RA, BASE
> +  |   stw TMP3, 4(RA)
> +  |.endif
>     |  b ->cont_nop
>     |
>     |//-- Table indexing metamethods -----------------------------------------
> @@ -900,9 +959,19 @@ static void build_subroutines(BuildCtx *ctx)
>     |  // Returns TValue * (finished) or NULL (metamethod).
>     |  cmplwi CRET1, 0
>     |  beq >3
> +  |.if FPU
>     |   lfd f0, 0(CRET1)
> +  |.else
> +  |   lwz TMP0, 0(CRET1)
> +  |   lwz TMP1, 4(CRET1)
> +  |.endif
>     |  ins_next1
> +  |.if FPU
>     |   stfdx f0, BASE, RA
> +  |.else
> +  |   stwux TMP0, RA, BASE
> +  |   stw TMP1, 4(RA)
> +  |.endif
>     |  ins_next2
>     |
>     |3:  // Call __index metamethod.
> @@ -920,7 +989,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |  // Returns cTValue * or NULL.
>     |  cmplwi CRET1, 0
>     |  beq >1
> +  |.if FPU
>     |  lfd f14, 0(CRET1)
> +  |.else
> +  |  lwz SAVE0, 0(CRET1)
> +  |  lwz SAVE1, 4(CRET1)
> +  |.endif
>     |  b ->BC_TGETR_Z
>     |1:
>     |  stwx TISNIL, BASE, RA
> @@ -975,11 +1049,21 @@ static void build_subroutines(BuildCtx *ctx)
>     |  bl extern lj_meta_tset		// (lua_State *L, TValue *o, TValue *k)
>     |  // Returns TValue * (finished) or NULL (metamethod).
>     |  cmplwi CRET1, 0
> +  |.if FPU
>     |   lfdx f0, BASE, RA
> +  |.else
> +  |   lwzux TMP2, RA, BASE
> +  |   lwz TMP3, 4(RA)
> +  |.endif
>     |  beq >3
>     |  // NOBARRIER: lj_meta_tset ensures the table is not black.
>     |  ins_next1
> +  |.if FPU
>     |   stfd f0, 0(CRET1)
> +  |.else
> +  |   stw TMP2, 0(CRET1)
> +  |   stw TMP3, 4(CRET1)
> +  |.endif
>     |  ins_next2
>     |
>     |3:  // Call __newindex metamethod.
> @@ -990,7 +1074,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |   add PC, TMP1, BASE
>     |  lwz LFUNC:RB, FRAME_FUNC(BASE)	// Guaranteed to be a function here.
>     |   li NARGS8:RC, 24			// 3 args for func(t, k, v)
> +  |.if FPU
>     |  stfd f0, 16(BASE)			// Copy value to third argument.
> +  |.else
> +  |  stw TMP2, 16(BASE)
> +  |  stw TMP3, 20(BASE)
> +  |.endif
>     |  b ->vm_call_dispatch_f
>     |
>     |->vmeta_tsetr:
> @@ -999,7 +1088,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |  stw PC, SAVE_PC
>     |  bl extern lj_tab_setinth  // (lua_State *L, GCtab *t, int32_t key)
>     |  // Returns TValue *.
> +  |.if FPU
>     |  stfd f14, 0(CRET1)
> +  |.else
> +  |  stw SAVE0, 0(CRET1)
> +  |  stw SAVE1, 4(CRET1)
> +  |.endif
>     |  b ->cont_nop
>     |
>     |//-- Comparison metamethods ---------------------------------------------
> @@ -1038,9 +1132,19 @@ static void build_subroutines(BuildCtx *ctx)
>     |
>     |->cont_ra:				// RA = resultptr
>     |  lwz INS, -4(PC)
> +  |.if FPU
>     |   lfd f0, 0(RA)
> +  |.else
> +  |   lwz CARG1, 0(RA)
> +  |   lwz CARG2, 4(RA)
> +  |.endif
>     |  decode_RA8 TMP1, INS
> +  |.if FPU
>     |   stfdx f0, BASE, TMP1
> +  |.else
> +  |   stwux CARG1, TMP1, BASE
> +  |   stw CARG2, 4(TMP1)
> +  |.endif
>     |  b ->cont_nop
>     |
>     |->cont_condt:			// RA = resultptr
> @@ -1246,22 +1350,32 @@ static void build_subroutines(BuildCtx *ctx)
>     |.macro .ffunc_n, name
>     |->ff_ .. name:
>     |  cmplwi NARGS8:RC, 8
> -  |   lwz CARG3, 0(BASE)
> +  |   lwz CARG1, 0(BASE)
> +  |.if FPU
>     |    lfd FARG1, 0(BASE)
> +  |.else
> +  |    lwz CARG2, 4(BASE)
> +  |.endif
>     |  blt ->fff_fallback
> -  |  checknum CARG3; bge ->fff_fallback
> +  |  checknum CARG1; bge ->fff_fallback
>     |.endmacro
>     |
>     |.macro .ffunc_nn, name
>     |->ff_ .. name:
>     |  cmplwi NARGS8:RC, 16
> -  |   lwz CARG3, 0(BASE)
> +  |   lwz CARG1, 0(BASE)
> +  |.if FPU
>     |    lfd FARG1, 0(BASE)
> -  |   lwz CARG4, 8(BASE)
> +  |   lwz CARG3, 8(BASE)
>     |    lfd FARG2, 8(BASE)
> +  |.else
> +  |    lwz CARG2, 4(BASE)
> +  |   lwz CARG3, 8(BASE)
> +  |    lwz CARG4, 12(BASE)
> +  |.endif
>     |  blt ->fff_fallback
> +  |  checknum CARG1; bge ->fff_fallback
>     |  checknum CARG3; bge ->fff_fallback
> -  |  checknum CARG4; bge ->fff_fallback
>     |.endmacro
>     |
>     |// Inlined GC threshold check. Caveat: uses TMP0 and TMP1.
> @@ -1282,14 +1396,21 @@ static void build_subroutines(BuildCtx *ctx)
>     |  bge cr1, ->fff_fallback
>     |   stw CARG3, 0(RA)
>     |  addi RD, NARGS8:RC, 8		// Compute (nresults+1)*8.
> +  |  addi TMP1, BASE, 8
> +  |  add TMP2, RA, NARGS8:RC
>     |   stw CARG1, 4(RA)
>     |  beq ->fff_res			// Done if exactly 1 argument.
> -  |  li TMP1, 8
> -  |  subi RC, RC, 8
>     |1:
> -  |  cmplw TMP1, RC
> -  |   lfdx f0, BASE, TMP1
> -  |   stfdx f0, RA, TMP1
> +  |  cmplw TMP1, TMP2
> +  |.if FPU
> +  |   lfd f0, 0(TMP1)
> +  |   stfd f0, 0(TMP1)
> +  |.else
> +  |   lwz CARG1, 0(TMP1)
> +  |   lwz CARG2, 4(TMP1)
> +  |   stw CARG1, -8(TMP1)
> +  |   stw CARG2, -4(TMP1)
> +  |.endif
>     |    addi TMP1, TMP1, 8
>     |  bney <1
>     |  b ->fff_res
> @@ -1304,8 +1425,14 @@ static void build_subroutines(BuildCtx *ctx)
>     |  orc TMP1, TMP2, TMP0
>     |  addi TMP1, TMP1, ~LJ_TISNUM+1
>     |  slwi TMP1, TMP1, 3
> +  |.if FPU
>     |   la TMP2, CFUNC:RB->upvalue
>     |  lfdx FARG1, TMP2, TMP1
> +  |.else
> +  |  add TMP1, CFUNC:RB, TMP1
> +  |  lwz CARG1, CFUNC:TMP1->upvalue[0].u32.hi
> +  |  lwz CARG2, CFUNC:TMP1->upvalue[0].u32.lo
> +  |.endif
>     |  b ->fff_resn
>     |
>     |//-- Base library: getters and setters ---------------------------------
> @@ -1383,7 +1510,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |   mr CARG1, L
>     |  bl extern lj_tab_get  // (lua_State *L, GCtab *t, cTValue *key)
>     |  // Returns cTValue *.
> +  |.if FPU
>     |  lfd FARG1, 0(CRET1)
> +  |.else
> +  |  lwz CARG2, 4(CRET1)
> +  |  lwz CARG1, 0(CRET1)	// Caveat: CARG1 == CRET1.
> +  |.endif
>     |  b ->fff_resn
>     |
>     |//-- Base library: conversions ------------------------------------------
> @@ -1392,7 +1524,11 @@ static void build_subroutines(BuildCtx *ctx)
>     |  // Only handles the number case inline (without a base argument).
>     |  cmplwi NARGS8:RC, 8
>     |   lwz CARG1, 0(BASE)
> +  |.if FPU
>     |    lfd FARG1, 0(BASE)
> +  |.else
> +  |    lwz CARG2, 4(BASE)
> +  |.endif
>     |  bne ->fff_fallback			// Exactly one argument.
>     |   checknum CARG1; bgt ->fff_fallback
>     |  b ->fff_resn
> @@ -1443,12 +1579,23 @@ static void build_subroutines(BuildCtx *ctx)
>     |  cmplwi CRET1, 0
>     |   li CARG3, LJ_TNIL
>     |  beq ->fff_restv			// End of traversal: return nil.
> -  |  lfd f0, 8(BASE)			// Copy key and value to results.
>     |   la RA, -8(BASE)
> +  |.if FPU
> +  |  lfd f0, 8(BASE)			// Copy key and value to results.
>     |  lfd f1, 16(BASE)
>     |  stfd f0, 0(RA)
> -  |   li RD, (2+1)*8
>     |  stfd f1, 8(RA)
> +  |.else
> +  |  lwz CARG1, 8(BASE)
> +  |  lwz CARG2, 12(BASE)
> +  |  lwz CARG3, 16(BASE)
> +  |  lwz CARG4, 20(BASE)
> +  |  stw CARG1, 0(RA)
> +  |  stw CARG2, 4(RA)
> +  |  stw CARG3, 8(RA)
> +  |  stw CARG4, 12(RA)
> +  |.endif
> +  |   li RD, (2+1)*8
>     |  b ->fff_res
>     |
>     |.ffunc_1 pairs
> @@ -1457,17 +1604,32 @@ static void build_subroutines(BuildCtx *ctx)
>     |  bne ->fff_fallback
>   #if LJ_52
>     |   lwz TAB:TMP2, TAB:CARG1->metatable
> +  |.if FPU
>     |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>     |   cmplwi TAB:TMP2, 0
>     |  la RA, -8(BASE)
>     |   bne ->fff_fallback
>   #else
> +  |.if FPU
>     |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>     |  la RA, -8(BASE)
>   #endif
>     |   stw TISNIL, 8(BASE)
>     |  li RD, (3+1)*8
> +  |.if FPU
>     |  stfd f0, 0(RA)
> +  |.else
> +  |  stw TMP0, 0(RA)
> +  |  stw TMP1, 4(RA)
> +  |.endif
>     |  b ->fff_res
>     |
>     |.ffunc ipairs_aux
> @@ -1513,14 +1675,24 @@ static void build_subroutines(BuildCtx *ctx)
>     |  stfd FARG2, 0(RA)
>     |.endif
>     |  ble >2				// Not in array part?
> +  |.if FPU
>     |  lwzx TMP2, TMP1, TMP3
>     |  lfdx f0, TMP1, TMP3
> +  |.else
> +  |  lwzux TMP2, TMP1, TMP3
> +  |  lwz TMP3, 4(TMP1)
> +  |.endif
>     |1:
>     |  checknil TMP2
>     |   li RD, (0+1)*8
>     |  beq ->fff_res			// End of iteration, return 0 results.
>     |   li RD, (2+1)*8
> +  |.if FPU
>     |  stfd f0, 8(RA)
> +  |.else
> +  |  stw TMP2, 8(RA)
> +  |  stw TMP3, 12(RA)
> +  |.endif
>     |  b ->fff_res
>     |2:  // Check for empty hash part first. Otherwise call C function.
>     |  lwz TMP0, TAB:CARG1->hmask
> @@ -1534,7 +1706,11 @@ static void build_subroutines(BuildCtx *ctx)
>     |   li RD, (0+1)*8
>     |  beq ->fff_res
>     |  lwz TMP2, 0(CRET1)
> +  |.if FPU
>     |  lfd f0, 0(CRET1)
> +  |.else
> +  |  lwz TMP3, 4(CRET1)
> +  |.endif
>     |  b <1
>     |
>     |.ffunc_1 ipairs
> @@ -1543,12 +1719,22 @@ static void build_subroutines(BuildCtx *ctx)
>     |  bne ->fff_fallback
>   #if LJ_52
>     |   lwz TAB:TMP2, TAB:CARG1->metatable
> +  |.if FPU
>     |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>     |   cmplwi TAB:TMP2, 0
>     |  la RA, -8(BASE)
>     |   bne ->fff_fallback
>   #else
> +  |.if FPU
>     |  lfd f0, CFUNC:RB->upvalue[0]
> +  |.else
> +  |  lwz TMP0, CFUNC:RB->upvalue[0].u32.hi
> +  |  lwz TMP1, CFUNC:RB->upvalue[0].u32.lo
> +  |.endif
>     |  la RA, -8(BASE)
>   #endif
>     |.if DUALNUM
> @@ -1558,7 +1744,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |.endif
>     |   stw ZERO, 12(BASE)
>     |  li RD, (3+1)*8
> +  |.if FPU
>     |  stfd f0, 0(RA)
> +  |.else
> +  |  stw TMP0, 0(RA)
> +  |  stw TMP1, 4(RA)
> +  |.endif
>     |  b ->fff_res
>     |
>     |//-- Base library: catch errors ----------------------------------------
> @@ -1577,19 +1768,32 @@ static void build_subroutines(BuildCtx *ctx)
>     |
>     |.ffunc xpcall
>     |  cmplwi NARGS8:RC, 16
> -  |   lwz CARG4, 8(BASE)
> +  |   lwz CARG3, 8(BASE)
> +  |.if FPU
>     |    lfd FARG2, 8(BASE)
>     |    lfd FARG1, 0(BASE)
> +  |.else
> +  |    lwz CARG1, 0(BASE)
> +  |    lwz CARG2, 4(BASE)
> +  |    lwz CARG4, 12(BASE)
> +  |.endif
>     |  blt ->fff_fallback
>     |  lbz TMP1, DISPATCH_GL(hookmask)(DISPATCH)
>     |   mr TMP2, BASE
> -  |  checkfunc CARG4; bne ->fff_fallback  // Traceback must be a function.
> +  |  checkfunc CARG3; bne ->fff_fallback  // Traceback must be a function.
>     |   la BASE, 16(BASE)
>     |  // Remember active hook before pcall.
>     |  rlwinm TMP1, TMP1, 32-HOOK_ACTIVE_SHIFT, 31, 31
> +  |.if FPU
>     |    stfd FARG2, 0(TMP2)		// Swap function and traceback.
> -  |  subi NARGS8:RC, NARGS8:RC, 16
>     |    stfd FARG1, 8(TMP2)
> +  |.else
> +  |    stw CARG3, 0(TMP2)
> +  |    stw CARG4, 4(TMP2)
> +  |    stw CARG1, 8(TMP2)
> +  |    stw CARG2, 12(TMP2)
> +  |.endif
> +  |  subi NARGS8:RC, NARGS8:RC, 16
>     |  addi PC, TMP1, 16+FRAME_PCALL
>     |  b ->vm_call_dispatch
>     |
> @@ -1632,9 +1836,21 @@ static void build_subroutines(BuildCtx *ctx)
>     |  stp BASE, L->top
>     |2:  // Move args to coroutine.
>     |  cmpw TMP1, NARGS8:RC
> +  |.if FPU
>     |   lfdx f0, BASE, TMP1
> +  |.else
> +  |   add CARG3, BASE, TMP1
> +  |   lwz TMP2, 0(CARG3)
> +  |   lwz TMP3, 4(CARG3)
> +  |.endif
>     |  beq >3
> +  |.if FPU
>     |   stfdx f0, CARG2, TMP1
> +  |.else
> +  |   add CARG3, CARG2, TMP1
> +  |   stw TMP2, 0(CARG3)
> +  |   stw TMP3, 4(CARG3)
> +  |.endif
>     |  addi TMP1, TMP1, 8
>     |  b <2
>     |3:
> @@ -1665,8 +1881,17 @@ static void build_subroutines(BuildCtx *ctx)
>     |   stp TMP2, L:SAVE0->top		// Clear coroutine stack.
>     |5:  // Move results from coroutine.
>     |  cmplw TMP1, TMP3
> +  |.if FPU
>     |   lfdx f0, TMP2, TMP1
>     |   stfdx f0, BASE, TMP1
> +  |.else
> +  |   add CARG3, TMP2, TMP1
> +  |   lwz CARG1, 0(CARG3)
> +  |   lwz CARG2, 4(CARG3)
> +  |   add CARG3, BASE, TMP1
> +  |   stw CARG1, 0(CARG3)
> +  |   stw CARG2, 4(CARG3)
> +  |.endif
>     |    addi TMP1, TMP1, 8
>     |  bne <5
>     |6:
> @@ -1691,12 +1916,22 @@ static void build_subroutines(BuildCtx *ctx)
>     |  andix. TMP0, PC, FRAME_TYPE
>     |  la TMP3, -8(TMP3)
>     |   li TMP1, LJ_TFALSE
> +  |.if FPU
>     |  lfd f0, 0(TMP3)
> +  |.else
> +  |  lwz CARG1, 0(TMP3)
> +  |  lwz CARG2, 4(TMP3)
> +  |.endif
>     |   stp TMP3, L:SAVE0->top		// Remove error from coroutine stack.
>     |    li RD, (2+1)*8
>     |   stw TMP1, -8(BASE)		// Prepend false to results.
>     |    la RA, -8(BASE)
> +  |.if FPU
>     |  stfd f0, 0(BASE)			// Copy error message.
> +  |.else
> +  |  stw CARG1, 0(BASE)			// Copy error message.
> +  |  stw CARG2, 4(BASE)
> +  |.endif
>     |  b <7
>     |.else
>     |  mr CARG1, L
> @@ -1875,7 +2110,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |  lus CARG1, 0x8000			// -(2^31).
>     |  beqy ->fff_resi
>     |5:
> +  |.if FPU
>     |  lfd FARG1, 0(BASE)
> +  |.else
> +  |  lwz CARG1, 0(BASE)
> +  |  lwz CARG2, 4(BASE)
> +  |.endif
>     |  blex func
>     |  b ->fff_resn
>     |.endmacro
> @@ -1899,10 +2139,14 @@ static void build_subroutines(BuildCtx *ctx)
>     |
>     |.ffunc math_log
>     |  cmplwi NARGS8:RC, 8
> -  |   lwz CARG3, 0(BASE)
> -  |    lfd FARG1, 0(BASE)
> +  |   lwz CARG1, 0(BASE)
>     |  bne ->fff_fallback			// Need exactly 1 argument.
> -  |  checknum CARG3; bge ->fff_fallback
> +  |  checknum CARG1; bge ->fff_fallback
> +  |.if FPU
> +  |  lfd FARG1, 0(BASE)
> +  |.else
> +  |  lwz CARG2, 4(BASE)
> +  |.endif
>     |  blex log
>     |  b ->fff_resn
>     |
> @@ -1924,17 +2168,24 @@ static void build_subroutines(BuildCtx *ctx)
>     |.if DUALNUM
>     |.ffunc math_ldexp
>     |  cmplwi NARGS8:RC, 16
> -  |   lwz CARG3, 0(BASE)
> +  |   lwz TMP0, 0(BASE)
> +  |.if FPU
>     |    lfd FARG1, 0(BASE)
> -  |   lwz CARG4, 8(BASE)
> +  |.else
> +  |    lwz CARG1, 0(BASE)
> +  |    lwz CARG2, 4(BASE)
> +  |.endif
> +  |   lwz TMP1, 8(BASE)
>     |.if GPR64
>     |    lwz CARG2, 12(BASE)
> -  |.else
> +  |.elif FPU
>     |    lwz CARG1, 12(BASE)
> +  |.else
> +  |    lwz CARG3, 12(BASE)
>     |.endif
>     |  blt ->fff_fallback
> -  |  checknum CARG3; bge ->fff_fallback
> -  |  checknum CARG4; bne ->fff_fallback
> +  |  checknum TMP0; bge ->fff_fallback
> +  |  checknum TMP1; bne ->fff_fallback
>     |.else
>     |.ffunc_nn math_ldexp
>     |.if GPR64
> @@ -1949,8 +2200,10 @@ static void build_subroutines(BuildCtx *ctx)
>     |.ffunc_n math_frexp
>     |.if GPR64
>     |  la CARG2, DISPATCH_GL(tmptv)(DISPATCH)
> -  |.else
> +  |.elif FPU
>     |  la CARG1, DISPATCH_GL(tmptv)(DISPATCH)
> +  |.else
> +  |  la CARG3, DISPATCH_GL(tmptv)(DISPATCH)
>     |.endif
>     |   lwz PC, FRAME_PC(BASE)
>     |  blex frexp
> @@ -1959,7 +2212,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |.if not DUALNUM
>     |   tonum_i FARG2, TMP1
>     |.endif
> +  |.if FPU
>     |  stfd FARG1, 0(RA)
> +  |.else
> +  |  stw CRET1, 0(RA)
> +  |  stw CRET2, 4(RA)
> +  |.endif
>     |  li RD, (2+1)*8
>     |.if DUALNUM
>     |   stw TISNUM, 8(RA)
> @@ -1972,13 +2230,20 @@ static void build_subroutines(BuildCtx *ctx)
>     |.ffunc_n math_modf
>     |.if GPR64
>     |  la CARG2, -8(BASE)
> -  |.else
> +  |.elif FPU
>     |  la CARG1, -8(BASE)
> +  |.else
> +  |  la CARG3, -8(BASE)
>     |.endif
>     |   lwz PC, FRAME_PC(BASE)
>     |  blex modf
>     |   la RA, -8(BASE)
> +  |.if FPU
>     |  stfd FARG1, 0(BASE)
> +  |.else
> +  |  stw CRET1, 0(BASE)
> +  |  stw CRET2, 4(BASE)
> +  |.endif
>     |  li RD, (2+1)*8
>     |  b ->fff_res
>     |
> @@ -1986,13 +2251,13 @@ static void build_subroutines(BuildCtx *ctx)
>     |.if DUALNUM
>     |  .ffunc_1 name
>     |  checknum CARG3
> -  |   addi TMP1, BASE, 8
> -  |   add TMP2, BASE, NARGS8:RC
> +  |   addi SAVE0, BASE, 8
> +  |   add SAVE1, BASE, NARGS8:RC
>     |  bne >4
>     |1:  // Handle integers.
> -  |  lwz CARG4, 0(TMP1)
> -  |   cmplw cr1, TMP1, TMP2
> -  |  lwz CARG2, 4(TMP1)
> +  |  lwz CARG4, 0(SAVE0)
> +  |   cmplw cr1, SAVE0, SAVE1
> +  |  lwz CARG2, 4(SAVE0)
>     |   bge cr1, ->fff_resi
>     |  checknum CARG4
>     |   xoris TMP0, CARG1, 0x8000
> @@ -2009,36 +2274,76 @@ static void build_subroutines(BuildCtx *ctx)
>     |.if GPR64
>     |  rldicl CARG1, CARG1, 0, 32
>     |.endif
> -  |   addi TMP1, TMP1, 8
> +  |   addi SAVE0, SAVE0, 8
>     |  b <1
>     |3:
>     |  bge ->fff_fallback
>     |  // Convert intermediate result to number and continue below.
> +  |.if FPU
>     |  tonum_i FARG1, CARG1
> -  |  lfd FARG2, 0(TMP1)
> +  |  lfd FARG2, 0(SAVE0)
> +  |.else
> +  |  mr CARG2, CARG1
> +  |  bl ->vm_sfi2d_1
> +  |  lwz CARG3, 0(SAVE0)
> +  |  lwz CARG4, 4(SAVE0)
> +  |.endif
>     |  b >6
>     |4:
> +  |.if FPU
>     |   lfd FARG1, 0(BASE)
> +  |.else
> +  |   lwz CARG1, 0(BASE)
> +  |   lwz CARG2, 4(BASE)
> +  |.endif
>     |  bge ->fff_fallback
>     |5:  // Handle numbers.
> -  |  lwz CARG4, 0(TMP1)
> -  |   cmplw cr1, TMP1, TMP2
> -  |  lfd FARG2, 0(TMP1)
> +  |  lwz CARG3, 0(SAVE0)
> +  |   cmplw cr1, SAVE0, SAVE1
> +  |.if FPU
> +  |  lfd FARG2, 0(SAVE0)
> +  |.else
> +  |  lwz CARG4, 4(SAVE0)
> +  |.endif
>     |   bge cr1, ->fff_resn
> -  |  checknum CARG4; bge >7
> +  |  checknum CARG3; bge >7
>     |6:
> +  |   addi SAVE0, SAVE0, 8
> +  |.if FPU
>     |  fsub f0, FARG1, FARG2
> -  |   addi TMP1, TMP1, 8
>     |.if ismax
>     |  fsel FARG1, f0, FARG1, FARG2
>     |.else
>     |  fsel FARG1, f0, FARG2, FARG1
>     |.endif
> +  |.else
> +  |  stw CARG1, SFSAVE_1
> +  |  stw CARG2, SFSAVE_2
> +  |  stw CARG3, SFSAVE_3
> +  |  stw CARG4, SFSAVE_4
> +  |  blex __ledf2
> +  |  cmpwi CRET1, 0
> +  |.if ismax
> +  |  blt >8
> +  |.else
> +  |  bge >8
> +  |.endif
> +  |  lwz CARG1, SFSAVE_1
> +  |  lwz CARG2, SFSAVE_2
> +  |  b <5
> +  |8:
> +  |  lwz CARG1, SFSAVE_3
> +  |  lwz CARG2, SFSAVE_4
> +  |.endif
>     |  b <5
>     |7:  // Convert integer to number and continue above.
> -  |   lwz CARG2, 4(TMP1)
> +  |   lwz CARG3, 4(SAVE0)
>     |  bne ->fff_fallback
> -  |  tonum_i FARG2, CARG2
> +  |.if FPU
> +  |  tonum_i FARG2, CARG3
> +  |.else
> +  |  bl ->vm_sfi2d_2
> +  |.endif
>     |  b <6
>     |.else
>     |  .ffunc_n name
> @@ -2238,28 +2543,37 @@ static void build_subroutines(BuildCtx *ctx)
>     |
>     |.macro .ffunc_bit_op, name, ins
>     |  .ffunc_bit name
> -  |  addi TMP1, BASE, 8
> -  |  add TMP2, BASE, NARGS8:RC
> +  |  addi SAVE0, BASE, 8
> +  |  add SAVE1, BASE, NARGS8:RC
>     |1:
> -  |  lwz CARG4, 0(TMP1)
> -  |   cmplw cr1, TMP1, TMP2
> +  |  lwz CARG4, 0(SAVE0)
> +  |   cmplw cr1, SAVE0, SAVE1
>     |.if DUALNUM
> -  |  lwz CARG2, 4(TMP1)
> +  |  lwz CARG2, 4(SAVE0)
>     |.else
> -  |  lfd FARG1, 0(TMP1)
> +  |  lfd FARG1, 0(SAVE0)
>     |.endif
>     |   bgey cr1, ->fff_resi
>     |  checknum CARG4
>     |.if DUALNUM
> +  |.if FPU
>     |  bnel ->fff_bitop_fb
>     |.else
> +  |  beq >3
> +  |  stw CARG1, SFSAVE_1
> +  |  bl ->fff_bitop_fb
> +  |  mr CARG2, CARG1
> +  |  lwz CARG1, SFSAVE_1
> +  |3:
> +  |.endif
> +  |.else
>     |  fadd FARG1, FARG1, TOBIT
>     |  bge ->fff_fallback
>     |  stfd FARG1, TMPD
>     |  lwz CARG2, TMPD_LO
>     |.endif
>     |  ins CARG1, CARG1, CARG2
> -  |   addi TMP1, TMP1, 8
> +  |   addi SAVE0, SAVE0, 8
>     |  b <1
>     |.endmacro
>     |
> @@ -2281,7 +2595,14 @@ static void build_subroutines(BuildCtx *ctx)
>     |.macro .ffunc_bit_sh, name, ins, shmod
>     |.if DUALNUM
>     |  .ffunc_2 bit_..name
> +  |.if FPU
>     |  checknum CARG3; bnel ->fff_tobit_fb
> +  |.else
> +  |  checknum CARG3; beq >1
> +  |  bl ->fff_tobit_fb
> +  |  lwz CARG2, 12(BASE)	// Conversion polluted CARG2.
> +  |1:
> +  |.endif
>     |  // Note: no inline conversion from number for 2nd argument!
>     |  checknum CARG4; bne ->fff_fallback
>     |.else
> @@ -2318,27 +2639,77 @@ static void build_subroutines(BuildCtx *ctx)
>     |->fff_resn:
>     |  lwz PC, FRAME_PC(BASE)
>     |  la RA, -8(BASE)
> +  |.if FPU
>     |  stfd FARG1, -8(BASE)
> +  |.else
> +  |  stw CARG1, -8(BASE)
> +  |  stw CARG2, -4(BASE)
> +  |.endif
>     |  b ->fff_res1
>     |
>     |// Fallback FP number to bit conversion.
>     |->fff_tobit_fb:
>     |.if DUALNUM
> +  |.if FPU
>     |  lfd FARG1, 0(BASE)
>     |  bgt ->fff_fallback
>     |  fadd FARG1, FARG1, TOBIT
>     |  stfd FARG1, TMPD
>     |  lwz CARG1, TMPD_LO
>     |  blr
> +  |.else
> +  |  bgt ->fff_fallback
> +  |  mr CARG2, CARG1
> +  |  mr CARG1, CARG3
> +  |// Modifies: CARG1, CARG2, TMP0, TMP1, TMP2.
> +  |->vm_tobit:
> +  |  slwi TMP2, CARG1, 1
> +  |  addis TMP2, TMP2, 0x0020
> +  |  cmpwi TMP2, 0
> +  |  bge >2
> +  |   li TMP1, 0x3e0
> +  |  srawi TMP2, TMP2, 21
> +  |   not TMP1, TMP1
> +  |  sub. TMP2, TMP1, TMP2
> +  |    cmpwi cr7, CARG1, 0
> +  |  blt >1
> +  |   slwi TMP1, CARG1, 11
> +  |    srwi TMP0, CARG2, 21
> +  |   oris TMP1, TMP1, 0x8000
> +  |   or TMP1, TMP1, TMP0
> +  |   srw CARG1, TMP1, TMP2
> +  |  bclr 4, 28			// Return if cr7[lt] == 0, no hint.
> +  |   neg CARG1, CARG1
> +  |  blr
> +  |1:
> +  |  addi TMP2, TMP2, 21
> +  |  srw TMP1, CARG2, TMP2
> +  |   slwi CARG2, CARG1, 12
> +  |  subfic TMP2, TMP2, 20
> +  |   slw TMP0, CARG2, TMP2
> +  |   or CARG1, TMP1, TMP0
> +  |  bclr 4, 28			// Return if cr7[lt] == 0, no hint.
> +  |   neg CARG1, CARG1
> +  |  blr
> +  |2:
> +  |  li CARG1, 0
> +  |  blr
> +  |.endif
>     |.endif
>     |->fff_bitop_fb:
>     |.if DUALNUM
> -  |  lfd FARG1, 0(TMP1)
> +  |.if FPU
> +  |  lfd FARG1, 0(SAVE0)
>     |  bgt ->fff_fallback
>     |  fadd FARG1, FARG1, TOBIT
>     |  stfd FARG1, TMPD
>     |  lwz CARG2, TMPD_LO
>     |  blr
> +  |.else
> +  |  bgt ->fff_fallback
> +  |  mr CARG1, CARG4
> +  |  b ->vm_tobit
> +  |.endif
>     |.endif
>     |
>     |//-----------------------------------------------------------------------
> @@ -2531,10 +2902,21 @@ static void build_subroutines(BuildCtx *ctx)
>     |  decode_RA8 RC, INS			// Call base.
>     |   beq >2
>     |1:  // Move results down.
> +  |.if FPU
>     |  lfd f0, 0(RA)
> +  |.else
> +  |  lwz CARG1, 0(RA)
> +  |  lwz CARG2, 4(RA)
> +  |.endif
>     |   addic. TMP1, TMP1, -8
>     |    addi RA, RA, 8
> +  |.if FPU
>     |  stfdx f0, BASE, RC
> +  |.else
> +  |  add CARG3, BASE, RC
> +  |  stw CARG1, 0(CARG3)
> +  |  stw CARG2, 4(CARG3)
> +  |.endif
>     |    addi RC, RC, 8
>     |   bne <1
>     |2:
> @@ -2587,10 +2969,12 @@ static void build_subroutines(BuildCtx *ctx)
>     |//-----------------------------------------------------------------------
>     |
>     |.macro savex_, a, b, c, d
> +  |.if FPU
>     |  stfd f..a, 16+a*8(sp)
>     |  stfd f..b, 16+b*8(sp)
>     |  stfd f..c, 16+c*8(sp)
>     |  stfd f..d, 16+d*8(sp)
> +  |.endif
>     |.endmacro
>     |
>     |->vm_exit_handler:
> @@ -2662,16 +3046,16 @@ static void build_subroutines(BuildCtx *ctx)
>     |  lwz KBASE, PC2PROTO(k)(TMP1)
>     |  // Setup type comparison constants.
>     |  li TISNUM, LJ_TISNUM
> -  |  lus TMP3, 0x59c0			// TOBIT = 2^52 + 2^51 (float).
> -  |  stw TMP3, TMPD
> +  |  .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |  .FPU stw TMP3, TMPD
>     |  li ZERO, 0
> -  |  ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |  lfs TOBIT, TMPD
> -  |  stw TMP3, TMPD
> -  |  lus TMP0, 0x4338			// Hiword of 2^52 + 2^51 (double)
> +  |  .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |  .FPU lfs TOBIT, TMPD
> +  |  .FPU stw TMP3, TMPD
> +  |  .FPU lus TMP0, 0x4338			// Hiword of 2^52 + 2^51 (double)
>     |    li TISNIL, LJ_TNIL
> -  |  stw TMP0, TONUM_HI
> -  |  lfs TONUM, TMPD
> +  |  .FPU stw TMP0, TONUM_HI
> +  |  .FPU lfs TONUM, TMPD
>     |  // Modified copy of ins_next which handles function header dispatch, too.
>     |  lwz INS, 0(PC)
>     |   addi PC, PC, 4
> @@ -2716,7 +3100,35 @@ static void build_subroutines(BuildCtx *ctx)
>     |//-- Math helper functions ----------------------------------------------
>     |//-----------------------------------------------------------------------
>     |
> -  |// NYI: Use internal implementations of floor, ceil, trunc.
> +  |// NYI: Use internal implementations of floor, ceil, trunc, sfcmp.
> +  |
> +  |.macro sfi2d, AHI, ALO
> +  |.if not FPU
> +  |  mr. AHI, ALO
> +  |  bclr 12, 2				// Handle zero first.
> +  |  srawi TMP0, ALO, 31
> +  |  xor TMP1, ALO, TMP0
> +  |  sub TMP1, TMP1, TMP0		// Absolute value in TMP1.
> +  |  cntlzw AHI, TMP1
> +  |  andix. TMP0, TMP0, 0x800		// Mask sign bit.
> +  |  slw TMP1, TMP1, AHI		// Align mantissa left with leading 1.
> +  |  subfic AHI, AHI, 0x3ff+31-1	// Exponent -1 in AHI.
> +  |  slwi ALO, TMP1, 21
> +  |  or AHI, AHI, TMP0			// Sign | Exponent.
> +  |  srwi TMP1, TMP1, 11
> +  |  slwi AHI, AHI, 20			// Align left.
> +  |  add AHI, AHI, TMP1			// Add mantissa, increment exponent.
> +  |  blr
> +  |.endif
> +  |.endmacro
> +  |
> +  |// Input: CARG2. Output: CARG1, CARG2. Temporaries: TMP0, TMP1.
> +  |->vm_sfi2d_1:
> +  |  sfi2d CARG1, CARG2
> +  |
> +  |// Input: CARG4. Output: CARG3, CARG4. Temporaries: TMP0, TMP1.
> +  |->vm_sfi2d_2:
> +  |  sfi2d CARG3, CARG4
>     |
>     |->vm_modi:
>     |  divwo. TMP0, CARG1, CARG2
> @@ -2784,21 +3196,21 @@ static void build_subroutines(BuildCtx *ctx)
>     |   addi DISPATCH, r12, GG_G2DISP
>     |  stw r11, CTSTATE->cb.slot
>     |  stw r3, CTSTATE->cb.gpr[0]
> -  |   stfd f1, CTSTATE->cb.fpr[0]
> +  |   .FPU stfd f1, CTSTATE->cb.fpr[0]
>     |  stw r4, CTSTATE->cb.gpr[1]
> -  |   stfd f2, CTSTATE->cb.fpr[1]
> +  |   .FPU stfd f2, CTSTATE->cb.fpr[1]
>     |  stw r5, CTSTATE->cb.gpr[2]
> -  |   stfd f3, CTSTATE->cb.fpr[2]
> +  |   .FPU stfd f3, CTSTATE->cb.fpr[2]
>     |  stw r6, CTSTATE->cb.gpr[3]
> -  |   stfd f4, CTSTATE->cb.fpr[3]
> +  |   .FPU stfd f4, CTSTATE->cb.fpr[3]
>     |  stw r7, CTSTATE->cb.gpr[4]
> -  |   stfd f5, CTSTATE->cb.fpr[4]
> +  |   .FPU stfd f5, CTSTATE->cb.fpr[4]
>     |  stw r8, CTSTATE->cb.gpr[5]
> -  |   stfd f6, CTSTATE->cb.fpr[5]
> +  |   .FPU stfd f6, CTSTATE->cb.fpr[5]
>     |  stw r9, CTSTATE->cb.gpr[6]
> -  |   stfd f7, CTSTATE->cb.fpr[6]
> +  |   .FPU stfd f7, CTSTATE->cb.fpr[6]
>     |  stw r10, CTSTATE->cb.gpr[7]
> -  |   stfd f8, CTSTATE->cb.fpr[7]
> +  |   .FPU stfd f8, CTSTATE->cb.fpr[7]
>     |  addi TMP0, sp, CFRAME_SPACE+8
>     |  stw TMP0, CTSTATE->cb.stack
>     |   mr CARG1, CTSTATE
> @@ -2809,21 +3221,21 @@ static void build_subroutines(BuildCtx *ctx)
>     |  lp BASE, L:CRET1->base
>     |     li TISNUM, LJ_TISNUM		// Setup type comparison constants.
>     |  lp RC, L:CRET1->top
> -  |     lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
> +  |     .FPU lus TMP3, 0x59c0		// TOBIT = 2^52 + 2^51 (float).
>     |     li ZERO, 0
>     |   mr L, CRET1
> -  |     stw TMP3, TMPD
> -  |     lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
> +  |     .FPU stw TMP3, TMPD
> +  |     .FPU lus TMP0, 0x4338		// Hiword of 2^52 + 2^51 (double)
>     |  lwz LFUNC:RB, FRAME_FUNC(BASE)
> -  |     ori TMP3, TMP3, 0x0004		// TONUM = 2^52 + 2^51 + 2^31 (float).
> -  |     stw TMP0, TONUM_HI
> +  |     .FPU ori TMP3, TMP3, 0x0004	// TONUM = 2^52 + 2^51 + 2^31 (float).
> +  |     .FPU stw TMP0, TONUM_HI
>     |     li TISNIL, LJ_TNIL
>     |    li_vmstate INTERP
> -  |     lfs TOBIT, TMPD
> -  |     stw TMP3, TMPD
> +  |     .FPU lfs TOBIT, TMPD
> +  |     .FPU stw TMP3, TMPD
>     |  sub RC, RC, BASE
>     |    st_vmstate
> -  |     lfs TONUM, TMPD
> +  |     .FPU lfs TONUM, TMPD
>     |  ins_callt
>     |.endif
>     |
> @@ -2837,7 +3249,7 @@ static void build_subroutines(BuildCtx *ctx)
>     |  mr CARG2, RA
>     |  bl extern lj_ccallback_leave	// (CTState *cts, TValue *o)
>     |  lwz CRET1, CTSTATE->cb.gpr[0]
> -  |  lfd FARG1, CTSTATE->cb.fpr[0]
> +  |  .FPU lfd FARG1, CTSTATE->cb.fpr[0]
>     |  lwz CRET2, CTSTATE->cb.gpr[1]
>     |  b ->vm_leave_unw
>     |.endif
> @@ -2871,14 +3283,14 @@ static void build_subroutines(BuildCtx *ctx)
>     |  bge <1
>     |2:
>     |  bney cr1, >3
> -  |  lfd f1, CCSTATE->fpr[0]
> -  |  lfd f2, CCSTATE->fpr[1]
> -  |  lfd f3, CCSTATE->fpr[2]
> -  |  lfd f4, CCSTATE->fpr[3]
> -  |  lfd f5, CCSTATE->fpr[4]
> -  |  lfd f6, CCSTATE->fpr[5]
> -  |  lfd f7, CCSTATE->fpr[6]
> -  |  lfd f8, CCSTATE->fpr[7]
> +  |  .FPU lfd f1, CCSTATE->fpr[0]
> +  |  .FPU lfd f2, CCSTATE->fpr[1]
> +  |  .FPU lfd f3, CCSTATE->fpr[2]
> +  |  .FPU lfd f4, CCSTATE->fpr[3]
> +  |  .FPU lfd f5, CCSTATE->fpr[4]
> +  |  .FPU lfd f6, CCSTATE->fpr[5]
> +  |  .FPU lfd f7, CCSTATE->fpr[6]
> +  |  .FPU lfd f8, CCSTATE->fpr[7]
>     |3:
>     |   lp TMP0, CCSTATE->func
>     |  lwz CARG2, CCSTATE->gpr[1]
> @@ -2895,7 +3307,7 @@ static void build_subroutines(BuildCtx *ctx)
>     |  lwz TMP2, -4(r14)
>     |   lwz TMP0, 4(r14)
>     |  stw CARG1, CCSTATE:TMP1->gpr[0]
> -  |  stfd FARG1, CCSTATE:TMP1->fpr[0]
> +  |  .FPU stfd FARG1, CCSTATE:TMP1->fpr[0]
>     |  stw CARG2, CCSTATE:TMP1->gpr[1]
>     |   mtlr TMP0
>     |  stw CARG3, CCSTATE:TMP1->gpr[2]
> @@ -2924,19 +3336,19 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>     case BC_ISLT: case BC_ISGE: case BC_ISLE: case BC_ISGT:
>       |  // RA = src1*8, RD = src2*8, JMP with RD = target
>       |.if DUALNUM
> -    |  lwzux TMP0, RA, BASE
> +    |  lwzux CARG1, RA, BASE
>       |    addi PC, PC, 4
>       |   lwz CARG2, 4(RA)
> -    |  lwzux TMP1, RD, BASE
> +    |  lwzux CARG3, RD, BASE
>       |    lwz TMP2, -4(PC)
> -    |  checknum cr0, TMP0
> -    |   lwz CARG3, 4(RD)
> +    |  checknum cr0, CARG1
> +    |   lwz CARG4, 4(RD)
>       |    decode_RD4 TMP2, TMP2
> -    |  checknum cr1, TMP1
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |  checknum cr1, CARG3
> +    |    addis SAVE0, TMP2, -(BCBIAS_J*4 >> 16)
>       |  bne cr0, >7
>       |  bne cr1, >8
> -    |   cmpw CARG2, CARG3
> +    |   cmpw CARG2, CARG4
>       if (op == BC_ISLT) {
>         |  bge >2
>       } else if (op == BC_ISGE) {
> @@ -2947,28 +3359,41 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>         |  ble >2
>       }
>       |1:
> -    |  add PC, PC, TMP2
> +    |  add PC, PC, SAVE0
>       |2:
>       |  ins_next
>       |
>       |7:  // RA is not an integer.
>       |  bgt cr0, ->vmeta_comp
>       |  // RA is a number.
> -    |   lfd f0, 0(RA)
> +    |   .FPU lfd f0, 0(RA)
>       |  bgt cr1, ->vmeta_comp
>       |  blt cr1, >4
>       |  // RA is a number, RD is an integer.
> -    |  tonum_i f1, CARG3
> +    |.if FPU
> +    |  tonum_i f1, CARG4
> +    |.else
> +    |  bl ->vm_sfi2d_2
> +    |.endif
>       |  b >5
>       |
>       |8: // RA is an integer, RD is not an integer.
>       |  bgt cr1, ->vmeta_comp
>       |  // RA is an integer, RD is a number.
> +    |.if FPU
>       |  tonum_i f0, CARG2
> +    |.else
> +    |  bl ->vm_sfi2d_1
> +    |.endif
>       |4:
> -    |  lfd f1, 0(RD)
> +    |  .FPU lfd f1, 0(RD)
>       |5:
> +    |.if FPU
>       |  fcmpu cr0, f0, f1
> +    |.else
> +    |  blex __ledf2
> +    |  cmpwi CRET1, 0
> +    |.endif
>       if (op == BC_ISLT) {
>         |  bge <2
>       } else if (op == BC_ISGE) {
> @@ -3016,42 +3441,42 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       vk = op == BC_ISEQV;
>       |  // RA = src1*8, RD = src2*8, JMP with RD = target
>       |.if DUALNUM
> -    |  lwzux TMP0, RA, BASE
> +    |  lwzux CARG1, RA, BASE
>       |    addi PC, PC, 4
>       |   lwz CARG2, 4(RA)
> -    |  lwzux TMP1, RD, BASE
> -    |  checknum cr0, TMP0
> -    |    lwz TMP2, -4(PC)
> -    |  checknum cr1, TMP1
> -    |    decode_RD4 TMP2, TMP2
> -    |   lwz CARG3, 4(RD)
> +    |  lwzux CARG3, RD, BASE
> +    |  checknum cr0, CARG1
> +    |    lwz SAVE0, -4(PC)
> +    |  checknum cr1, CARG3
> +    |    decode_RD4 SAVE0, SAVE0
> +    |   lwz CARG4, 4(RD)
>       |  cror 4*cr7+gt, 4*cr0+gt, 4*cr1+gt
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |    addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>       if (vk) {
>         |  ble cr7, ->BC_ISEQN_Z
>       } else {
>         |  ble cr7, ->BC_ISNEN_Z
>       }
>       |.else
> -    |  lwzux TMP0, RA, BASE
> -    |   lwz TMP2, 0(PC)
> +    |  lwzux CARG1, RA, BASE
> +    |   lwz SAVE0, 0(PC)
>       |    lfd f0, 0(RA)
>       |   addi PC, PC, 4
> -    |  lwzux TMP1, RD, BASE
> -    |  checknum cr0, TMP0
> -    |   decode_RD4 TMP2, TMP2
> +    |  lwzux CARG3, RD, BASE
> +    |  checknum cr0, CARG1
> +    |   decode_RD4 SAVE0, SAVE0
>       |    lfd f1, 0(RD)
> -    |  checknum cr1, TMP1
> -    |   addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |  checknum cr1, CARG3
> +    |   addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>       |  bge cr0, >5
>       |  bge cr1, >5
>       |  fcmpu cr0, f0, f1
>       if (vk) {
>         |  bne >1
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>       } else {
>         |  beq >1
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>       }
>       |1:
>       |  ins_next
> @@ -3059,36 +3484,36 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |5:  // Either or both types are not numbers.
>       |.if not DUALNUM
>       |    lwz CARG2, 4(RA)
> -    |    lwz CARG3, 4(RD)
> +    |    lwz CARG4, 4(RD)
>       |.endif
>       |.if FFI
> -    |  cmpwi cr7, TMP0, LJ_TCDATA
> -    |  cmpwi cr5, TMP1, LJ_TCDATA
> +    |  cmpwi cr7, CARG1, LJ_TCDATA
> +    |  cmpwi cr5, CARG3, LJ_TCDATA
>       |.endif
> -    |   not TMP3, TMP0
> -    |  cmplw TMP0, TMP1
> -    |   cmplwi cr1, TMP3, ~LJ_TISPRI		// Primitive?
> +    |   not TMP2, CARG1
> +    |  cmplw CARG1, CARG3
> +    |   cmplwi cr1, TMP2, ~LJ_TISPRI		// Primitive?
>       |.if FFI
>       |  cror 4*cr7+eq, 4*cr7+eq, 4*cr5+eq
>       |.endif
> -    |   cmplwi cr6, TMP3, ~LJ_TISTABUD		// Table or userdata?
> +    |   cmplwi cr6, TMP2, ~LJ_TISTABUD		// Table or userdata?
>       |.if FFI
>       |  beq cr7, ->vmeta_equal_cd
>       |.endif
> -    |    cmplw cr5, CARG2, CARG3
> +    |    cmplw cr5, CARG2, CARG4
>       |  crandc 4*cr0+gt, 4*cr0+eq, 4*cr1+gt	// 2: Same type and primitive.
>       |  crorc 4*cr0+lt, 4*cr5+eq, 4*cr0+eq	// 1: Same tv or different type.
>       |  crand 4*cr0+eq, 4*cr0+eq, 4*cr5+eq	// 0: Same type and same tv.
> -    |   mr SAVE0, PC
> +    |   mr SAVE1, PC
>       |  cror 4*cr0+eq, 4*cr0+eq, 4*cr0+gt	// 0 or 2.
>       |  cror 4*cr0+lt, 4*cr0+lt, 4*cr0+gt	// 1 or 2.
>       if (vk) {
>         |  bne cr0, >6
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>         |6:
>       } else {
>         |  beq cr0, >6
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>         |6:
>       }
>       |.if DUALNUM
> @@ -3103,6 +3528,7 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |
>       |  // Different tables or userdatas. Need to check __eq metamethod.
>       |  // Field metatable must be at same offset for GCtab and GCudata!
> +    |   mr CARG3, CARG4
>       |  lwz TAB:TMP2, TAB:CARG2->metatable
>       |   li CARG4, 1-vk			// ne = 0 or 1.
>       |  cmplwi TAB:TMP2, 0
> @@ -3110,7 +3536,7 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  lbz TMP2, TAB:TMP2->nomm
>       |  andix. TMP2, TMP2, 1<<MM_eq
>       |  bne <1				// Or 'no __eq' flag set?
> -    |  mr PC, SAVE0			// Restore old PC.
> +    |  mr PC, SAVE1			// Restore old PC.
>       |  b ->vmeta_equal			// Handle __eq metamethod.
>       break;
>   
> @@ -3151,16 +3577,16 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       vk = op == BC_ISEQN;
>       |  // RA = src*8, RD = num_const*8, JMP with RD = target
>       |.if DUALNUM
> -    |  lwzux TMP0, RA, BASE
> +    |  lwzux CARG1, RA, BASE
>       |    addi PC, PC, 4
>       |   lwz CARG2, 4(RA)
> -    |  lwzux TMP1, RD, KBASE
> -    |  checknum cr0, TMP0
> -    |    lwz TMP2, -4(PC)
> -    |  checknum cr1, TMP1
> -    |    decode_RD4 TMP2, TMP2
> -    |   lwz CARG3, 4(RD)
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |  lwzux CARG3, RD, KBASE
> +    |  checknum cr0, CARG1
> +    |    lwz SAVE0, -4(PC)
> +    |  checknum cr1, CARG3
> +    |    decode_RD4 SAVE0, SAVE0
> +    |   lwz CARG4, 4(RD)
> +    |    addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>       if (vk) {
>         |->BC_ISEQN_Z:
>       } else {
> @@ -3168,7 +3594,7 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       }
>       |  bne cr0, >7
>       |  bne cr1, >8
> -    |   cmpw CARG2, CARG3
> +    |   cmpw CARG2, CARG4
>       |4:
>       |.else
>       if (vk) {
> @@ -3176,20 +3602,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       } else {
>         |->BC_ISNEN_Z:  // Dummy label.
>       }
> -    |  lwzx TMP0, BASE, RA
> +    |  lwzx CARG1, BASE, RA
>       |    addi PC, PC, 4
>       |   lfdx f0, BASE, RA
> -    |    lwz TMP2, -4(PC)
> +    |    lwz SAVE0, -4(PC)
>       |  lfdx f1, KBASE, RD
> -    |    decode_RD4 TMP2, TMP2
> -    |  checknum TMP0
> -    |    addis TMP2, TMP2, -(BCBIAS_J*4 >> 16)
> +    |    decode_RD4 SAVE0, SAVE0
> +    |  checknum CARG1
> +    |    addis SAVE0, SAVE0, -(BCBIAS_J*4 >> 16)
>       |  bge >3
>       |  fcmpu cr0, f0, f1
>       |.endif
>       if (vk) {
>         |  bne >1
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>         |1:
>         |.if not FFI
>         |3:
> @@ -3200,13 +3626,13 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>         |.if not FFI
>         |3:
>         |.endif
> -      |  add PC, PC, TMP2
> +      |  add PC, PC, SAVE0
>         |2:
>       }
>       |  ins_next
>       |.if FFI
>       |3:
> -    |  cmpwi TMP0, LJ_TCDATA
> +    |  cmpwi CARG1, LJ_TCDATA
>       |  beq ->vmeta_equal_cd
>       |  b <1
>       |.endif
> @@ -3214,18 +3640,31 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |7:  // RA is not an integer.
>       |  bge cr0, <3
>       |  // RA is a number.
> -    |   lfd f0, 0(RA)
> +    |   .FPU lfd f0, 0(RA)
>       |  blt cr1, >1
>       |  // RA is a number, RD is an integer.
> -    |  tonum_i f1, CARG3
> +    |.if FPU
> +    |  tonum_i f1, CARG4
> +    |.else
> +    |  bl ->vm_sfi2d_2
> +    |.endif
>       |  b >2
>       |
>       |8: // RA is an integer, RD is a number.
> +    |.if FPU
>       |  tonum_i f0, CARG2
> +    |.else
> +    |  bl ->vm_sfi2d_1
> +    |.endif
>       |1:
> -    |  lfd f1, 0(RD)
> +    |  .FPU lfd f1, 0(RD)
>       |2:
> +    |.if FPU
>       |  fcmpu cr0, f0, f1
> +    |.else
> +    |  blex __ledf2
> +    |  cmpwi CRET1, 0
> +    |.endif
>       |  b <4
>       |.endif
>       break;
> @@ -3280,7 +3719,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>         |  add PC, PC, TMP2
>       } else {
>         |  li TMP1, LJ_TFALSE
> +      |.if FPU
>         |   lfdx f0, BASE, RD
> +      |.else
> +      |   lwzux CARG1, RD, BASE
> +      |   lwz CARG2, 4(RD)
> +      |.endif
>         |  cmplw TMP0, TMP1
>         if (op == BC_ISTC) {
>   	|  bge >1
> @@ -3289,7 +3733,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>         }
>         |  addis PC, PC, -(BCBIAS_J*4 >> 16)
>         |  decode_RD4 TMP2, INS
> +      |.if FPU
>         |   stfdx f0, BASE, RA
> +      |.else
> +      |   stwux CARG1, RA, BASE
> +      |   stw CARG2, 4(RA)
> +      |.endif
>         |  add PC, PC, TMP2
>         |1:
>       }
> @@ -3324,8 +3773,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>     case BC_MOV:
>       |  // RA = dst*8, RD = src*8
>       |  ins_next1
> +    |.if FPU
>       |  lfdx f0, BASE, RD
>       |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwzux TMP0, RD, BASE
> +    |  lwz TMP1, 4(RD)
> +    |  stwux TMP0, RA, BASE
> +    |  stw TMP1, 4(RA)
> +    |.endif
>       |  ins_next2
>       break;
>     case BC_NOT:
> @@ -3427,44 +3883,65 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       ||vk = ((int)op - BC_ADDVN) / (BC_ADDNV-BC_ADDVN);
>       ||switch (vk) {
>       ||case 0:
> -    |   lwzx TMP1, BASE, RB
> +    |   lwzx CARG1, BASE, RB
>       |   .if DUALNUM
> -    |     lwzx TMP2, KBASE, RC
> +    |     lwzx CARG3, KBASE, RC
>       |   .endif
> +    |   .if FPU
>       |    lfdx f14, BASE, RB
>       |    lfdx f15, KBASE, RC
> +    |   .else
> +    |    add TMP1, BASE, RB
> +    |    add TMP2, KBASE, RC
> +    |    lwz CARG2, 4(TMP1)
> +    |    lwz CARG4, 4(TMP2)
> +    |   .endif
>       |   .if DUALNUM
> -    |     checknum cr0, TMP1
> -    |     checknum cr1, TMP2
> +    |     checknum cr0, CARG1
> +    |     checknum cr1, CARG3
>       |     crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |     bge ->vmeta_arith_vn
>       |   .else
> -    |     checknum TMP1; bge ->vmeta_arith_vn
> +    |     checknum CARG1; bge ->vmeta_arith_vn
>       |   .endif
>       ||  break;
>       ||case 1:
> -    |   lwzx TMP1, BASE, RB
> +    |   lwzx CARG1, BASE, RB
>       |   .if DUALNUM
> -    |     lwzx TMP2, KBASE, RC
> +    |     lwzx CARG3, KBASE, RC
>       |   .endif
> +    |   .if FPU
>       |    lfdx f15, BASE, RB
>       |    lfdx f14, KBASE, RC
> +    |   .else
> +    |    add TMP1, BASE, RB
> +    |    add TMP2, KBASE, RC
> +    |    lwz CARG2, 4(TMP1)
> +    |    lwz CARG4, 4(TMP2)
> +    |   .endif
>       |   .if DUALNUM
> -    |     checknum cr0, TMP1
> -    |     checknum cr1, TMP2
> +    |     checknum cr0, CARG1
> +    |     checknum cr1, CARG3
>       |     crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |     bge ->vmeta_arith_nv
>       |   .else
> -    |     checknum TMP1; bge ->vmeta_arith_nv
> +    |     checknum CARG1; bge ->vmeta_arith_nv
>       |   .endif
>       ||  break;
>       ||default:
> -    |   lwzx TMP1, BASE, RB
> -    |   lwzx TMP2, BASE, RC
> +    |   lwzx CARG1, BASE, RB
> +    |   lwzx CARG3, BASE, RC
> +    |   .if FPU
>       |    lfdx f14, BASE, RB
>       |    lfdx f15, BASE, RC
> -    |   checknum cr0, TMP1
> -    |   checknum cr1, TMP2
> +    |   .else
> +    |    add TMP1, BASE, RB
> +    |    add TMP2, BASE, RC
> +    |    lwz CARG2, 4(TMP1)
> +    |    lwz CARG4, 4(TMP2)
> +    |   .endif
> +    |   checknum cr0, CARG1
> +    |   checknum cr1, CARG3
>       |   crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |   bge ->vmeta_arith_vv
>       ||  break;
> @@ -3498,48 +3975,78 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  fsub a, b, a			// b - floor(b/c)*c
>       |.endmacro
>       |
> +    |.macro sfpmod
> +    |->BC_MODVN_Z:
> +    |  stw CARG1, SFSAVE_1
> +    |  stw CARG2, SFSAVE_2
> +    |  mr SAVE0, CARG3
> +    |  mr SAVE1, CARG4
> +    |  blex __divdf3
> +    |  blex floor
> +    |  mr CARG3, SAVE0
> +    |  mr CARG4, SAVE1
> +    |  blex __muldf3
> +    |  mr CARG3, CRET1
> +    |  mr CARG4, CRET2
> +    |  lwz CARG1, SFSAVE_1
> +    |  lwz CARG2, SFSAVE_2
> +    |  blex __subdf3
> +    |.endmacro
> +    |
>       |.macro ins_arithfp, fpins
>       |  ins_arithpre
>       |.if "fpins" == "fpmod_"
>       |  b ->BC_MODVN_Z			// Avoid 3 copies. It's slow anyway.
> -    |.else
> +    |.elif FPU
>       |  fpins f0, f14, f15
>       |  ins_next1
>       |  stfdx f0, BASE, RA
>       |  ins_next2
> +    |.else
> +    |  blex __divdf3			// Only soft-float div uses this macro.
> +    |  ins_next1
> +    |  stwux CRET1, RA, BASE
> +    |  stw CRET2, 4(RA)
> +    |  ins_next2
>       |.endif
>       |.endmacro
>       |
> -    |.macro ins_arithdn, intins, fpins
> +    |.macro ins_arithdn, intins, fpins, fpcall
>       |  // RA = dst*8, RB = src1*8, RC = src2*8 | num_const*8
>       ||vk = ((int)op - BC_ADDVN) / (BC_ADDNV-BC_ADDVN);
>       ||switch (vk) {
>       ||case 0:
> -    |   lwzux TMP1, RB, BASE
> -    |   lwzux TMP2, RC, KBASE
> -    |    lwz CARG1, 4(RB)
> -    |   checknum cr0, TMP1
> -    |    lwz CARG2, 4(RC)
> +    |   lwzux CARG1, RB, BASE
> +    |   lwzux CARG3, RC, KBASE
> +    |    lwz CARG2, 4(RB)
> +    |   checknum cr0, CARG1
> +    |    lwz CARG4, 4(RC)
> +    |   checknum cr1, CARG3
>       ||  break;
>       ||case 1:
> -    |   lwzux TMP1, RB, BASE
> -    |   lwzux TMP2, RC, KBASE
> -    |    lwz CARG2, 4(RB)
> -    |   checknum cr0, TMP1
> -    |    lwz CARG1, 4(RC)
> +    |   lwzux CARG3, RB, BASE
> +    |   lwzux CARG1, RC, KBASE
> +    |    lwz CARG4, 4(RB)
> +    |   checknum cr0, CARG3
> +    |    lwz CARG2, 4(RC)
> +    |   checknum cr1, CARG1
>       ||  break;
>       ||default:
> -    |   lwzux TMP1, RB, BASE
> -    |   lwzux TMP2, RC, BASE
> -    |    lwz CARG1, 4(RB)
> -    |   checknum cr0, TMP1
> -    |    lwz CARG2, 4(RC)
> +    |   lwzux CARG1, RB, BASE
> +    |   lwzux CARG3, RC, BASE
> +    |    lwz CARG2, 4(RB)
> +    |   checknum cr0, CARG1
> +    |    lwz CARG4, 4(RC)
> +    |   checknum cr1, CARG3
>       ||  break;
>       ||}
> -    |  checknum cr1, TMP2
>       |  bne >5
>       |  bne cr1, >5
> -    |  intins CARG1, CARG1, CARG2
> +    |.if "intins" == "intmod"
> +    |  mr CARG1, CARG2
> +    |  mr CARG2, CARG4
> +    |.endif
> +    |  intins CARG1, CARG2, CARG4
>       |  bso >4
>       |1:
>       |  ins_next1
> @@ -3551,29 +4058,40 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  checkov TMP0, <1			// Ignore unrelated overflow.
>       |  ins_arithfallback b
>       |5:  // FP variant.
> +    |.if FPU
>       ||if (vk == 1) {
>       |  lfd f15, 0(RB)
> -    |   crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |  lfd f14, 0(RC)
>       ||} else {
>       |  lfd f14, 0(RB)
> -    |   crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |  lfd f15, 0(RC)
>       ||}
> +    |.endif
> +    |  crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |   ins_arithfallback bge
>       |.if "fpins" == "fpmod_"
>       |  b ->BC_MODVN_Z			// Avoid 3 copies. It's slow anyway.
>       |.else
> +    |.if FPU
>       |  fpins f0, f14, f15
> -    |  ins_next1
>       |  stfdx f0, BASE, RA
> +    |.else
> +    |.if "fpcall" == "sfpmod"
> +    |  sfpmod
> +    |.else
> +    |  blex fpcall
> +    |.endif
> +    |  stwux CRET1, RA, BASE
> +    |  stw CRET2, 4(RA)
> +    |.endif
> +    |  ins_next1
>       |  b <2
>       |.endif
>       |.endmacro
>       |
> -    |.macro ins_arith, intins, fpins
> +    |.macro ins_arith, intins, fpins, fpcall
>       |.if DUALNUM
> -    |  ins_arithdn intins, fpins
> +    |  ins_arithdn intins, fpins, fpcall
>       |.else
>       |  ins_arithfp fpins
>       |.endif
> @@ -3588,9 +4106,9 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  addo. TMP0, TMP0, TMP3
>       |  add y, a, b
>       |.endmacro
> -    |  ins_arith addo32., fadd
> +    |  ins_arith addo32., fadd, __adddf3
>       |.else
> -    |  ins_arith addo., fadd
> +    |  ins_arith addo., fadd, __adddf3
>       |.endif
>       break;
>     case BC_SUBVN: case BC_SUBNV: case BC_SUBVV:
> @@ -3602,36 +4120,48 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  subo. TMP0, TMP0, TMP3
>       |  sub y, a, b
>       |.endmacro
> -    |  ins_arith subo32., fsub
> +    |  ins_arith subo32., fsub, __subdf3
>       |.else
> -    |  ins_arith subo., fsub
> +    |  ins_arith subo., fsub, __subdf3
>       |.endif
>       break;
>     case BC_MULVN: case BC_MULNV: case BC_MULVV:
> -    |  ins_arith mullwo., fmul
> +    |  ins_arith mullwo., fmul, __muldf3
>       break;
>     case BC_DIVVN: case BC_DIVNV: case BC_DIVVV:
>       |  ins_arithfp fdiv
>       break;
>     case BC_MODVN:
> -    |  ins_arith intmod, fpmod
> +    |  ins_arith intmod, fpmod, sfpmod
>       break;
>     case BC_MODNV: case BC_MODVV:
> -    |  ins_arith intmod, fpmod_
> +    |  ins_arith intmod, fpmod_, sfpmod
>       break;
>     case BC_POW:
>       |  // NYI: (partial) integer arithmetic.
> -    |  lwzx TMP1, BASE, RB
> +    |  lwzx CARG1, BASE, RB
> +    |  lwzx CARG3, BASE, RC
> +    |.if FPU
>       |   lfdx FARG1, BASE, RB
> -    |  lwzx TMP2, BASE, RC
>       |   lfdx FARG2, BASE, RC
> -    |  checknum cr0, TMP1
> -    |  checknum cr1, TMP2
> +    |.else
> +    |   add TMP1, BASE, RB
> +    |   add TMP2, BASE, RC
> +    |   lwz CARG2, 4(TMP1)
> +    |   lwz CARG4, 4(TMP2)
> +    |.endif
> +    |  checknum cr0, CARG1
> +    |  checknum cr1, CARG3
>       |  crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
>       |  bge ->vmeta_arith_vv
>       |  blex pow
>       |  ins_next1
> +    |.if FPU
>       |  stfdx FARG1, BASE, RA
> +    |.else
> +    |  stwux CARG1, RA, BASE
> +    |  stw CARG2, 4(RA)
> +    |.endif
>       |  ins_next2
>       break;
>   
> @@ -3651,8 +4181,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   lp BASE, L->base
>       |  bne ->vmeta_binop
>       |  ins_next1
> +    |.if FPU
>       |  lfdx f0, BASE, SAVE0		// Copy result from RB to RA.
>       |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwzux TMP0, SAVE0, BASE
> +    |  lwz TMP1, 4(SAVE0)
> +    |  stwux TMP0, RA, BASE
> +    |  stw TMP1, 4(RA)
> +    |.endif
>       |  ins_next2
>       break;
>   
> @@ -3715,8 +4252,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>     case BC_KNUM:
>       |  // RA = dst*8, RD = num_const*8
>       |  ins_next1
> +    |.if FPU
>       |  lfdx f0, KBASE, RD
>       |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwzux TMP0, RD, KBASE
> +    |  lwz TMP1, 4(RD)
> +    |  stwux TMP0, RA, BASE
> +    |  stw TMP1, 4(RA)
> +    |.endif
>       |  ins_next2
>       break;
>     case BC_KPRI:
> @@ -3749,8 +4293,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  lwzx UPVAL:RB, LFUNC:RB, RD
>       |  ins_next1
>       |  lwz TMP1, UPVAL:RB->v
> +    |.if FPU
>       |  lfd f0, 0(TMP1)
>       |  stfdx f0, BASE, RA
> +    |.else
> +    |  lwz TMP2, 0(TMP1)
> +    |  lwz TMP3, 4(TMP1)
> +    |  stwux TMP2, RA, BASE
> +    |  stw TMP3, 4(RA)
> +    |.endif
>       |  ins_next2
>       break;
>     case BC_USETV:
> @@ -3758,14 +4309,24 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  lwz LFUNC:RB, FRAME_FUNC(BASE)
>       |    srwi RA, RA, 1
>       |    addi RA, RA, offsetof(GCfuncL, uvptr)
> +    |.if FPU
>       |   lfdux f0, RD, BASE
> +    |.else
> +    |   lwzux CARG1, RD, BASE
> +    |   lwz CARG3, 4(RD)
> +    |.endif
>       |  lwzx UPVAL:RB, LFUNC:RB, RA
>       |  lbz TMP3, UPVAL:RB->marked
>       |   lwz CARG2, UPVAL:RB->v
>       |  andix. TMP3, TMP3, LJ_GC_BLACK	// isblack(uv)
>       |    lbz TMP0, UPVAL:RB->closed
>       |   lwz TMP2, 0(RD)
> +    |.if FPU
>       |   stfd f0, 0(CARG2)
> +    |.else
> +    |   stw CARG1, 0(CARG2)
> +    |   stw CARG3, 4(CARG2)
> +    |.endif
>       |    cmplwi cr1, TMP0, 0
>       |   lwz TMP1, 4(RD)
>       |  cror 4*cr0+eq, 4*cr0+eq, 4*cr1+eq
> @@ -3821,11 +4382,21 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  lwz LFUNC:RB, FRAME_FUNC(BASE)
>       |   srwi RA, RA, 1
>       |   addi RA, RA, offsetof(GCfuncL, uvptr)
> +    |.if FPU
>       |    lfdx f0, KBASE, RD
> +    |.else
> +    |    lwzux TMP2, RD, KBASE
> +    |    lwz TMP3, 4(RD)
> +    |.endif
>       |  lwzx UPVAL:RB, LFUNC:RB, RA
>       |  ins_next1
>       |  lwz TMP1, UPVAL:RB->v
> +    |.if FPU
>       |  stfd f0, 0(TMP1)
> +    |.else
> +    |  stw TMP2, 0(TMP1)
> +    |  stw TMP3, 4(TMP1)
> +    |.endif
>       |  ins_next2
>       break;
>     case BC_USETP:
> @@ -3973,11 +4544,21 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |.endif
>       |  ble ->vmeta_tgetv		// Integer key and in array part?
>       |  lwzx TMP0, TMP1, TMP2
> +    |.if FPU
>       |   lfdx f14, TMP1, TMP2
> +    |.else
> +    |   lwzux SAVE0, TMP1, TMP2
> +    |   lwz SAVE1, 4(TMP1)
> +    |.endif
>       |  checknil TMP0; beq >2
>       |1:
>       |  ins_next1
> +    |.if FPU
>       |   stfdx f14, BASE, RA
> +    |.else
> +    |   stwux SAVE0, RA, BASE
> +    |   stw SAVE1, 4(RA)
> +    |.endif
>       |  ins_next2
>       |
>       |2:  // Check for __index if table value is nil.
> @@ -4053,12 +4634,22 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  lwz TMP1, TAB:RB->asize
>       |   lwz TMP2, TAB:RB->array
>       |  cmplw TMP0, TMP1; bge ->vmeta_tgetb
> +    |.if FPU
>       |  lwzx TMP1, TMP2, RC
>       |   lfdx f0, TMP2, RC
> +    |.else
> +    |  lwzux TMP1, TMP2, RC
> +    |   lwz TMP3, 4(TMP2)
> +    |.endif
>       |  checknil TMP1; beq >5
>       |1:
>       |  ins_next1
> +    |.if FPU
>       |   stfdx f0, BASE, RA
> +    |.else
> +    |   stwux TMP1, RA, BASE
> +    |   stw TMP3, 4(RA)
> +    |.endif
>       |  ins_next2
>       |
>       |5:  // Check for __index if table value is nil.
> @@ -4088,10 +4679,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  cmplw TMP0, CARG2
>       |   slwi TMP2, CARG2, 3
>       |  ble ->vmeta_tgetr		// In array part?
> +    |.if FPU
>       |   lfdx f14, TMP1, TMP2
> +    |.else
> +    |   lwzux SAVE0, TMP2, TMP1
> +    |   lwz SAVE1, 4(TMP2)
> +    |.endif
>       |->BC_TGETR_Z:
>       |  ins_next1
> +    |.if FPU
>       |   stfdx f14, BASE, RA
> +    |.else
> +    |   stwux SAVE0, RA, BASE
> +    |   stw SAVE1, 4(RA)
> +    |.endif
>       |  ins_next2
>       break;
>   
> @@ -4132,11 +4733,22 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  ble ->vmeta_tsetv		// Integer key and in array part?
>       |   lwzx TMP2, TMP1, TMP0
>       |  lbz TMP3, TAB:RB->marked
> +    |.if FPU
>       |    lfdx f14, BASE, RA
> +    |.else
> +    |    add SAVE1, BASE, RA
> +    |    lwz SAVE0, 0(SAVE1)
> +    |    lwz SAVE1, 4(SAVE1)
> +    |.endif
>       |   checknil TMP2; beq >3
>       |1:
>       |  andix. TMP2, TMP3, LJ_GC_BLACK	// isblack(table)
> +    |.if FPU
>       |    stfdx f14, TMP1, TMP0
> +    |.else
> +    |    stwux SAVE0, TMP1, TMP0
> +    |    stw SAVE1, 4(TMP1)
> +    |.endif
>       |  bne >7
>       |2:
>       |  ins_next
> @@ -4177,7 +4789,13 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  lwz NODE:TMP2, TAB:RB->node
>       |    stb ZERO, TAB:RB->nomm		// Clear metamethod cache.
>       |  and TMP1, TMP1, TMP0		// idx = str->hash & tab->hmask
> +    |.if FPU
>       |    lfdx f14, BASE, RA
> +    |.else
> +    |    add CARG2, BASE, RA
> +    |    lwz SAVE0, 0(CARG2)
> +    |    lwz SAVE1, 4(CARG2)
> +    |.endif
>       |  slwi TMP0, TMP1, 5
>       |  slwi TMP1, TMP1, 3
>       |  sub TMP1, TMP0, TMP1
> @@ -4193,7 +4811,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |    checknil CARG2; beq >4		// Key found, but nil value?
>       |2:
>       |  andix. TMP0, TMP3, LJ_GC_BLACK	// isblack(table)
> +    |.if FPU
>       |    stfd f14, NODE:TMP2->val
> +    |.else
> +    |    stw SAVE0, NODE:TMP2->val.u32.hi
> +    |    stw SAVE1, NODE:TMP2->val.u32.lo
> +    |.endif
>       |  bne >7
>       |3:
>       |  ins_next
> @@ -4232,7 +4855,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  bl extern lj_tab_newkey		// (lua_State *L, GCtab *t, TValue *k)
>       |  // Returns TValue *.
>       |  lp BASE, L->base
> +    |.if FPU
>       |  stfd f14, 0(CRET1)
> +    |.else
> +    |  stw SAVE0, 0(CRET1)
> +    |  stw SAVE1, 4(CRET1)
> +    |.endif
>       |  b <3				// No 2nd write barrier needed.
>       |
>       |7:  // Possible table write barrier for the value. Skip valiswhite check.
> @@ -4249,13 +4877,24 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   lwz TMP2, TAB:RB->array
>       |    lbz TMP3, TAB:RB->marked
>       |  cmplw TMP0, TMP1
> +    |.if FPU
>       |   lfdx f14, BASE, RA
> +    |.else
> +    |   add CARG2, BASE, RA
> +    |   lwz SAVE0, 0(CARG2)
> +    |   lwz SAVE1, 4(CARG2)
> +    |.endif
>       |  bge ->vmeta_tsetb
>       |  lwzx TMP1, TMP2, RC
>       |  checknil TMP1; beq >5
>       |1:
>       |  andix. TMP0, TMP3, LJ_GC_BLACK	// isblack(table)
> +    |.if FPU
>       |   stfdx f14, TMP2, RC
> +    |.else
> +    |   stwux SAVE0, RC, TMP2
> +    |   stw SAVE1, 4(RC)
> +    |.endif
>       |  bne >7
>       |2:
>       |  ins_next
> @@ -4295,10 +4934,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |2:
>       |  cmplw TMP0, CARG3
>       |   slwi TMP2, CARG3, 3
> +    |.if FPU
>       |   lfdx f14, BASE, RA
> +    |.else
> +    |  lwzux SAVE0, RA, BASE
> +    |  lwz SAVE1, 4(RA)
> +    |.endif
>       |  ble ->vmeta_tsetr		// In array part?
>       |  ins_next1
> +    |.if FPU
>       |   stfdx f14, TMP1, TMP2
> +    |.else
> +    |   stwux SAVE0, TMP1, TMP2
> +    |   stw SAVE1, 4(TMP1)
> +    |.endif
>       |  ins_next2
>       |
>       |7:  // Possible table write barrier for the value. Skip valiswhite check.
> @@ -4328,10 +4977,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   add TMP1, TMP1, TMP0
>       |    andix. TMP0, TMP3, LJ_GC_BLACK	// isblack(table)
>       |3:  // Copy result slots to table.
> +    |.if FPU
>       |   lfd f0, 0(RA)
> +    |.else
> +    |   lwz SAVE0, 0(RA)
> +    |   lwz SAVE1, 4(RA)
> +    |.endif
>       |  addi RA, RA, 8
>       |  cmpw cr1, RA, TMP2
> +    |.if FPU
>       |   stfd f0, 0(TMP1)
> +    |.else
> +    |   stw SAVE0, 0(TMP1)
> +    |   stw SAVE1, 4(TMP1)
> +    |.endif
>       |    addi TMP1, TMP1, 8
>       |  blt cr1, <3
>       |  bne >7
> @@ -4398,9 +5057,20 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |    beq cr1, >3
>       |2:
>       |  addi TMP3, TMP2, 8
> +    |.if FPU
>       |   lfdx f0, RA, TMP2
> +    |.else
> +    |   add CARG3, RA, TMP2
> +    |   lwz CARG1, 0(CARG3)
> +    |   lwz CARG2, 4(CARG3)
> +    |.endif
>       |  cmplw cr1, TMP3, NARGS8:RC
> +    |.if FPU
>       |   stfdx f0, BASE, TMP2
> +    |.else
> +    |   stwux CARG1, TMP2, BASE
> +    |   stw CARG2, 4(TMP2)
> +    |.endif
>       |  mr TMP2, TMP3
>       |  bne cr1, <2
>       |3:
> @@ -4433,14 +5103,28 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |  add BASE, BASE, RA
>       |  lwz TMP1, -24(BASE)
>       |   lwz LFUNC:RB, -20(BASE)
> +    |.if FPU
>       |    lfd f1, -8(BASE)
>       |    lfd f0, -16(BASE)
> +    |.else
> +    |    lwz CARG1, -8(BASE)
> +    |    lwz CARG2, -4(BASE)
> +    |    lwz CARG3, -16(BASE)
> +    |    lwz CARG4, -12(BASE)
> +    |.endif
>       |  stw TMP1, 0(BASE)		// Copy callable.
>       |   stw LFUNC:RB, 4(BASE)
>       |  checkfunc TMP1
> -    |    stfd f1, 16(BASE)		// Copy control var.
>       |     li NARGS8:RC, 16		// Iterators get 2 arguments.
> +    |.if FPU
> +    |    stfd f1, 16(BASE)		// Copy control var.
>       |    stfdu f0, 8(BASE)		// Copy state.
> +    |.else
> +    |    stw CARG1, 16(BASE)		// Copy control var.
> +    |    stw CARG2, 20(BASE)
> +    |    stwu CARG3, 8(BASE)		// Copy state.
> +    |    stw CARG4, 4(BASE)
> +    |.endif
>       |  bne ->vmeta_call
>       |  ins_call
>       break;
> @@ -4461,7 +5145,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   slwi TMP3, RC, 3
>       |  bge >5				// Index points after array part?
>       |  lwzx TMP2, TMP1, TMP3
> +    |.if FPU
>       |   lfdx f0, TMP1, TMP3
> +    |.else
> +    |   lwzux CARG1, TMP3, TMP1
> +    |   lwz CARG2, 4(TMP3)
> +    |.endif
>       |  checknil TMP2
>       |     lwz INS, -4(PC)
>       |  beq >4
> @@ -4473,7 +5162,12 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |.endif
>       |    addi RC, RC, 1
>       |     addis TMP3, PC, -(BCBIAS_J*4 >> 16)
> +    |.if FPU
>       |  stfd f0, 8(RA)
> +    |.else
> +    |  stw CARG1, 8(RA)
> +    |  stw CARG2, 12(RA)
> +    |.endif
>       |     decode_RD4 TMP1, INS
>       |    stw RC, -4(RA)			// Update control var.
>       |     add PC, TMP1, TMP3
> @@ -4498,17 +5192,38 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   slwi RB, RC, 3
>       |   sub TMP3, TMP3, RB
>       |  lwzx RB, TMP2, TMP3
> +    |.if FPU
>       |  lfdx f0, TMP2, TMP3
> +    |.else
> +    |  add CARG3, TMP2, TMP3
> +    |  lwz CARG1, 0(CARG3)
> +    |  lwz CARG2, 4(CARG3)
> +    |.endif
>       |   add NODE:TMP3, TMP2, TMP3
>       |  checknil RB
>       |     lwz INS, -4(PC)
>       |  beq >7
> +    |.if FPU
>       |   lfd f1, NODE:TMP3->key
> +    |.else
> +    |   lwz CARG3, NODE:TMP3->key.u32.hi
> +    |   lwz CARG4, NODE:TMP3->key.u32.lo
> +    |.endif
>       |     addis TMP2, PC, -(BCBIAS_J*4 >> 16)
> +    |.if FPU
>       |  stfd f0, 8(RA)
> +    |.else
> +    |  stw CARG1, 8(RA)
> +    |  stw CARG2, 12(RA)
> +    |.endif
>       |    add RC, RC, TMP0
>       |     decode_RD4 TMP1, INS
> +    |.if FPU
>       |   stfd f1, 0(RA)
> +    |.else
> +    |   stw CARG3, 0(RA)
> +    |   stw CARG4, 4(RA)
> +    |.endif
>       |    addi RC, RC, 1
>       |     add PC, TMP1, TMP2
>       |    stw RC, -4(RA)			// Update control var.
> @@ -4574,9 +5289,19 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   subi TMP2, TMP2, 16
>       |   ble >2				// No vararg slots?
>       |1:  // Copy vararg slots to destination slots.
> +    |.if FPU
>       |  lfd f0, 0(RC)
> +    |.else
> +    |  lwz CARG1, 0(RC)
> +    |  lwz CARG2, 4(RC)
> +    |.endif
>       |   addi RC, RC, 8
> +    |.if FPU
>       |  stfd f0, 0(RA)
> +    |.else
> +    |  stw CARG1, 0(RA)
> +    |  stw CARG2, 4(RA)
> +    |.endif
>       |  cmplw RA, TMP2
>       |   cmplw cr1, RC, TMP3
>       |  bge >3				// All destination slots filled?
> @@ -4599,9 +5324,19 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   addi MULTRES, TMP1, 8
>       |  bgt >7
>       |6:
> +    |.if FPU
>       |  lfd f0, 0(RC)
> +    |.else
> +    |  lwz CARG1, 0(RC)
> +    |  lwz CARG2, 4(RC)
> +    |.endif
>       |   addi RC, RC, 8
> +    |.if FPU
>       |  stfd f0, 0(RA)
> +    |.else
> +    |  stw CARG1, 0(RA)
> +    |  stw CARG2, 4(RA)
> +    |.endif
>       |  cmplw RC, TMP3
>       |   addi RA, RA, 8
>       |  blt <6				// More vararg slots?
> @@ -4652,14 +5387,38 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   li TMP1, 0
>       |2:
>       |  addi TMP3, TMP1, 8
> +    |.if FPU
>       |   lfdx f0, RA, TMP1
> +    |.else
> +    |   add CARG3, RA, TMP1
> +    |   lwz CARG1, 0(CARG3)
> +    |   lwz CARG2, 4(CARG3)
> +    |.endif
>       |  cmpw TMP3, RC
> +    |.if FPU
>       |   stfdx f0, TMP2, TMP1
> +    |.else
> +    |   add CARG3, TMP2, TMP1
> +    |   stw CARG1, 0(CARG3)
> +    |   stw CARG2, 4(CARG3)
> +    |.endif
>       |  beq >3
>       |  addi TMP1, TMP3, 8
> +    |.if FPU
>       |   lfdx f1, RA, TMP3
> +    |.else
> +    |   add CARG3, RA, TMP3
> +    |   lwz CARG1, 0(CARG3)
> +    |   lwz CARG2, 4(CARG3)
> +    |.endif
>       |  cmpw TMP1, RC
> +    |.if FPU
>       |   stfdx f1, TMP2, TMP3
> +    |.else
> +    |   add CARG3, TMP2, TMP3
> +    |   stw CARG1, 0(CARG3)
> +    |   stw CARG2, 4(CARG3)
> +    |.endif
>       |  bne <2
>       |3:
>       |5:
> @@ -4701,8 +5460,15 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       |   subi TMP2, BASE, 8
>       |  decode_RB8 RB, INS
>       if (op == BC_RET1) {
> +      |.if FPU
>         |  lfd f0, 0(RA)
>         |  stfd f0, 0(TMP2)
> +      |.else
> +      |  lwz CARG1, 0(RA)
> +      |  lwz CARG2, 4(RA)
> +      |  stw CARG1, 0(TMP2)
> +      |  stw CARG2, 4(TMP2)
> +      |.endif
>       }
>       |5:
>       |  cmplw RB, RD
> @@ -4763,11 +5529,11 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>         |4:
>         |  stw CARG1, FORL_IDX*8+4(RA)
>       } else {
> -      |  lwz TMP3, FORL_STEP*8(RA)
> +      |  lwz SAVE0, FORL_STEP*8(RA)
>         |   lwz CARG3, FORL_STEP*8+4(RA)
>         |  lwz TMP2, FORL_STOP*8(RA)
>         |   lwz CARG2, FORL_STOP*8+4(RA)
> -      |  cmplw cr7, TMP3, TISNUM
> +      |  cmplw cr7, SAVE0, TISNUM
>         |  cmplw cr1, TMP2, TISNUM
>         |  crand 4*cr0+eq, 4*cr0+eq, 4*cr7+eq
>         |  crand 4*cr0+eq, 4*cr0+eq, 4*cr1+eq
> @@ -4810,41 +5576,80 @@ static void build_ins(BuildCtx *ctx, BCOp op, int defop)
>       if (vk) {
>         |.if DUALNUM
>         |9:  // FP loop.
> +      |.if FPU
>         |  lfd f1, FORL_IDX*8(RA)
>         |.else
> +      |  lwz CARG1, FORL_IDX*8(RA)
> +      |  lwz CARG2, FORL_IDX*8+4(RA)
> +      |.endif
> +      |.else
>         |  lfdux f1, RA, BASE
>         |.endif
> +      |.if FPU
>         |  lfd f3, FORL_STEP*8(RA)
>         |  lfd f2, FORL_STOP*8(RA)
> -      |   lwz TMP3, FORL_STEP*8(RA)
>         |  fadd f1, f1, f3
>         |  stfd f1, FORL_IDX*8(RA)
> +      |.else
> +      |  lwz CARG3, FORL_STEP*8(RA)
> +      |  lwz CARG4, FORL_STEP*8+4(RA)
> +      |  mr SAVE1, RD
> +      |  blex __adddf3
> +      |  mr RD, SAVE1
> +      |  stw CRET1, FORL_IDX*8(RA)
> +      |  stw CRET2, FORL_IDX*8+4(RA)
> +      |  lwz CARG3, FORL_STOP*8(RA)
> +      |  lwz CARG4, FORL_STOP*8+4(RA)
> +      |.endif
> +      |   lwz SAVE0, FORL_STEP*8(RA)
>       } else {
>         |.if DUALNUM
>         |9:  // FP loop.
>         |.else
>         |  lwzux TMP1, RA, BASE
> -      |  lwz TMP3, FORL_STEP*8(RA)
> +      |  lwz SAVE0, FORL_STEP*8(RA)
>         |  lwz TMP2, FORL_STOP*8(RA)
>         |  cmplw cr0, TMP1, TISNUM
> -      |  cmplw cr7, TMP3, TISNUM
> +      |  cmplw cr7, SAVE0, TISNUM
>         |  cmplw cr1, TMP2, TISNUM
>         |.endif
> +      |.if FPU
>         |   lfd f1, FORL_IDX*8(RA)
> +      |.else
> +      |   lwz CARG1, FORL_IDX*8(RA)
> +      |   lwz CARG2, FORL_IDX*8+4(RA)
> +      |.endif
>         |  crand 4*cr0+lt, 4*cr0+lt, 4*cr7+lt
>         |  crand 4*cr0+lt, 4*cr0+lt, 4*cr1+lt
> +      |.if FPU
>         |   lfd f2, FORL_STOP*8(RA)
> +      |.else
> +      |   lwz CARG3, FORL_STOP*8(RA)
> +      |   lwz CARG4, FORL_STOP*8+4(RA)
> +      |.endif
>         |  bge ->vmeta_for
>       }
> -    |  cmpwi cr6, TMP3, 0
> +    |  cmpwi cr6, SAVE0, 0
>       if (op != BC_JFORL) {
>         |  srwi RD, RD, 1
>       }
> +    |.if FPU
>       |   stfd f1, FORL_EXT*8(RA)
> +    |.else
> +    |   stw CARG1, FORL_EXT*8(RA)
> +    |   stw CARG2, FORL_EXT*8+4(RA)
> +    |.endif
>       if (op != BC_JFORL) {
>         |  add RD, PC, RD
>       }
> +    |.if FPU
>       |  fcmpu cr0, f1, f2
> +    |.else
> +    |  mr SAVE1, RD
> +    |  blex __ledf2
> +    |  cmpwi CRET1, 0
> +    |  mr RD, SAVE1
> +    |.endif
>       if (op == BC_JFORI) {
>         |  addis PC, RD, -(BCBIAS_J*4 >> 16)
>       }