x265: Changes of Revision 42
x265.changes
Changed
@@ -1,4 +1,53 @@
 -------------------------------------------------------------------
+Thu Jun 13 05:58:19 UTC 2024 - Luigi Baldoni <aloisio@gmx.com>
+
+- Update to version 3.6
+  New features:
+  * Segment based Ratecontrol (SBRC) feature
+  * Motion-Compensated Spatio-Temporal Filtering
+  * Scene-cut aware qp - BBAQ (Bidirectional Boundary Aware
+    Quantization)
+  * Histogram-Based Scene Change Detection
+  * Film-Grain characteristics as a SEI message to support Film
+    Grain Synthesis(FGS)
+  * Add temporal layer implementation(Hierarchical B-frame
+    implementation)
+  Enhancements to existing features:
+  * Added Dolby Vision 8.4 Profile Support
+  API changes:
+  * Add Segment based Ratecontrol(SBRC) feature: "--no-sbrc".
+  * Add command line parameter for mcstf feature: "--no-mctf".
+  * Add command line parameters for the scene cut aware qp
+    feature: "--scenecut-aware-qp" and "--masking-strength".
+  * Add command line parameters for Histogram-Based Scene Change
+    Detection: "--hist-scenecut".
+  * Add film grain characteristics as a SEI message to the
+    bitstream: "--film-grain <filename>"
+  * cli: add new option --cra-nal (Force nal type to CRA to all
+    frames expect for the first frame, works only with keyint 1)
+  Optimizations:
+  * ARM64 NEON optimizations:- Several time-consuming C
+    functions have been optimized for the targeted platform -
+    aarch64. The overall performance increased by around 20%.
+  * SVE/SVE2 optimizations
+  Bug fixes:
+  * Linux bug to utilize all the cores
+  * Crash with hist-scenecut build when source resolution is not
+    multiple of minCuSize
+  * 32bit and 64bit builds generation for ARM
+  * bugs in zonefile feature (Reflect Zonefile Parameters inside
+    Lookahead, extra IDR issue, Avg I Slice QP value issue etc..)
+  * Add x86 ASM implementation for subsampling luma
+  * Fix for abrladder segfault with load reuse level 1
+  * Reorder miniGOP based on temporal layer hierarchy and add
+    support for more B frame
+  * Add MacOS aarch64 build support
+  * Fix boundary condition issue for Gaussian filter
+- Drop arm.patch and replace it with 0001-Fix-arm-flags.patch
+  and 0004-Do-not-build-with-assembly-support-on-arm.patch
+  (courtesy of Debian)
+
+-------------------------------------------------------------------
 Wed May 19 13:21:09 UTC 2021 - Luigi Baldoni <aloisio@gmx.com>
 
 - Build libx265_main10 and libx265_main12 unconditionally and
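As a quick orientation to the options named in this changelog, the following is a hypothetical x265 3.6 command line exercising histogram-based scene-cut detection, SBRC, MCTF, temporal layers and the film-grain SEI. The input and grain-model file names are placeholders, and option interactions should be checked against the cli.rst changes further down.

    # sketch only; input.y4m and grain_model.bin are placeholder file names
    x265 --input input.y4m --crf 22 \
         --hist-scenecut \
         --sbrc --mcstf \
         --temporal-layers 3 \
         --film-grain grain_model.bin \
         -o out.hevc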
x265.spec
Changed
@@ -1,7 +1,7 @@
 #
 # spec file for package x265
 #
-# Copyright (c) 2021 Packman Team <packman@links2linux.de>
+# Copyright (c) 2024 Packman Team <packman@links2linux.de>
 # Copyright (c) 2014 Torsten Gruner <t.gruner@katodev.de>
 #
 # All modifications and additions to the file contributed by third parties
@@ -17,21 +17,22 @@
 #
 
-%define sover 199
+%define sover 209
 %define libname lib%{name}
 %define libsoname %{libname}-%{sover}
-%define uver 3_5
+%define uver 3_6
 
 Name:           x265
-Version:        3.5
+Version:        3.6
 Release:        0
 Summary:        A free h265/HEVC encoder - encoder binary
 License:        GPL-2.0-or-later
 Group:          Productivity/Multimedia/Video/Editors and Convertors
 URL:            https://bitbucket.org/multicoreware/x265_git
 Source0:        https://bitbucket.org/multicoreware/x265_git/downloads/%{name}_%{version}.tar.gz
-Patch0:         arm.patch
 Patch1:         x265.pkgconfig.patch
 Patch2:         x265-fix_enable512.patch
+Patch3:         0001-Fix-arm-flags.patch
+Patch4:         0004-Do-not-build-with-assembly-support-on-arm.patch
 BuildRequires:  cmake >= 2.8.8
 BuildRequires:  gcc-c++
 BuildRequires:  nasm >= 2.13
@@ -130,6 +131,8 @@
 %cmake_install
 find %{buildroot} -type f -name "*.a" -delete -print0
 
+%check
+
 
 %post -n %{libsoname} -p /sbin/ldconfig
 %postun -n %{libsoname} -p /sbin/ldconfig
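The spec bumps the shared-library soname from 199 to 209, updates the version macros to 3.6, and replaces arm.patch with the two Debian patches shown below. A hedged example of rebuilding this revision locally with osc, the OBS command-line client; the repository name is only illustrative and depends on the project configuration:

    # run inside a checked-out working copy of the package
    osc build openSUSE_Tumbleweed x86_64 x265.spec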
0001-Fix-arm-flags.patch
Added
@@ -0,0 +1,39 @@
+From: Sebastian Ramacher <sramacher@debian.org>
+Date: Sun, 21 Jun 2020 17:54:56 +0200
+Subject: Fix arm* flags
+
+---
+ source/CMakeLists.txt | 7 ++-----
+ 1 file changed, 2 insertions(+), 5 deletions(-)
+
+diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
+index ab5ddfe..eb9b19b 100755
+--- a/source/CMakeLists.txt
++++ b/source/CMakeLists.txt
+@@ -253,10 +253,7 @@ if(GCC)
+     elseif(ARM)
+         find_package(Neon)
+         if(CPU_HAS_NEON)
+-            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
+             add_definitions(-DHAVE_NEON)
+-        else()
+-            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
+         endif()
+     endif()
+     if(ARM64 OR CROSS_COMPILE_ARM64)
+@@ -265,13 +262,13 @@ if(GCC)
+         find_package(SVE2)
+         if(CPU_HAS_SVE2 OR CROSS_COMPILE_SVE2)
+             message(STATUS "Found SVE2")
+-            set(ARM_ARGS -O3 -march=armv8-a+sve2 -fPIC -flax-vector-conversions)
++            set(ARM_ARGS -fPIC -flax-vector-conversions)
+             add_definitions(-DHAVE_SVE2)
+             add_definitions(-DHAVE_SVE)
+             add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that use SVE2
+         elseif(CPU_HAS_SVE OR CROSS_COMPILE_SVE)
+             message(STATUS "Found SVE")
+-            set(ARM_ARGS -O3 -march=armv8-a+sve -fPIC -flax-vector-conversions)
++            set(ARM_ARGS -fPIC -flax-vector-conversions)
+             add_definitions(-DHAVE_SVE)
+             add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that use SVE
+         elseif(CPU_HAS_NEON)
0004-Do-not-build-with-assembly-support-on-arm.patch
Added
@@ -0,0 +1,28 @@
+From: Sebastian Ramacher <sramacher@debian.org>
+Date: Fri, 31 May 2024 23:38:23 +0200
+Subject: Do not build with assembly support on arm*
+
+---
+ source/CMakeLists.txt | 9 ---------
+ 1 file changed, 9 deletions(-)
+
+diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
+index 672cc2d..f112330 100755
+--- a/source/CMakeLists.txt
++++ b/source/CMakeLists.txt
+@@ -73,15 +73,6 @@ elseif(POWERMATCH GREATER "-1")
+         add_definitions(-DPPC64=1)
+         message(STATUS "Detected POWER PPC64 target processor")
+     endif()
+-elseif(ARMMATCH GREATER "-1")
+-    if(CROSS_COMPILE_ARM)
+-        message(STATUS "Cross compiling for ARM arch")
+-    else()
+-        set(CROSS_COMPILE_ARM 0)
+-    endif()
+-    message(STATUS "Detected ARM target processor")
+-    set(ARM 1)
+-    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1)
+ elseif(ARM64MATCH GREATER "-1")
+     #if(CROSS_COMPILE_ARM64)
+     #message(STATUS "Cross compiling for ARM64 arch")
arm.patch
Deleted
@@ -1,108 +0,0 @@ -Index: x265_3.4/source/CMakeLists.txt -=================================================================== ---- x265_3.4.orig/source/CMakeLists.txt -+++ x265_3.4/source/CMakeLists.txt -@@ -64,26 +64,26 @@ elseif(POWERMATCH GREATER "-1") - add_definitions(-DPPC64=1) - message(STATUS "Detected POWER PPC64 target processor") - endif() --elseif(ARMMATCH GREATER "-1") -- if(CROSS_COMPILE_ARM) -- message(STATUS "Cross compiling for ARM arch") -- else() -- set(CROSS_COMPILE_ARM 0) -- endif() -- set(ARM 1) -- if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8) -- message(STATUS "Detected ARM64 target processor") -- set(ARM64 1) -- add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=1 -DHAVE_ARMV6=0) -- else() -- message(STATUS "Detected ARM target processor") -- add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1) -- endif() -+elseif(${SYSPROC} MATCHES "armv5.*") -+ message(STATUS "Detected ARMV5 system processor") -+ set(ARMV5 1) -+ add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=0 -DHAVE_NEON=0) -+elseif(${SYSPROC} STREQUAL "armv6l") -+ message(STATUS "Detected ARMV6 system processor") -+ set(ARMV6 1) -+ add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1 -DHAVE_NEON=0) -+elseif(${SYSPROC} STREQUAL "armv7l") -+ message(STATUS "Detected ARMV7 system processor") -+ set(ARMV7 1) -+ add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1 -DHAVE_NEON=0) -+elseif(${SYSPROC} STREQUAL "aarch64") -+ message(STATUS "Detected AArch64 system processor") -+ set(ARMV7 1) -+ add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=1 -DHAVE_ARMV6=0 -DHAVE_NEON=0) - else() - message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown") - message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}") - endif() -- - if(UNIX) - list(APPEND PLATFORM_LIBS pthread) - find_library(LIBRT rt) -@@ -238,28 +238,9 @@ if(GCC) - endif() - endif() - endif() -- if(ARM AND CROSS_COMPILE_ARM) -- if(ARM64) -- set(ARM_ARGS -fPIC) -- else() -- set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC) -- endif() -- message(STATUS "cross compile arm") -- elseif(ARM) -- if(ARM64) -- set(ARM_ARGS -fPIC) -- add_definitions(-DHAVE_NEON) -- else() -- find_package(Neon) -- if(CPU_HAS_NEON) -- set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC) -- add_definitions(-DHAVE_NEON) -- else() -- set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm) -- endif() -- endif() -+ if(ARMV7) -+ add_definitions(-fPIC) - endif() -- add_definitions(${ARM_ARGS}) - if(FPROFILE_GENERATE) - if(INTEL_CXX) - add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}") -Index: x265_3.4/source/common/cpu.cpp -=================================================================== ---- x265_3.4.orig/source/common/cpu.cpp -+++ x265_3.4/source/common/cpu.cpp -@@ -39,7 +39,7 @@ - #include <machine/cpu.h> - #endif - --#if X265_ARCH_ARM && !defined(HAVE_NEON) -+#if X265_ARCH_ARM && (!defined(HAVE_NEON) || HAVE_NEON==0) - #include <signal.h> - #include <setjmp.h> - static sigjmp_buf jmpbuf; -@@ -350,7 +350,6 @@ uint32_t cpu_detect(bool benableavx512) - } - - canjump = 1; -- PFX(cpu_neon_test)(); - canjump = 0; - signal(SIGILL, oldsig); - #endif // if !HAVE_NEON -@@ -366,7 +365,7 @@ uint32_t cpu_detect(bool benableavx512) - // which may result in incorrect detection and the counters stuck enabled. 
- // right now Apple does not seem to support performance counters for this test - #ifndef __MACH__ -- flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0; -+ //flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0; - #endif - // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc) - #elif X265_ARCH_ARM64
baselibs.conf
Changed
@@ -1,1 +1,1 @@
-libx265-199
+libx265-209
x265_3.5.tar.gz/source/common/aarch64/ipfilter8.S
Deleted
@@ -1,414 +0,0 @@ -/***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc - * - * Authors: Yimeng Su <yimeng.su@huawei.com> - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. - * - * This program is also available under a commercial proprietary license. - * For more information, contact us at license @ x265.com. - *****************************************************************************/ - -#include "asm.S" - -.section .rodata - -.align 4 - -.text - - - -.macro qpel_filter_0_32b - movi v24.8h, #64 - uxtl v19.8h, v5.8b - smull v17.4s, v19.4h, v24.4h - smull2 v18.4s, v19.8h, v24.8h -.endm - -.macro qpel_filter_1_32b - movi v16.8h, #58 - uxtl v19.8h, v5.8b - smull v17.4s, v19.4h, v16.4h - smull2 v18.4s, v19.8h, v16.8h - - movi v24.8h, #10 - uxtl v21.8h, v1.8b - smull v19.4s, v21.4h, v24.4h - smull2 v20.4s, v21.8h, v24.8h - - movi v16.8h, #17 - uxtl v23.8h, v2.8b - smull v21.4s, v23.4h, v16.4h - smull2 v22.4s, v23.8h, v16.8h - - movi v24.8h, #5 - uxtl v1.8h, v6.8b - smull v23.4s, v1.4h, v24.4h - smull2 v16.4s, v1.8h, v24.8h - - sub v17.4s, v17.4s, v19.4s - sub v18.4s, v18.4s, v20.4s - - uxtl v1.8h, v4.8b - sshll v19.4s, v1.4h, #2 - sshll2 v20.4s, v1.8h, #2 - - add v17.4s, v17.4s, v21.4s - add v18.4s, v18.4s, v22.4s - - uxtl v1.8h, v0.8b - uxtl v2.8h, v3.8b - ssubl v21.4s, v2.4h, v1.4h - ssubl2 v22.4s, v2.8h, v1.8h - - add v17.4s, v17.4s, v19.4s - add v18.4s, v18.4s, v20.4s - sub v21.4s, v21.4s, v23.4s - sub v22.4s, v22.4s, v16.4s - add v17.4s, v17.4s, v21.4s - add v18.4s, v18.4s, v22.4s -.endm - -.macro qpel_filter_2_32b - movi v16.4s, #11 - uxtl v19.8h, v5.8b - uxtl v20.8h, v2.8b - saddl v17.4s, v19.4h, v20.4h - saddl2 v18.4s, v19.8h, v20.8h - - uxtl v21.8h, v1.8b - uxtl v22.8h, v6.8b - saddl v19.4s, v21.4h, v22.4h - saddl2 v20.4s, v21.8h, v22.8h - - mul v19.4s, v19.4s, v16.4s - mul v20.4s, v20.4s, v16.4s - - movi v16.4s, #40 - mul v17.4s, v17.4s, v16.4s - mul v18.4s, v18.4s, v16.4s - - uxtl v21.8h, v4.8b - uxtl v22.8h, v3.8b - saddl v23.4s, v21.4h, v22.4h - saddl2 v16.4s, v21.8h, v22.8h - - uxtl v1.8h, v0.8b - uxtl v2.8h, v7.8b - saddl v21.4s, v1.4h, v2.4h - saddl2 v22.4s, v1.8h, v2.8h - - shl v23.4s, v23.4s, #2 - shl v16.4s, v16.4s, #2 - - add v19.4s, v19.4s, v21.4s - add v20.4s, v20.4s, v22.4s - add v17.4s, v17.4s, v23.4s - add v18.4s, v18.4s, v16.4s - sub v17.4s, v17.4s, v19.4s - sub v18.4s, v18.4s, v20.4s -.endm - -.macro qpel_filter_3_32b - movi v16.8h, #17 - movi v24.8h, #5 - - uxtl v19.8h, v5.8b - smull v17.4s, v19.4h, v16.4h - smull2 v18.4s, v19.8h, v16.8h - - uxtl v21.8h, v1.8b - smull v19.4s, v21.4h, v24.4h - smull2 v20.4s, v21.8h, v24.8h - - movi v16.8h, #58 - uxtl v23.8h, v2.8b - smull v21.4s, v23.4h, v16.4h - smull2 v22.4s, v23.8h, v16.8h - - movi v24.8h, #10 - uxtl v1.8h, v6.8b - smull v23.4s, v1.4h, v24.4h - smull2 v16.4s, v1.8h, v24.8h - - 
sub v17.4s, v17.4s, v19.4s - sub v18.4s, v18.4s, v20.4s - - uxtl v1.8h, v3.8b - sshll v19.4s, v1.4h, #2 - sshll2 v20.4s, v1.8h, #2 - - add v17.4s, v17.4s, v21.4s - add v18.4s, v18.4s, v22.4s - - uxtl v1.8h, v4.8b - uxtl v2.8h, v7.8b - ssubl v21.4s, v1.4h, v2.4h - ssubl2 v22.4s, v1.8h, v2.8h - - add v17.4s, v17.4s, v19.4s - add v18.4s, v18.4s, v20.4s - sub v21.4s, v21.4s, v23.4s - sub v22.4s, v22.4s, v16.4s - add v17.4s, v17.4s, v21.4s - add v18.4s, v18.4s, v22.4s -.endm - - - - -.macro vextin8 - ld1 {v3.16b}, x11, #16 - mov v7.d0, v3.d1 - ext v0.8b, v3.8b, v7.8b, #1 - ext v4.8b, v3.8b, v7.8b, #2 - ext v1.8b, v3.8b, v7.8b, #3 - ext v5.8b, v3.8b, v7.8b, #4 - ext v2.8b, v3.8b, v7.8b, #5 - ext v6.8b, v3.8b, v7.8b, #6 - ext v3.8b, v3.8b, v7.8b, #7 -.endm - - - -// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) -.macro HPS_FILTER a b filterhps - mov w12, #8192 - mov w6, w10 - sub x3, x3, #\a - lsl x3, x3, #1 - mov w9, #\a - cmp w9, #4 - b.eq 14f - cmp w9, #12 - b.eq 15f - b 7f -14: - HPS_FILTER_4 \a \b \filterhps - b 10f -15: - HPS_FILTER_12 \a \b \filterhps - b 10f -7: - cmp w5, #0 - b.eq 8f - cmp w5, #1 - b.eq 9f -8: -loop1_hps_\filterhps\()_\a\()x\b\()_rowext0: - mov w7, #\a - lsr w7, w7, #3 - mov x11, x0 - sub x11, x11, #4 -loop2_hps_\filterhps\()_\a\()x\b\()_rowext0: - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - sub v18.4s, v18.4s, v16.4s - xtn v0.4h, v17.4s - xtn2 v0.8h, v18.4s - st1 {v0.8h}, x2, #16 - subs w7, w7, #1 - sub x11, x11, #8 - b.ne loop2_hps_\filterhps\()_\a\()x\b\()_rowext0 - subs w6, w6, #1 - add x0, x0, x1 - add x2, x2, x3 - b.ne loop1_hps_\filterhps\()_\a\()x\b\()_rowext0 - b 10f -9: -loop3_hps_\filterhps\()_\a\()x\b\()_rowext1: - mov w7, #\a - lsr w7, w7, #3 - mov x11, x0 - sub x11, x11, #4 -loop4_hps_\filterhps\()_\a\()x\b\()_rowext1: - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - sub v18.4s, v18.4s, v16.4s - xtn v0.4h, v17.4s - xtn2 v0.8h, v18.4s - st1 {v0.8h}, x2, #16 - subs w7, w7, #1 - sub x11, x11, #8 - b.ne loop4_hps_\filterhps\()_\a\()x\b\()_rowext1 - subs w6, w6, #1 - add x0, x0, x1 - add x2, x2, x3 - b.ne loop3_hps_\filterhps\()_\a\()x\b\()_rowext1 -10: -.endm - -.macro HPS_FILTER_4 w h filterhps - cmp w5, #0 - b.eq 11f - cmp w5, #1 - b.eq 12f -11: -loop4_hps_\filterhps\()_\w\()x\h\()_rowext0: - mov x11, x0 - sub x11, x11, #4 - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - xtn v0.4h, v17.4s - st1 {v0.4h}, x2, #8 - sub x11, x11, #8 - subs w6, w6, #1 - add x0, x0, x1 - add x2, x2, x3 - b.ne loop4_hps_\filterhps\()_\w\()x\h\()_rowext0 - b 13f -12: -loop5_hps_\filterhps\()_\w\()x\h\()_rowext1: - mov x11, x0 - sub x11, x11, #4 - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - xtn v0.4h, v17.4s - st1 {v0.4h}, x2, #8 - sub x11, x11, #8 - subs w6, w6, #1 - add x0, x0, x1 - add x2, x2, x3 - b.ne loop5_hps_\filterhps\()_\w\()x\h\()_rowext1 -13: -.endm - -.macro HPS_FILTER_12 w h filterhps - cmp w5, #0 - b.eq 14f - cmp w5, #1 - b.eq 15f -14: -loop12_hps_\filterhps\()_\w\()x\h\()_rowext0: - mov x11, x0 - sub x11, x11, #4 - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - sub v18.4s, v18.4s, v16.4s - xtn v0.4h, v17.4s - xtn2 v0.8h, v18.4s - st1 {v0.8h}, x2, #16 - sub x11, x11, #8 - - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - xtn v0.4h, v17.4s - st1 {v0.4h}, x2, #8 - add x2, x2, x3 - subs w6, w6, #1 - add x0, x0, x1 - b.ne 
loop12_hps_\filterhps\()_\w\()x\h\()_rowext0 - b 16f -15: -loop12_hps_\filterhps\()_\w\()x\h\()_rowext1: - mov x11, x0 - sub x11, x11, #4 - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - sub v18.4s, v18.4s, v16.4s - xtn v0.4h, v17.4s - xtn2 v0.8h, v18.4s - st1 {v0.8h}, x2, #16 - sub x11, x11, #8 - - vextin8 - \filterhps - dup v16.4s, w12 - sub v17.4s, v17.4s, v16.4s - xtn v0.4h, v17.4s - st1 {v0.4h}, x2, #8 - add x2, x2, x3 - subs w6, w6, #1 - add x0, x0, x1 - b.ne loop12_hps_\filterhps\()_\w\()x\h\()_rowext1 -16: -.endm - -// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) -.macro LUMA_HPS w h -function x265_interp_8tap_horiz_ps_\w\()x\h\()_neon - mov w10, #\h - cmp w5, #0 - b.eq 6f - sub x0, x0, x1, lsl #2 - - add x0, x0, x1 - add w10, w10, #7 -6: - cmp w4, #0 - b.eq 0f - cmp w4, #1 - b.eq 1f - cmp w4, #2 - b.eq 2f - cmp w4, #3 - b.eq 3f -0: - HPS_FILTER \w \h qpel_filter_0_32b - b 5f -1: - HPS_FILTER \w \h qpel_filter_1_32b - b 5f -2: - HPS_FILTER \w \h qpel_filter_2_32b - b 5f -3: - HPS_FILTER \w \h qpel_filter_3_32b - b 5f -5: - ret -endfunc -.endm - -LUMA_HPS 4 4 -LUMA_HPS 4 8 -LUMA_HPS 4 16 -LUMA_HPS 8 4 -LUMA_HPS 8 8 -LUMA_HPS 8 16 -LUMA_HPS 8 32 -LUMA_HPS 12 16 -LUMA_HPS 16 4 -LUMA_HPS 16 8 -LUMA_HPS 16 12 -LUMA_HPS 16 16 -LUMA_HPS 16 32 -LUMA_HPS 16 64 -LUMA_HPS 24 32 -LUMA_HPS 32 8 -LUMA_HPS 32 16 -LUMA_HPS 32 24 -LUMA_HPS 32 32 -LUMA_HPS 32 64 -LUMA_HPS 48 64 -LUMA_HPS 64 16 -LUMA_HPS 64 32 -LUMA_HPS 64 48 -LUMA_HPS 64 64
x265_3.5.tar.gz/source/common/aarch64/ipfilter8.h
Deleted
@@ -1,55 +0,0 @@ -/***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc - * - * Authors: Yimeng Su <yimeng.su@huawei.com> - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. - * - * This program is also available under a commercial proprietary license. - * For more information, contact us at license @ x265.com. - *****************************************************************************/ - -#ifndef X265_IPFILTER8_AARCH64_H -#define X265_IPFILTER8_AARCH64_H - - -void x265_interp_8tap_horiz_ps_4x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_4x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_4x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_12x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x16_neon(const 
pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); - - -#endif // ifndef X265_IPFILTER8_AARCH64_H
x265_3.5.tar.gz/source/common/aarch64/pixel-util.h
Deleted
@@ -1,40 +0,0 @@ -/***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc - * - * Authors: Yimeng Su <yimeng.su@huawei.com> - * Hongbin Liu <liuhongbin1@huawei.com> - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. - * - * This program is also available under a commercial proprietary license. - * For more information, contact us at license @ x265.com. - *****************************************************************************/ - -#ifndef X265_PIXEL_UTIL_AARCH64_H -#define X265_PIXEL_UTIL_AARCH64_H - -int x265_pixel_satd_4x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_4x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_4x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_4x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_8x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_8x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_12x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); -int x265_pixel_satd_12x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); - -uint32_t x265_quant_neon(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff); -int PFX(psyCost_4x4_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); - -#endif // ifndef X265_PIXEL_UTIL_AARCH64_H
x265_3.5.tar.gz/source/common/aarch64/pixel.h
Deleted
@@ -1,105 +0,0 @@ -/***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc - * - * Authors: Hongbin Liu <liuhongbin1@huawei.com> - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. - * - * This program is also available under a commercial proprietary license. - * For more information, contact us at license @ x265.com. - *****************************************************************************/ - -#ifndef X265_I386_PIXEL_AARCH64_H -#define X265_I386_PIXEL_AARCH64_H - -void x265_pixel_avg_pp_4x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_4x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_4x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_8x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_8x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_8x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_8x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_12x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_16x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_16x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_16x12_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_16x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_16x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_16x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_24x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_32x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t 
sstride1, int); -void x265_pixel_avg_pp_32x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_32x24_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_32x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_32x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_48x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_64x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_64x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_64x48_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_pp_64x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); - -void x265_sad_x3_4x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_4x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_4x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_8x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_8x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_8x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_8x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_12x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_16x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_16x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_16x12_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_16x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_16x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_16x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_24x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_32x8_neon(const pixel* fenc, const 
pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_32x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_32x24_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_32x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_32x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_48x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_64x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_64x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_64x48_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); -void x265_sad_x3_64x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); - -void x265_sad_x4_4x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_4x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_4x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_8x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_8x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_8x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_8x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_12x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_16x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_16x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_16x12_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_16x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_16x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_16x64_neon(const pixel* fenc, 
const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_24x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_32x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_32x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_32x24_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_32x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_32x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_48x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_64x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_64x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_64x48_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); -void x265_sad_x4_64x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); - -#endif // ifndef X265_I386_PIXEL_AARCH64_H
x265_3.6.tar.gz/.gitignore
Added
@@ -0,0 +1,36 @@
+# Prerequisites
+*.d
+
+# Compiled Object files
+*.slo
+*.lo
+*.o
+*.obj
+
+# Precompiled Headers
+*.gch
+*.pch
+
+# Compiled Dynamic libraries
+*.so
+*.dylib
+*.dll
+
+# Fortran module files
+*.mod
+*.smod
+
+# Compiled Static libraries
+*.lai
+*.la
+*.a
+*.lib
+
+# Executables
+*.exe
+*.out
+*.app
+
+# Build directory
+build/
+
x265_3.5.tar.gz/build/README.txt -> x265_3.6.tar.gz/build/README.txt
Changed
@@ -6,6 +6,9 @@
 Note: MSVC12 requires cmake 2.8.11 or later
 
+Note: When the SVE/SVE2 instruction set of Arm AArch64 architecture is to be used, the GCC10.x and onwards must
+      be installed in order to compile x265.
+
 
 
 = Optional Prerequisites =
 
@@ -88,3 +91,25 @@
 building out of a Mercurial source repository. If you are building out of
 a release source package, the version will not change. If Mercurial is not
 found, the version will be "unknown".
+
+= Build Instructions for cross-compilation for Arm AArch64 Targets=
+
+When the target platform is based on Arm AArch64 architecture, the x265 can be
+built in x86 platforms. However, the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER
+enviroment variables should be set to point to the cross compilers of the
+appropriate gcc. For example:
+
+1. export CMAKE_C_COMPILER=aarch64-unknown-linux-gnu-gcc
+2. export CMAKE_CXX_COMPILER=aarch64-unknown-linux-gnu-g++
+
+The default ones are aarch64-linux-gnu-gcc and aarch64-linux-gnu-g++.
+Then, the normal building process can be followed.
+
+Moreover, if the target platform supports SVE or SVE2 instruction set, the
+CROSS_COMPILE_SVE or CROSS_COMPILE_SVE2 environment variables should be set
+to true, respectively. For example:
+
+1. export CROSS_COMPILE_SVE2=true
+2. export CROSS_COMPILE_SVE=true
+
+Then, the normal building process can be followed.
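Putting the steps from the README together, here is a sketch of a complete AArch64 cross build using the updated toolchain files in this revision. It assumes the aarch64-unknown-linux-gnu toolchain is installed and that the target supports SVE2; "make" is the usual follow-up for the Unix Makefiles generator.

    export CMAKE_C_COMPILER=aarch64-unknown-linux-gnu-gcc
    export CMAKE_CXX_COMPILER=aarch64-unknown-linux-gnu-g++
    export CROSS_COMPILE_SVE2=true
    cd build/aarch64-linux
    ./make-Makefiles.bash   # runs cmake with crosscompile.cmake (see below)
    make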
x265_3.6.tar.gz/build/aarch64-darwin
Added
+(directory)
x265_3.6.tar.gz/build/aarch64-darwin/crosscompile.cmake
Added
@@ -0,0 +1,23 @@
+# CMake toolchain file for cross compiling x265 for aarch64
+# This feature is only supported as experimental. Use with caution.
+# Please report bugs on bitbucket
+# Run cmake with: cmake -DCMAKE_TOOLCHAIN_FILE=crosscompile.cmake -G "Unix Makefiles" ../../source && ccmake ../../source
+
+set(CROSS_COMPILE_ARM64 1)
+set(CMAKE_SYSTEM_NAME Darwin)
+set(CMAKE_SYSTEM_PROCESSOR aarch64)
+
+# specify the cross compiler
+set(CMAKE_C_COMPILER gcc-12)
+set(CMAKE_CXX_COMPILER g++-12)
+
+# specify the target environment
+SET(CMAKE_FIND_ROOT_PATH /opt/homebrew/bin/)
+
+# specify whether SVE/SVE2 is supported by the target platform
+if(DEFINED ENV{CROSS_COMPILE_SVE2})
+    set(CROSS_COMPILE_SVE2 1)
+elseif(DEFINED ENV{CROSS_COMPILE_SVE})
+    set(CROSS_COMPILE_SVE 1)
+endif()
+
x265_3.6.tar.gz/build/aarch64-darwin/make-Makefiles.bash
Added
@@ -0,0 +1,4 @@
+#!/bin/bash
+# Run this from within a bash shell
+
+cmake -DCMAKE_TOOLCHAIN_FILE="crosscompile.cmake" -G "Unix Makefiles" ../../source && ccmake ../../source
x265_3.5.tar.gz/build/aarch64-linux/crosscompile.cmake -> x265_3.6.tar.gz/build/aarch64-linux/crosscompile.cmake
Changed
@@ -3,13 +3,29 @@
 # Please report bugs on bitbucket
 # Run cmake with: cmake -DCMAKE_TOOLCHAIN_FILE=crosscompile.cmake -G "Unix Makefiles" ../../source && ccmake ../../source
 
-set(CROSS_COMPILE_ARM 1)
+set(CROSS_COMPILE_ARM64 1)
 set(CMAKE_SYSTEM_NAME Linux)
 set(CMAKE_SYSTEM_PROCESSOR aarch64)
 
 # specify the cross compiler
-set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
-set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
+if(DEFINED ENV{CMAKE_C_COMPILER})
+    set(CMAKE_C_COMPILER $ENV{CMAKE_C_COMPILER})
+else()
+    set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
+endif()
+if(DEFINED ENV{CMAKE_CXX_COMPILER})
+    set(CMAKE_CXX_COMPILER $ENV{CMAKE_CXX_COMPILER})
+else()
+    set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
+endif()
 
 # specify the target environment
 SET(CMAKE_FIND_ROOT_PATH /usr/aarch64-linux-gnu)
+
+# specify whether SVE/SVE2 is supported by the target platform
+if(DEFINED ENV{CROSS_COMPILE_SVE2})
+    set(CROSS_COMPILE_SVE2 1)
+elseif(DEFINED ENV{CROSS_COMPILE_SVE})
+    set(CROSS_COMPILE_SVE 1)
+endif()
+
x265_3.5.tar.gz/build/arm-linux/make-Makefiles.bash -> x265_3.6.tar.gz/build/arm-linux/make-Makefiles.bash
Changed
@@ -1,4 +1,4 @@
 #!/bin/bash
 # Run this from within a bash shell
 
-cmake -G "Unix Makefiles" ../../source && ccmake ../../source
+cmake -DCMAKE_TOOLCHAIN_FILE="crosscompile.cmake" -G "Unix Makefiles" ../../source && ccmake ../../source
x265_3.5.tar.gz/doc/reST/cli.rst -> x265_3.6.tar.gz/doc/reST/cli.rst
Changed
@@ -632,9 +632,8 @@ auto-detection by the encoder. If specified, the encoder will attempt to bring the encode specifications within that specified level. If the encoder is unable to reach the level it issues a - warning and aborts the encode. If the requested requirement level is - higher than the actual level, the actual requirement level is - signaled. + warning and aborts the encode. The requested level will be signaled + in the bitstream even if it is higher than the actual level. Beware, specifying a decoder level will force the encoder to enable VBV for constant rate factor encodes, which may introduce @@ -714,11 +713,8 @@ (main, main10, etc). Second, an encoder is created from this x265_param instance and the :option:`--level-idc` and :option:`--high-tier` parameters are used to reduce bitrate or other - features in order to enforce the target level. Finally, the encoder - re-examines the final set of parameters and detects the actual - minimum decoder requirement level and this is what is signaled in - the bitstream headers. The detected decoder level will only use High - tier if the user specified a High tier level. + features in order to enforce the target level. The detected decoder level + will only use High tier if the user specified a High tier level. The signaled profile will be determined by the encoder's internal bitdepth and input color space. If :option:`--keyint` is 0 or 1, @@ -961,21 +957,21 @@ Note that :option:`--analysis-save-reuse-level` and :option:`--analysis-load-reuse-level` must be paired with :option:`--analysis-save` and :option:`--analysis-load` respectively. - +--------------+------------------------------------------+ - | Level | Description | - +==============+==========================================+ - | 1 | Lookahead information | - +--------------+------------------------------------------+ - | 2 to 4 | Level 1 + intra/inter modes, ref's | - +--------------+------------------------------------------+ - | 5 and 6 | Level 2 + rect-amp | - +--------------+------------------------------------------+ - | 7 | Level 5 + AVC size CU refinement | - +--------------+------------------------------------------+ - | 8 and 9 | Level 5 + AVC size Full CU analysis-info | - +--------------+------------------------------------------+ - | 10 | Level 5 + Full CU analysis-info | - +--------------+------------------------------------------+ + +--------------+---------------------------------------------------+ + | Level | Description | + +==============+===================================================+ + | 1 | Lookahead information | + +--------------+---------------------------------------------------+ + | 2 to 4 | Level 1 + intra/inter modes, depth, ref's, cutree | + +--------------+---------------------------------------------------+ + | 5 and 6 | Level 2 + rect-amp | + +--------------+---------------------------------------------------+ + | 7 | Level 5 + AVC size CU refinement | + +--------------+---------------------------------------------------+ + | 8 and 9 | Level 5 + AVC size Full CU analysis-info | + +--------------+---------------------------------------------------+ + | 10 | Level 5 + Full CU analysis-info | + +--------------+---------------------------------------------------+ .. option:: --refine-mv-type <string> @@ -1332,6 +1328,11 @@ Search range for HME level 0, 1 and 2. The Search Range for each HME level must be between 0 and 32768(excluding). Default search range is 16,32,48 for level 0,1,2 respectively. + +.. 
option:: --mcstf, --no-mcstf + + Enable Motion Compensated Temporal filtering. + Default: disabled Spatial/intra options ===================== @@ -1473,17 +1474,9 @@ .. option:: --hist-scenecut, --no-hist-scenecut - Indicates that scenecuts need to be detected using luma edge and chroma histograms. - :option:`--hist-scenecut` enables scenecut detection using the histograms and disables the default scene cut algorithm. - :option:`--no-hist-scenecut` disables histogram based scenecut algorithm. - -.. option:: --hist-threshold <0.0..1.0> - - This value represents the threshold for normalized SAD of edge histograms used in scenecut detection. - This requires :option:`--hist-scenecut` to be enabled. For example, a value of 0.2 indicates that a frame with normalized SAD value - greater than 0.2 against the previous frame as scenecut. - Increasing the threshold reduces the number of scenecuts detected. - Default 0.03. + Scenecuts detected based on histogram, intensity and variance of the picture. + :option:`--hist-scenecut` enables or :option:`--no-hist-scenecut` disables scenecut detection based on + histogram. .. option:: --radl <integer> @@ -1766,6 +1759,12 @@ Default 1.0. **Range of values:** 0.0 to 3.0 +.. option:: --sbrc --no-sbrc + + To enable and disable segment based rate control.Segment duration depends on the + keyframe interval specified.If unspecified,default keyframe interval will be used. + Default: disabled. + .. option:: --hevc-aq Enable adaptive quantization @@ -1976,12 +1975,18 @@ **CLI ONLY** +.. option:: --scenecut-qp-config <filename> + + Specify a text file which contains the scenecut aware QP options. + The options include :option:`--scenecut-aware-qp` and :option:`--masking-strength` + + **CLI ONLY** + .. option:: --scenecut-aware-qp <integer> It reduces the bits spent on the inter-frames within the scenecut window before and after a scenecut by increasing their QP in ratecontrol pass2 algorithm - without any deterioration in visual quality. If a scenecut falls within the window, - the QP of the inter-frames after this scenecut will not be modified. + without any deterioration in visual quality. :option:`--scenecut-aware-qp` works only with --pass 2. Default 0. +-------+---------------------------------------------------------------+ @@ -2006,48 +2011,83 @@ for the QP increment for inter-frames when :option:`--scenecut-aware-qp` is enabled. 
- When :option:`--scenecut-aware-qp` is:: + When :option:`--scenecut-aware-qp` is: + * 1 (Forward masking): - --masking-strength <fwdWindow,fwdRefQPDelta,fwdNonRefQPDelta> + --masking-strength <fwdMaxWindow,fwdRefQPDelta,fwdNonRefQPDelta> + or + --masking-strength <fwdWindow1,fwdRefQPDelta1,fwdNonRefQPDelta1,fwdWindow2,fwdRefQPDelta2,fwdNonRefQPDelta2, + fwdWindow3,fwdRefQPDelta3,fwdNonRefQPDelta3,fwdWindow4,fwdRefQPDelta4,fwdNonRefQPDelta4, + fwdWindow5,fwdRefQPDelta5,fwdNonRefQPDelta5,fwdWindow6,fwdRefQPDelta6,fwdNonRefQPDelta6> * 2 (Backward masking): - --masking-strength <bwdWindow,bwdRefQPDelta,bwdNonRefQPDelta> + --masking-strength <bwdMaxWindow,bwdRefQPDelta,bwdNonRefQPDelta> + or + --masking-strength <bwdWindow1,bwdRefQPDelta1,bwdNonRefQPDelta1,bwdWindow2,bwdRefQPDelta2,bwdNonRefQPDelta2, + bwdWindow3,bwdRefQPDelta3,bwdNonRefQPDelta3,bwdWindow4,bwdRefQPDelta4,bwdNonRefQPDelta4, + bwdWindow5,bwdRefQPDelta5,bwdNonRefQPDelta5,bwdWindow6,bwdRefQPDelta6,bwdNonRefQPDelta6> * 3 (Bi-directional masking): - --masking-strength <fwdWindow,fwdRefQPDelta,fwdNonRefQPDelta,bwdWindow,bwdRefQPDelta,bwdNonRefQPDelta> + --masking-strength <fwdMaxWindow,fwdRefQPDelta,fwdNonRefQPDelta,bwdMaxWindow,bwdRefQPDelta,bwdNonRefQPDelta> + or + --masking-strength <fwdWindow1,fwdRefQPDelta1,fwdNonRefQPDelta1,fwdWindow2,fwdRefQPDelta2,fwdNonRefQPDelta2, + fwdWindow3,fwdRefQPDelta3,fwdNonRefQPDelta3,fwdWindow4,fwdRefQPDelta4,fwdNonRefQPDelta4, + fwdWindow5,fwdRefQPDelta5,fwdNonRefQPDelta5,fwdWindow6,fwdRefQPDelta6,fwdNonRefQPDelta6, + bwdWindow1,bwdRefQPDelta1,bwdNonRefQPDelta1,bwdWindow2,bwdRefQPDelta2,bwdNonRefQPDelta2, + bwdWindow3,bwdRefQPDelta3,bwdNonRefQPDelta3,bwdWindow4,bwdRefQPDelta4,bwdNonRefQPDelta4, + bwdWindow5,bwdRefQPDelta5,bwdNonRefQPDelta5,bwdWindow6,bwdRefQPDelta6,bwdNonRefQPDelta6> +-----------------+---------------------------------------------------------------+ | Parameter | Description | +=================+===============================================================+ - | fwdWindow | The duration(in milliseconds) for which there is a reduction | - | | in the bits spent on the inter-frames after a scenecut by | - | | increasing their QP. Default 500ms. | - | | **Range of values:** 0 to 1000 | + | fwdMaxWindow | The maximum duration(in milliseconds) for which there is a | + | | reduction in the bits spent on the inter-frames after a | + | | scenecut by increasing their QP. Default 500ms. | + | | **Range of values:** 0 to 2000 | + +-----------------+---------------------------------------------------------------+ + | fwdWindow | The duration of a sub-window(in milliseconds) for which there | + | | is a reduction in the bits spent on the inter-frames after a | + | | scenecut by increasing their QP. Default 500ms. | + | | **Range of values:** 0 to 2000 | +-----------------+---------------------------------------------------------------+ | fwdRefQPDelta | The offset by which QP is incremented for inter-frames | | | after a scenecut. Default 5. | - | | **Range of values:** 0 to 10 | + | | **Range of values:** 0 to 20 | +-----------------+---------------------------------------------------------------+ | fwdNonRefQPDelta| The offset by which QP is incremented for non-referenced | | | inter-frames after a scenecut. The offset is computed from | | | fwdRefQPDelta when it is not explicitly specified. 
| - | | **Range of values:** 0 to 10 | + | | **Range of values:** 0 to 20 | + +-----------------+---------------------------------------------------------------+ + | bwdMaxWindow | The maximum duration(in milliseconds) for which there is a | + | | reduction in the bits spent on the inter-frames before a | + | | scenecut by increasing their QP. Default 100ms. | + | | **Range of values:** 0 to 2000 | +-----------------+---------------------------------------------------------------+ - | bwdWindow | The duration(in milliseconds) for which there is a reduction | - | | in the bits spent on the inter-frames before a scenecut by | - | | increasing their QP. Default 100ms. | - | | **Range of values:** 0 to 1000 | + | bwdWindow | The duration of a sub-window(in milliseconds) for which there | + | | is a reduction in the bits spent on the inter-frames before a | + | | scenecut by increasing their QP. Default 100ms. | + | | **Range of values:** 0 to 2000 | +-----------------+---------------------------------------------------------------+ | bwdRefQPDelta | The offset by which QP is incremented for inter-frames | | | before a scenecut. The offset is computed from | | | fwdRefQPDelta when it is not explicitly specified. | - | | **Range of values:** 0 to 10 | + | | **Range of values:** 0 to 20 | +-----------------+---------------------------------------------------------------+ | bwdNonRefQPDelta| The offset by which QP is incremented for non-referenced | | | inter-frames before a scenecut. The offset is computed from | | | bwdRefQPDelta when it is not explicitly specified. | - | | **Range of values:** 0 to 10 | + | | **Range of values:** 0 to 20 | +-----------------+---------------------------------------------------------------+ - **CLI ONLY** + We can specify the value for the Use :option:`--masking-strength` parameter in different formats. + 1. If we don't specify --masking-strength and specify only --scenecut-aware-qp, then default offset and window size values are considered. + 2. If we specify --masking-strength with the format 1 mentioned above, the values of window, refQpDelta and nonRefQpDelta given by the user are taken for window 1 and the offsets for the remaining windows are derived with 15% difference between windows. + 3. If we specify the --masking-strength with the format 2 mentioned above, the values of window, refQpDelta and nonRefQpDelta given by the user for each window from 1 to 6 are directly used.NOTE: We can use this format to specify zero offsets for any particular window + + Sample config file:: (Format 2 Forward masking explained here) + + --scenecut-aware-qp 1 --masking-strength 1000,8,12 + + The above sample config file is available in `the downloads page <https://bitbucket.org/multicoreware/x265_git/downloads/scenecut_qp_config.txt>`_ .. option:: --vbv-live-multi-pass, --no-vbv-live-multi-pass @@ -2057,6 +2097,14 @@ rate control mode. Default disabled. **Experimental feature** + + +.. option:: bEncFocusedFramesOnly + + Used to trigger encoding of selective GOPs; Disabled by default. + + **API ONLY** + Quantization Options ==================== @@ -2427,6 +2475,81 @@ Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation. Required for HLG (Hybrid Log Gamma) signaling. Not signaled by default. +.. option:: --video-signal-type-preset <string> + + Specify combinations of color primaries, transfer characteristics, color matrix, + range of luma and chroma signals, and chroma sample location. 
+ String format: <system-id>:<color-volume> + + This has higher precedence than individual VUI parameters. If any individual VUI option + is specified together with this, which changes the values set corresponding to the system-id + or color-volume, it will be discarded. + + system-id options and their corresponding values: + +----------------+---------------------------------------------------------------+ + | system-id | Value | + +================+===============================================================+ + | BT601_525 | --colorprim smpte170m --transfer smpte170m | + | | --colormatrix smpte170m --range limited --chromaloc 0 | + +----------------+---------------------------------------------------------------+ + | BT601_626 | --colorprim bt470bg --transfer smpte170m --colormatrix bt470bg| + | | --range limited --chromaloc 0 | + +----------------+---------------------------------------------------------------+ + | BT709_YCC | --colorprim bt709 --transfer bt709 --colormatrix bt709 | + | | --range limited --chromaloc 0 | + +----------------+---------------------------------------------------------------+ + | BT709_RGB | --colorprim bt709 --transfer bt709 --colormatrix gbr | + | | --range limited | + +----------------+---------------------------------------------------------------+ + | BT2020_YCC_NCL | --colorprim bt2020 --transfer bt2020-10 --colormatrix bt709 | + | | --range limited --chromaloc 2 | + +----------------+---------------------------------------------------------------+ + | BT2020_RGB | --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc| + | | --range limited | + +----------------+---------------------------------------------------------------+ + | BT2100_PQ_YCC | --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc| + | | --range limited --chromaloc 2 | + +----------------+---------------------------------------------------------------+ + | BT2100_PQ_ICTCP| --colorprim bt2020 --transfer smpte2084 --colormatrix ictcp | + | | --range limited --chromaloc 2 | + +----------------+---------------------------------------------------------------+ + | BT2100_PQ_RGB | --colorprim bt2020 --transfer smpte2084 --colormatrix gbr | + | | --range limited | + +----------------+---------------------------------------------------------------+ + | BT2100_HLG_YCC | --colorprim bt2020 --transfer arib-std-b67 | + | | --colormatrix bt2020nc --range limited --chromaloc 2 | + +----------------+---------------------------------------------------------------+ + | BT2100_HLG_RGB | --colorprim bt2020 --transfer arib-std-b67 --colormatrix gbr | + | | --range limited | + +----------------+---------------------------------------------------------------+ + | FR709_RGB | --colorprim bt709 --transfer bt709 --colormatrix gbr | + | | --range full | + +----------------+---------------------------------------------------------------+ + | FR2020_RGB | --colorprim bt2020 --transfer bt2020-10 --colormatrix gbr | + | | --range full | + +----------------+---------------------------------------------------------------+ + | FRP3D65_YCC | --colorprim smpte432 --transfer bt709 --colormatrix smpte170m | + | | --range full --chromaloc 1 | + +----------------+---------------------------------------------------------------+ + + color-volume options and their corresponding values: + +----------------+---------------------------------------------------------------+ + | color-volume | Value | + +================+===============================================================+ + | P3D65x1000n0005| 
--master-display G(13250,34500)B(7500,3000)R(34000,16000) | + | | WP(15635,16450)L(10000000,5) | + +----------------+---------------------------------------------------------------+ + | P3D65x4000n005 | --master-display G(13250,34500)B(7500,3000)R(34000,16000) | + | | WP(15635,16450)L(40000000,50) | + +----------------+---------------------------------------------------------------+ + | BT2100x108n0005| --master-display G(8500,39850)B(6550,2300)R(34000,146000) | + | | WP(15635,16450)L(10000000,1) | + +----------------+---------------------------------------------------------------+ + + Note: The color-volume options can be used only with the system-id options BT2100_PQ_YCC, + BT2100_PQ_ICTCP, and BT2100_PQ_RGB. It is incompatible with other options. + + Bitstream options ================= @@ -2454,6 +2577,16 @@ the very first AUD will be skipped since it cannot be placed at the start of the access unit, where it belongs. Default disabled +.. option:: --eob, --no-eob + + Emit an end of bitstream NAL unit at the end of the bitstream. + Default disabled + +.. option:: --eos, --no-eos + + Emit an end of sequence NAL unit at the end of every coded + video sequence. Default disabled + .. option:: --hrd, --no-hrd Enable the signaling of HRD parameters to the decoder. The HRD @@ -2480,7 +2613,7 @@ The value is specified as a float or as an integer with the profile times 10, for example profile 5 is specified as "5" or "5.0" or "50". - Currently only profile 5, profile 8.1 and profile 8.2 enabled, Default 0 (disabled) + Currently only profile 5, profile 8.1, profile 8.2 and profile 8.4 enabled, Default 0 (disabled) .. option:: --dolby-vision-rpu <filename> @@ -2509,17 +2642,26 @@ 2. CRC 3. Checksum -.. option:: --temporal-layers,--no-temporal-layers +.. option:: --temporal-layers <integer> - Enable a temporal sub layer. All referenced I/P/B frames are in the - base layer and all unreferenced B frames are placed in a temporal - enhancement layer. A decoder may choose to drop the enhancement layer - and only decode and display the base layer slices. - - If used with a fixed GOP (:option:`--b-adapt` 0) and :option:`--bframes` - 3 then the two layers evenly split the frame rate, with a cadence of - PbBbP. You probably also want :option:`--no-scenecut` and a keyframe - interval that is a multiple of 4. + Enable specified number of temporal sub layers. For any frame in layer N, + all referenced frames are in the layer N or N-1.A decoder may choose to drop the enhancement layer + and only decode and display the base layer slices.Allowed number of temporal sub-layers + are 2 to 5.(2 and 5 inclusive) + + When enabled,temporal layers 3 through 5 configures a fixed miniGOP with the number of bframes as shown below + unless miniGOP size is modified due to lookahead decisions.Temporal layer 2 is a special case that has + all reference frames in base layer and non-reference frames in enhancement layer without any constraint on the + number of bframes.Default disabled. + +----------------+--------+ + | temporal layer | bframes| + +================+========+ + | 3 | 3 | + +----------------+--------+ + | 4 | 7 | + +----------------+--------+ + | 5 | 15 | + +----------------+--------+ .. option:: --log2-max-poc-lsb <integer> @@ -2564,6 +2706,12 @@ Emit SEI messages in a single NAL unit instead of multiple NALs. Default disabled. When HRD SEI is enabled the HM decoder will throw a warning. +.. 
option:: --film-grain <filename>
+
+	Specifies the film grain model characteristics, signalled as a film grain characteristics SEI message to support Film Grain Synthesis (FGS).
+
+	**CLI_ONLY**
+
 DCT Approximations
 =================
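The documentation additions above cover several of the new 3.6 options (--scenecut-aware-qp / --masking-strength, --video-signal-type-preset, --temporal-layers, --eob / --eos, --film-grain). As a quick orientation only, a minimal command-line sketch combining them could look like the following; the input/output file names, the film-grain model file, and the exact numeric values are illustrative placeholders, not part of the patch:

    x265 --input input.y4m --crf 22 \
         --scenecut-aware-qp 1 --masking-strength 1000,8,12 \
         --video-signal-type-preset BT2100_PQ_YCC:P3D65x1000n0005 \
         --temporal-layers 3 \
         --film-grain grain_model.bin \
         --eob --eos \
         --output out.hevc

Per the tables above, BT2100_PQ_YCC is one of the three system-ids that may be combined with a color-volume, and --temporal-layers 3 implies a fixed miniGOP of 3 B-frames unless the lookahead decides otherwise.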
View file
x265_3.5.tar.gz/doc/reST/introduction.rst -> x265_3.6.tar.gz/doc/reST/introduction.rst
Changed
@@ -77,6 +77,6 @@ to start is with the `Motion Picture Experts Group - Licensing Authority - HEVC Licensing Program <http://www.mpegla.com/main/PID/HEVC/default.aspx>`_. -x265 is a registered trademark of MulticoreWare, Inc. The x265 logo is +x265 is a registered trademark of MulticoreWare, Inc. The X265 logo is a trademark of MulticoreWare, and may only be used with explicit written permission. All rights reserved.
View file
x265_3.5.tar.gz/doc/reST/releasenotes.rst -> x265_3.6.tar.gz/doc/reST/releasenotes.rst
Changed
@@ -2,6 +2,53 @@
 Release Notes
 *************
 
+Version 3.6
+===========
+
+Release date - 4th April, 2024.
+
+New features
+------------
+1. Segment based Ratecontrol (SBRC) feature
+2. Motion-Compensated Spatio-Temporal Filtering
+3. Scene-cut aware qp - BBAQ (Bidirectional Boundary Aware Quantization)
+4. Histogram-Based Scene Change Detection
+5. Film-Grain characteristics as a SEI message to support Film Grain Synthesis (FGS)
+6. Add temporal layer implementation (Hierarchical B-frame implementation)
+
+Enhancements to existing features
+---------------------------------
+1. Added Dolby Vision 8.4 Profile Support
+
+
+API changes
+-----------
+1. Add Segment based Ratecontrol (SBRC) feature: "--no-sbrc".
+2. Add command line parameter for mcstf feature: "--no-mctf".
+3. Add command line parameters for the scene cut aware qp feature: "--scenecut-aware-qp" and "--masking-strength".
+4. Add command line parameters for Histogram-Based Scene Change Detection: "--hist-scenecut".
+5. Add film grain characteristics as a SEI message to the bitstream: "--film-grain <filename>"
+6. cli: add new option --cra-nal (Force nal type to CRA for all frames except the first frame; works only with keyint 1)
+
+Optimizations
+-------------
+ARM64 NEON optimizations: Several time-consuming C functions have been optimized for the targeted platform - aarch64. The overall performance increased by around 20%.
+SVE/SVE2 optimizations
+
+
+Bug fixes
+---------
+1. Linux bug to utilize all the cores
+2. Crash with hist-scenecut build when source resolution is not a multiple of minCuSize
+3. 32-bit and 64-bit builds generation for ARM
+4. Bugs in zonefile feature (Reflect Zonefile Parameters inside Lookahead, extra IDR issue, Avg I Slice QP value issue, etc.)
+5. Add x86 ASM implementation for subsampling luma
+6. Fix for abrladder segfault with load reuse level 1
+7. Reorder miniGOP based on temporal layer hierarchy and add support for more B frames
+8. Add MacOS aarch64 build support
+9. Fix boundary condition issue for Gaussian filter
+
+
 Version 3.5
 ===========
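Several of the API changes listed above are plain on/off switches. A hedged sketch of how they might appear on the command line (file names are placeholders; anything not shown is left at the encoder's defaults) could be:

    # histogram-based scene-change detection, with MCTF and SBRC disabled
    x265 --input input.y4m --hist-scenecut --no-mctf --no-sbrc --output out.hevc

    # force CRA NAL units on every frame; per the note above this only works with keyint 1
    x265 --input input.y4m --cra-nal --keyint 1 --output out.hevc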
View file
x265_3.5.tar.gz/readme.rst -> x265_3.6.tar.gz/readme.rst
Changed
@@ -2,7 +2,7 @@ x265 HEVC Encoder ================= -| **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_ +| **Read:** | Online `documentation <http://x265.readthedocs.org/en/master/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265_git/wiki/>`_ | **Download:** | `releases <http://ftp.videolan.org/pub/videolan/x265/>`_ | **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
View file
x265_3.5.tar.gz/source/CMakeLists.txt -> x265_3.6.tar.gz/source/CMakeLists.txt
Changed
@@ -29,7 +29,7 @@ option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF) mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD) # X265_BUILD must be incremented each time the public API is changed -set(X265_BUILD 199) +set(X265_BUILD 209) configure_file("${PROJECT_SOURCE_DIR}/x265.def.in" "${PROJECT_BINARY_DIR}/x265.def") configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in" @@ -38,14 +38,20 @@ SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}") # System architecture detection -string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC) +if (APPLE AND CMAKE_OSX_ARCHITECTURES) + string(TOLOWER "${CMAKE_OSX_ARCHITECTURES}" SYSPROC) +else() + string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC) +endif() set(X86_ALIASES x86 i386 i686 x86_64 amd64) -set(ARM_ALIASES armv6l armv7l aarch64) +set(ARM_ALIASES armv6l armv7l) +set(ARM64_ALIASES arm64 arm64e aarch64) list(FIND X86_ALIASES "${SYSPROC}" X86MATCH) list(FIND ARM_ALIASES "${SYSPROC}" ARMMATCH) -set(POWER_ALIASES ppc64 ppc64le) +list(FIND ARM64_ALIASES "${SYSPROC}" ARM64MATCH) +set(POWER_ALIASES powerpc64 powerpc64le ppc64 ppc64le) list(FIND POWER_ALIASES "${SYSPROC}" POWERMATCH) -if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1") +if(X86MATCH GREATER "-1") set(X86 1) add_definitions(-DX265_ARCH_X86=1) if(CMAKE_CXX_FLAGS STREQUAL "-m32") @@ -70,15 +76,18 @@ else() set(CROSS_COMPILE_ARM 0) endif() + message(STATUS "Detected ARM target processor") set(ARM 1) - if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8) - message(STATUS "Detected ARM64 target processor") - set(ARM64 1) - add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=1 -DHAVE_ARMV6=0) - else() - message(STATUS "Detected ARM target processor") - add_definitions(-DX265_ARCH_ARM=1 -DX265_ARCH_ARM64=0 -DHAVE_ARMV6=1) - endif() + add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1) +elseif(ARM64MATCH GREATER "-1") + #if(CROSS_COMPILE_ARM64) + #message(STATUS "Cross compiling for ARM64 arch") + #else() + #set(CROSS_COMPILE_ARM64 0) + #endif() + message(STATUS "Detected ARM64 target processor") + set(ARM64 1) + add_definitions(-DX265_ARCH_ARM64=1 -DHAVE_NEON) else() message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown") message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}") @@ -239,26 +248,43 @@ endif() endif() if(ARM AND CROSS_COMPILE_ARM) - if(ARM64) - set(ARM_ARGS -fPIC) - else() - set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC) - endif() message(STATUS "cross compile arm") + set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC) elseif(ARM) - if(ARM64) - set(ARM_ARGS -fPIC) + find_package(Neon) + if(CPU_HAS_NEON) + set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC) add_definitions(-DHAVE_NEON) else() - find_package(Neon) - if(CPU_HAS_NEON) - set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC) - add_definitions(-DHAVE_NEON) - else() - set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm) - endif() + set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm) endif() endif() + if(ARM64 OR CROSS_COMPILE_ARM64) + find_package(Neon) + find_package(SVE) + find_package(SVE2) + if(CPU_HAS_SVE2 OR CROSS_COMPILE_SVE2) + message(STATUS "Found SVE2") + set(ARM_ARGS -O3 -march=armv8-a+sve2 -fPIC -flax-vector-conversions) + add_definitions(-DHAVE_SVE2) + add_definitions(-DHAVE_SVE) + add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that use SVE2 + elseif(CPU_HAS_SVE OR 
CROSS_COMPILE_SVE) + message(STATUS "Found SVE") + set(ARM_ARGS -O3 -march=armv8-a+sve -fPIC -flax-vector-conversions) + add_definitions(-DHAVE_SVE) + add_definitions(-DHAVE_NEON) # for NEON c/c++ primitives, as currently there is no implementation that use SVE + elseif(CPU_HAS_NEON) + message(STATUS "Found NEON") + set(ARM_ARGS -fPIC -flax-vector-conversions) + add_definitions(-DHAVE_NEON) + else() + set(ARM_ARGS -fPIC -flax-vector-conversions) + endif() + endif() + if(ENABLE_PIC) + list(APPEND ARM_ARGS -DPIC) + endif() add_definitions(${ARM_ARGS}) if(FPROFILE_GENERATE) if(INTEL_CXX) @@ -350,7 +376,7 @@ endif(GCC) find_package(Nasm) -if(ARM OR CROSS_COMPILE_ARM) +if(ARM OR CROSS_COMPILE_ARM OR ARM64 OR CROSS_COMPILE_ARM64) option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" ON) elseif(NASM_FOUND AND X86) if (NASM_VERSION_STRING VERSION_LESS "2.13.0") @@ -384,7 +410,7 @@ endif(EXTRA_LIB) mark_as_advanced(EXTRA_LIB EXTRA_LINK_FLAGS) -if(X64) +if(X64 OR ARM64 OR PPC64) # NOTE: We only officially support high-bit-depth compiles of x265 # on 64bit architectures. Main10 plus large resolution plus slow # preset plus 32bit address space usually means malloc failure. You @@ -393,7 +419,7 @@ # license" so to speak. If it breaks you get to keep both halves. # You will need to disable assembly manually. option(HIGH_BIT_DEPTH "Store pixel samples as 16bit values (Main10/Main12)" OFF) -endif(X64) +endif(X64 OR ARM64 OR PPC64) if(HIGH_BIT_DEPTH) option(MAIN12 "Support Main12 instead of Main10" OFF) if(MAIN12) @@ -440,6 +466,18 @@ endif() add_definitions(-DX265_NS=${X265_NS}) +if(ARM64) + if(HIGH_BIT_DEPTH) + if(MAIN12) + list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=12 -DX265_NS=${X265_NS}) + else() + list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10 -DX265_NS=${X265_NS}) + endif() + else() + list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8 -DX265_NS=${X265_NS}) + endif() +endif(ARM64) + option(WARNINGS_AS_ERRORS "Stop compiles on first warning" OFF) if(WARNINGS_AS_ERRORS) if(GCC) @@ -536,11 +574,7 @@ # compile ARM arch asm files here enable_language(ASM) foreach(ASM ${ARM_ASMS}) - if(ARM64) - set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM}) - else() - set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/arm/${ASM}) - endif() + set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/arm/${ASM}) list(APPEND ASM_SRCS ${ASM_SRC}) list(APPEND ASM_OBJS ${ASM}.${SUFFIX}) add_custom_command( @@ -549,6 +583,52 @@ ARGS ${ARM_ARGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX} DEPENDS ${ASM_SRC}) endforeach() + elseif(ARM64 OR CROSS_COMPILE_ARM64) + # compile ARM64 arch asm files here + enable_language(ASM) + foreach(ASM ${ARM_ASMS}) + set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM}) + list(APPEND ASM_SRCS ${ASM_SRC}) + list(APPEND ASM_OBJS ${ASM}.${SUFFIX}) + add_custom_command( + OUTPUT ${ASM}.${SUFFIX} + COMMAND ${CMAKE_CXX_COMPILER} + ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX} + DEPENDS ${ASM_SRC}) + endforeach() + if(CPU_HAS_SVE2 OR CROSS_COMPILE_SVE2) + foreach(ASM ${ARM_ASMS_SVE}) + set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM}) + list(APPEND ASM_SRCS ${ASM_SRC}) + list(APPEND ASM_OBJS ${ASM}.${SUFFIX}) + add_custom_command( + OUTPUT ${ASM}.${SUFFIX} + COMMAND ${CMAKE_CXX_COMPILER} + ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX} + DEPENDS ${ASM_SRC}) + endforeach() + foreach(ASM ${ARM_ASMS_SVE2}) + set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM}) + list(APPEND ASM_SRCS ${ASM_SRC}) + list(APPEND ASM_OBJS 
${ASM}.${SUFFIX}) + add_custom_command( + OUTPUT ${ASM}.${SUFFIX} + COMMAND ${CMAKE_CXX_COMPILER} + ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX} + DEPENDS ${ASM_SRC}) + endforeach() + elseif(CPU_HAS_SVE OR CROSS_COMPILE_SVE) + foreach(ASM ${ARM_ASMS_SVE}) + set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/aarch64/${ASM}) + list(APPEND ASM_SRCS ${ASM_SRC}) + list(APPEND ASM_OBJS ${ASM}.${SUFFIX}) + add_custom_command( + OUTPUT ${ASM}.${SUFFIX} + COMMAND ${CMAKE_CXX_COMPILER} + ARGS ${ARM_ARGS} ${ASM_FLAGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX} + DEPENDS ${ASM_SRC}) + endforeach() + endif() elseif(X86) # compile X86 arch asm files here foreach(ASM ${MSVC_ASMS})
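The CMakeLists.txt changes above add explicit ARM64/AArch64 detection (including Apple arm64 via CMAKE_OSX_ARCHITECTURES), optional SVE/SVE2 assembly paths, and allow HIGH_BIT_DEPTH builds on 64-bit ARM and PPC64. A minimal configure sketch under those assumptions (directory names and generator defaults are illustrative, not part of the patch) might be:

    # native aarch64 build with Main10 support and assembly enabled
    cd x265_3.6/source && mkdir -p build && cd build
    cmake -DHIGH_BIT_DEPTH=ON -DENABLE_ASSEMBLY=ON ..
    make -j"$(nproc)"

    # macOS arm64 build, steering architecture detection via CMAKE_OSX_ARCHITECTURES
    cmake -DCMAKE_OSX_ARCHITECTURES=arm64 -DHIGH_BIT_DEPTH=ON ..

With an SVE- or SVE2-capable toolchain the new find_package(SVE) / find_package(SVE2) checks run automatically, so no extra cache variable should be needed beyond a compiler that accepts -march=armv8-a+sve or +sve2.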
View file
x265_3.5.tar.gz/source/abrEncApp.cpp -> x265_3.6.tar.gz/source/abrEncApp.cpp
Changed
@@ -1,1111 +1,1111 @@ -/***************************************************************************** -* Copyright (C) 2013-2020 MulticoreWare, Inc -* -* Authors: Pooja Venkatesan <pooja@multicorewareinc.com> -* Aruna Matheswaran <aruna@multicorewareinc.com> -* -* This program is free software; you can redistribute it and/or modify -* it under the terms of the GNU General Public License as published by -* the Free Software Foundation; either version 2 of the License, or -* (at your option) any later version. -* -* This program is distributed in the hope that it will be useful, -* but WITHOUT ANY WARRANTY; without even the implied warranty of -* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -* GNU General Public License for more details. -* -* You should have received a copy of the GNU General Public License -* along with this program; if not, write to the Free Software -* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. -* -* This program is also available under a commercial proprietary license. -* For more information, contact us at license @ x265.com. -*****************************************************************************/ - -#include "abrEncApp.h" -#include "mv.h" -#include "slice.h" -#include "param.h" - -#include <signal.h> -#include <errno.h> - -#include <queue> - -using namespace X265_NS; - -/* Ctrl-C handler */ -static volatile sig_atomic_t b_ctrl_c /* = 0 */; -static void sigint_handler(int) -{ - b_ctrl_c = 1; -} - -namespace X265_NS { - // private namespace -#define X265_INPUT_QUEUE_SIZE 250 - - AbrEncoder::AbrEncoder(CLIOptions cliopt, uint8_t numEncodes, int &ret) - { - m_numEncodes = numEncodes; - m_numActiveEncodes.set(numEncodes); - m_queueSize = (numEncodes > 1) ? X265_INPUT_QUEUE_SIZE : 1; - m_passEnc = X265_MALLOC(PassEncoder*, m_numEncodes); - - for (uint8_t i = 0; i < m_numEncodes; i++) - { - m_passEnci = new PassEncoder(i, cliopti, this); - if (!m_passEnci) - { - x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for passEncoder\n"); - ret = 4; - } - m_passEnci->init(ret); - } - - if (!allocBuffers()) - { - x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for buffers\n"); - ret = 4; - } - - /* start passEncoder worker threads */ - for (uint8_t pass = 0; pass < m_numEncodes; pass++) - m_passEncpass->startThreads(); - } - - bool AbrEncoder::allocBuffers() - { - m_inputPicBuffer = X265_MALLOC(x265_picture**, m_numEncodes); - m_analysisBuffer = X265_MALLOC(x265_analysis_data*, m_numEncodes); - - m_picWriteCnt = new ThreadSafeIntegerm_numEncodes; - m_picReadCnt = new ThreadSafeIntegerm_numEncodes; - m_analysisWriteCnt = new ThreadSafeIntegerm_numEncodes; - m_analysisReadCnt = new ThreadSafeIntegerm_numEncodes; - - m_picIdxReadCnt = X265_MALLOC(ThreadSafeInteger*, m_numEncodes); - m_analysisWrite = X265_MALLOC(ThreadSafeInteger*, m_numEncodes); - m_analysisRead = X265_MALLOC(ThreadSafeInteger*, m_numEncodes); - m_readFlag = X265_MALLOC(int*, m_numEncodes); - - for (uint8_t pass = 0; pass < m_numEncodes; pass++) - { - m_inputPicBufferpass = X265_MALLOC(x265_picture*, m_queueSize); - for (uint32_t idx = 0; idx < m_queueSize; idx++) - { - m_inputPicBufferpassidx = x265_picture_alloc(); - x265_picture_init(m_passEncpass->m_param, m_inputPicBufferpassidx); - } - - CHECKED_MALLOC_ZERO(m_analysisBufferpass, x265_analysis_data, m_queueSize); - m_picIdxReadCntpass = new ThreadSafeIntegerm_queueSize; - m_analysisWritepass = new ThreadSafeIntegerm_queueSize; - m_analysisReadpass = new ThreadSafeIntegerm_queueSize; - 
m_readFlagpass = X265_MALLOC(int, m_queueSize); - } - return true; - fail: - return false; - } - - void AbrEncoder::destroy() - { - x265_cleanup(); /* Free library singletons */ - for (uint8_t pass = 0; pass < m_numEncodes; pass++) - { - for (uint32_t index = 0; index < m_queueSize; index++) - { - X265_FREE(m_inputPicBufferpassindex->planes0); - x265_picture_free(m_inputPicBufferpassindex); - } - - X265_FREE(m_inputPicBufferpass); - X265_FREE(m_analysisBufferpass); - X265_FREE(m_readFlagpass); - delete m_picIdxReadCntpass; - delete m_analysisWritepass; - delete m_analysisReadpass; - m_passEncpass->destroy(); - delete m_passEncpass; - } - X265_FREE(m_inputPicBuffer); - X265_FREE(m_analysisBuffer); - X265_FREE(m_readFlag); - - delete m_picWriteCnt; - delete m_picReadCnt; - delete m_analysisWriteCnt; - delete m_analysisReadCnt; - - X265_FREE(m_picIdxReadCnt); - X265_FREE(m_analysisWrite); - X265_FREE(m_analysisRead); - - X265_FREE(m_passEnc); - } - - PassEncoder::PassEncoder(uint32_t id, CLIOptions cliopt, AbrEncoder *parent) - { - m_id = id; - m_cliopt = cliopt; - m_parent = parent; - if(!(m_cliopt.enableScaler && m_id)) - m_input = m_cliopt.input; - m_param = cliopt.param; - m_inputOver = false; - m_lastIdx = -1; - m_encoder = NULL; - m_scaler = NULL; - m_reader = NULL; - m_ret = 0; - } - - int PassEncoder::init(int &result) - { - if (m_parent->m_numEncodes > 1) - setReuseLevel(); - - if (!(m_cliopt.enableScaler && m_id)) - m_reader = new Reader(m_id, this); - else - { - VideoDesc *src = NULL, *dst = NULL; - dst = new VideoDesc(m_param->sourceWidth, m_param->sourceHeight, m_param->internalCsp, m_param->internalBitDepth); - int dstW = m_parent->m_passEncm_id - 1->m_param->sourceWidth; - int dstH = m_parent->m_passEncm_id - 1->m_param->sourceHeight; - src = new VideoDesc(dstW, dstH, m_param->internalCsp, m_param->internalBitDepth); - if (src != NULL && dst != NULL) - { - m_scaler = new Scaler(0, 1, m_id, src, dst, this); - if (!m_scaler) - { - x265_log(m_param, X265_LOG_ERROR, "\n MALLOC failure in Scaler"); - result = 4; - } - } - } - - /* note: we could try to acquire a different libx265 API here based on - * the profile found during option parsing, but it must be done before - * opening an encoder */ - - if (m_param) - m_encoder = m_cliopt.api->encoder_open(m_param); - if (!m_encoder) - { - x265_log(NULL, X265_LOG_ERROR, "x265_encoder_open() failed for Enc, \n"); - m_ret = 2; - return -1; - } - - /* get the encoder parameters post-initialization */ - m_cliopt.api->encoder_parameters(m_encoder, m_param); - - return 1; - } - - void PassEncoder::setReuseLevel() - { - uint32_t r, padh = 0, padw = 0; - - m_param->confWinBottomOffset = m_param->confWinRightOffset = 0; - - m_param->analysisLoadReuseLevel = m_cliopt.loadLevel; - m_param->analysisSaveReuseLevel = m_cliopt.saveLevel; - m_param->analysisSave = m_cliopt.saveLevel ? "save.dat" : NULL; - m_param->analysisLoad = m_cliopt.loadLevel ? 
"load.dat" : NULL; - m_param->bUseAnalysisFile = 0; - - if (m_cliopt.loadLevel) - { - x265_param *refParam = m_parent->m_passEncm_cliopt.refId->m_param; - - if (m_param->sourceHeight == (refParam->sourceHeight - refParam->confWinBottomOffset) && - m_param->sourceWidth == (refParam->sourceWidth - refParam->confWinRightOffset)) - { - m_parent->m_passEncm_id->m_param->confWinBottomOffset = refParam->confWinBottomOffset; - m_parent->m_passEncm_id->m_param->confWinRightOffset = refParam->confWinRightOffset; - } - else - { - int srcH = refParam->sourceHeight - refParam->confWinBottomOffset; - int srcW = refParam->sourceWidth - refParam->confWinRightOffset; - - double scaleFactorH = double(m_param->sourceHeight / srcH); - double scaleFactorW = double(m_param->sourceWidth / srcW); - - int absScaleFactorH = (int)(10 * scaleFactorH + 0.5); - int absScaleFactorW = (int)(10 * scaleFactorW + 0.5); - - if (absScaleFactorH == 20 && absScaleFactorW == 20) - { - m_param->scaleFactor = 2; - - m_parent->m_passEncm_id->m_param->confWinBottomOffset = refParam->confWinBottomOffset * 2; - m_parent->m_passEncm_id->m_param->confWinRightOffset = refParam->confWinRightOffset * 2; - - } - } - } - - int h = m_param->sourceHeight + m_param->confWinBottomOffset; - int w = m_param->sourceWidth + m_param->confWinRightOffset; - if (h & (m_param->minCUSize - 1)) - { - r = h & (m_param->minCUSize - 1); - padh = m_param->minCUSize - r; - m_param->confWinBottomOffset += padh; - - } - - if (w & (m_param->minCUSize - 1)) - { - r = w & (m_param->minCUSize - 1); - padw = m_param->minCUSize - r; - m_param->confWinRightOffset += padw; - } - } - - void PassEncoder::startThreads() - { - /* Start slave worker threads */ - m_threadActive = true; - start(); - /* Start reader threads*/ - if (m_reader != NULL) - { - m_reader->m_threadActive = true; - m_reader->start(); - } - /* Start scaling worker threads */ - if (m_scaler != NULL) - { - m_scaler->m_threadActive = true; - m_scaler->start(); - } - } - - void PassEncoder::copyInfo(x265_analysis_data * src) - { - - uint32_t written = m_parent->m_analysisWriteCntm_id.get(); - - int index = written % m_parent->m_queueSize; - //If all streams have read analysis data, reuse that position in Queue - - int read = m_parent->m_analysisReadm_idindex.get(); - int write = m_parent->m_analysisWritem_idindex.get(); - - int overwrite = written / m_parent->m_queueSize; - bool emptyIdxFound = 0; - while (!emptyIdxFound && overwrite) - { - for (uint32_t i = 0; i < m_parent->m_queueSize; i++) - { - read = m_parent->m_analysisReadm_idi.get(); - write = m_parent->m_analysisWritem_idi.get(); - write *= m_cliopt.numRefs; - - if (read == write) - { - index = i; - emptyIdxFound = 1; - } - } - } - - x265_analysis_data *m_analysisInfo = &m_parent->m_analysisBufferm_idindex; - - x265_free_analysis_data(m_param, m_analysisInfo); - memcpy(m_analysisInfo, src, sizeof(x265_analysis_data)); - x265_alloc_analysis_data(m_param, m_analysisInfo); - - bool isVbv = m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate; - if (m_param->bDisableLookahead && isVbv) - { - memcpy(m_analysisInfo->lookahead.intraSatdForVbv, src->lookahead.intraSatdForVbv, src->numCuInHeight * sizeof(uint32_t)); - memcpy(m_analysisInfo->lookahead.satdForVbv, src->lookahead.satdForVbv, src->numCuInHeight * sizeof(uint32_t)); - memcpy(m_analysisInfo->lookahead.intraVbvCost, src->lookahead.intraVbvCost, src->numCUsInFrame * sizeof(uint32_t)); - memcpy(m_analysisInfo->lookahead.vbvCost, src->lookahead.vbvCost, src->numCUsInFrame * sizeof(uint32_t)); - } - - 
if (src->sliceType == X265_TYPE_IDR || src->sliceType == X265_TYPE_I) - { - if (m_param->analysisSaveReuseLevel < 2) - goto ret; - x265_analysis_intra_data *intraDst, *intraSrc; - intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData; - intraSrc = (x265_analysis_intra_data*)src->intraData; - memcpy(intraDst->depth, intraSrc->depth, sizeof(uint8_t) * src->depthBytes); - memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numCUsInFrame * src->numPartitions); - memcpy(intraDst->partSizes, intraSrc->partSizes, sizeof(char) * src->depthBytes); - memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes); - if (m_param->rc.cuTree) - memcpy(intraDst->cuQPOff, intraSrc->cuQPOff, sizeof(int8_t) * src->depthBytes); - } - else - { - bool bIntraInInter = (src->sliceType == X265_TYPE_P || m_param->bIntraInBFrames); - int numDir = src->sliceType == X265_TYPE_P ? 1 : 2; - memcpy(m_analysisInfo->wt, src->wt, sizeof(WeightParam) * 3 * numDir); - if (m_param->analysisSaveReuseLevel < 2) - goto ret; - x265_analysis_inter_data *interDst, *interSrc; - interDst = (x265_analysis_inter_data*)m_analysisInfo->interData; - interSrc = (x265_analysis_inter_data*)src->interData; - memcpy(interDst->depth, interSrc->depth, sizeof(uint8_t) * src->depthBytes); - memcpy(interDst->modes, interSrc->modes, sizeof(uint8_t) * src->depthBytes); - if (m_param->rc.cuTree) - memcpy(interDst->cuQPOff, interSrc->cuQPOff, sizeof(int8_t) * src->depthBytes); - if (m_param->analysisSaveReuseLevel > 4) - { - memcpy(interDst->partSize, interSrc->partSize, sizeof(uint8_t) * src->depthBytes); - memcpy(interDst->mergeFlag, interSrc->mergeFlag, sizeof(uint8_t) * src->depthBytes); - if (m_param->analysisSaveReuseLevel == 10) - { - memcpy(interDst->interDir, interSrc->interDir, sizeof(uint8_t) * src->depthBytes); - for (int dir = 0; dir < numDir; dir++) - { - memcpy(interDst->mvpIdxdir, interSrc->mvpIdxdir, sizeof(uint8_t) * src->depthBytes); - memcpy(interDst->refIdxdir, interSrc->refIdxdir, sizeof(int8_t) * src->depthBytes); - memcpy(interDst->mvdir, interSrc->mvdir, sizeof(MV) * src->depthBytes); - } - if (bIntraInInter) - { - x265_analysis_intra_data *intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData; - x265_analysis_intra_data *intraSrc = (x265_analysis_intra_data*)src->intraData; - memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numPartitions * src->numCUsInFrame); - memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes); - } - } - } - if (m_param->analysisSaveReuseLevel != 10) - memcpy(interDst->ref, interSrc->ref, sizeof(int32_t) * src->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); - } - -ret: - //increment analysis Write counter - m_parent->m_analysisWriteCntm_id.incr(); - m_parent->m_analysisWritem_idindex.incr(); - return; - } - - - bool PassEncoder::readPicture(x265_picture *dstPic) - { - /*Check and wait if there any input frames to read*/ - int ipread = m_parent->m_picReadCntm_id.get(); - int ipwrite = m_parent->m_picWriteCntm_id.get(); - - bool isAbrLoad = m_cliopt.loadLevel && (m_parent->m_numEncodes > 1); - while (!m_inputOver && (ipread == ipwrite)) - { - ipwrite = m_parent->m_picWriteCntm_id.waitForChange(ipwrite); - } - - if (m_threadActive && ipread < ipwrite) - { - /*Get input index to read from inputQueue. 
If doesn't need analysis info, it need not wait to fetch poc from analysisQueue*/ - int readPos = ipread % m_parent->m_queueSize; - x265_analysis_data* analysisData = 0; - - if (isAbrLoad) - { - /*If stream is master of each slave pass, then fetch analysis data from prev pass*/ - int analysisQId = m_cliopt.refId; - /*Check and wait if there any analysis Data to read*/ - int analysisWrite = m_parent->m_analysisWriteCntanalysisQId.get(); - int written = analysisWrite * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; - int analysisRead = m_parent->m_analysisReadCntanalysisQId.get(); - - while (m_threadActive && written == analysisRead) - { - analysisWrite = m_parent->m_analysisWriteCntanalysisQId.waitForChange(analysisWrite); - written = analysisWrite * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; - } - - if (analysisRead < written) - { - int analysisIdx = 0; - if (!m_param->bDisableLookahead) - { - bool analysisdRead = false; - while ((analysisRead < written) && !analysisdRead) - { - while (analysisWrite < ipread) - { - analysisWrite = m_parent->m_analysisWriteCntanalysisQId.waitForChange(analysisWrite); - written = analysisWrite * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; - } - for (uint32_t i = 0; i < m_parent->m_queueSize; i++) - { - analysisData = &m_parent->m_analysisBufferanalysisQIdi; - int read = m_parent->m_analysisReadanalysisQIdi.get(); - int write = m_parent->m_analysisWriteanalysisQIdi.get() * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; - if ((analysisData->poc == (uint32_t)(ipread)) && (read < write)) - { - analysisIdx = i; - analysisdRead = true; - break; - } - } - } - } - else - { - analysisIdx = analysisRead % m_parent->m_queueSize; - analysisData = &m_parent->m_analysisBufferanalysisQIdanalysisIdx; - readPos = analysisData->poc % m_parent->m_queueSize; - while ((ipwrite < readPos) || ((ipwrite - 1) < (int)analysisData->poc)) - { - ipwrite = m_parent->m_picWriteCntm_id.waitForChange(ipwrite); - } - } - - m_lastIdx = analysisIdx; - } - else - return false; - } - - - x265_picture *srcPic = (x265_picture*)(m_parent->m_inputPicBufferm_idreadPos); - - x265_picture *pic = (x265_picture*)(dstPic); - pic->colorSpace = srcPic->colorSpace; - pic->bitDepth = srcPic->bitDepth; - pic->framesize = srcPic->framesize; - pic->height = srcPic->height; - pic->pts = srcPic->pts; - pic->dts = srcPic->dts; - pic->reorderedPts = srcPic->reorderedPts; - pic->width = srcPic->width; - pic->analysisData = srcPic->analysisData; - pic->userSEI = srcPic->userSEI; - pic->stride0 = srcPic->stride0; - pic->stride1 = srcPic->stride1; - pic->stride2 = srcPic->stride2; - pic->planes0 = srcPic->planes0; - pic->planes1 = srcPic->planes1; - pic->planes2 = srcPic->planes2; - if (isAbrLoad) - pic->analysisData = *analysisData; - return true; - } - else - return false; - } - - void PassEncoder::threadMain() - { +/***************************************************************************** +* Copyright (C) 2013-2020 MulticoreWare, Inc +* +* Authors: Pooja Venkatesan <pooja@multicorewareinc.com> +* Aruna Matheswaran <aruna@multicorewareinc.com> +* +* This program is free software; you can redistribute it and/or modify +* it under the terms of the GNU General Public License as published by +* the Free Software Foundation; either version 2 of the License, or +* (at your option) any later version. +* +* This program is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the +* GNU General Public License for more details. +* +* You should have received a copy of the GNU General Public License +* along with this program; if not, write to the Free Software +* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. +* +* This program is also available under a commercial proprietary license. +* For more information, contact us at license @ x265.com. +*****************************************************************************/ + +#include "abrEncApp.h" +#include "mv.h" +#include "slice.h" +#include "param.h" + +#include <signal.h> +#include <errno.h> + +#include <queue> + +using namespace X265_NS; + +/* Ctrl-C handler */ +static volatile sig_atomic_t b_ctrl_c /* = 0 */; +static void sigint_handler(int) +{ + b_ctrl_c = 1; +} + +namespace X265_NS { + // private namespace +#define X265_INPUT_QUEUE_SIZE 250 + + AbrEncoder::AbrEncoder(CLIOptions cliopt, uint8_t numEncodes, int &ret) + { + m_numEncodes = numEncodes; + m_numActiveEncodes.set(numEncodes); + m_queueSize = (numEncodes > 1) ? X265_INPUT_QUEUE_SIZE : 1; + m_passEnc = X265_MALLOC(PassEncoder*, m_numEncodes); + + for (uint8_t i = 0; i < m_numEncodes; i++) + { + m_passEnci = new PassEncoder(i, cliopti, this); + if (!m_passEnci) + { + x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for passEncoder\n"); + ret = 4; + } + m_passEnci->init(ret); + } + + if (!allocBuffers()) + { + x265_log(NULL, X265_LOG_ERROR, "Unable to allocate memory for buffers\n"); + ret = 4; + } + + /* start passEncoder worker threads */ + for (uint8_t pass = 0; pass < m_numEncodes; pass++) + m_passEncpass->startThreads(); + } + + bool AbrEncoder::allocBuffers() + { + m_inputPicBuffer = X265_MALLOC(x265_picture**, m_numEncodes); + m_analysisBuffer = X265_MALLOC(x265_analysis_data*, m_numEncodes); + + m_picWriteCnt = new ThreadSafeIntegerm_numEncodes; + m_picReadCnt = new ThreadSafeIntegerm_numEncodes; + m_analysisWriteCnt = new ThreadSafeIntegerm_numEncodes; + m_analysisReadCnt = new ThreadSafeIntegerm_numEncodes; + + m_picIdxReadCnt = X265_MALLOC(ThreadSafeInteger*, m_numEncodes); + m_analysisWrite = X265_MALLOC(ThreadSafeInteger*, m_numEncodes); + m_analysisRead = X265_MALLOC(ThreadSafeInteger*, m_numEncodes); + m_readFlag = X265_MALLOC(int*, m_numEncodes); + + for (uint8_t pass = 0; pass < m_numEncodes; pass++) + { + m_inputPicBufferpass = X265_MALLOC(x265_picture*, m_queueSize); + for (uint32_t idx = 0; idx < m_queueSize; idx++) + { + m_inputPicBufferpassidx = x265_picture_alloc(); + x265_picture_init(m_passEncpass->m_param, m_inputPicBufferpassidx); + } + + CHECKED_MALLOC_ZERO(m_analysisBufferpass, x265_analysis_data, m_queueSize); + m_picIdxReadCntpass = new ThreadSafeIntegerm_queueSize; + m_analysisWritepass = new ThreadSafeIntegerm_queueSize; + m_analysisReadpass = new ThreadSafeIntegerm_queueSize; + m_readFlagpass = X265_MALLOC(int, m_queueSize); + } + return true; + fail: + return false; + } + + void AbrEncoder::destroy() + { + x265_cleanup(); /* Free library singletons */ + for (uint8_t pass = 0; pass < m_numEncodes; pass++) + { + for (uint32_t index = 0; index < m_queueSize; index++) + { + X265_FREE(m_inputPicBufferpassindex->planes0); + x265_picture_free(m_inputPicBufferpassindex); + } + + X265_FREE(m_inputPicBufferpass); + X265_FREE(m_analysisBufferpass); + X265_FREE(m_readFlagpass); + delete m_picIdxReadCntpass; + delete m_analysisWritepass; + delete m_analysisReadpass; + m_passEncpass->destroy(); + delete m_passEncpass; + } + X265_FREE(m_inputPicBuffer); + X265_FREE(m_analysisBuffer); + 
X265_FREE(m_readFlag); + + delete m_picWriteCnt; + delete m_picReadCnt; + delete m_analysisWriteCnt; + delete m_analysisReadCnt; + + X265_FREE(m_picIdxReadCnt); + X265_FREE(m_analysisWrite); + X265_FREE(m_analysisRead); + + X265_FREE(m_passEnc); + } + + PassEncoder::PassEncoder(uint32_t id, CLIOptions cliopt, AbrEncoder *parent) + { + m_id = id; + m_cliopt = cliopt; + m_parent = parent; + if(!(m_cliopt.enableScaler && m_id)) + m_input = m_cliopt.input; + m_param = cliopt.param; + m_inputOver = false; + m_lastIdx = -1; + m_encoder = NULL; + m_scaler = NULL; + m_reader = NULL; + m_ret = 0; + } + + int PassEncoder::init(int &result) + { + if (m_parent->m_numEncodes > 1) + setReuseLevel(); + + if (!(m_cliopt.enableScaler && m_id)) + m_reader = new Reader(m_id, this); + else + { + VideoDesc *src = NULL, *dst = NULL; + dst = new VideoDesc(m_param->sourceWidth, m_param->sourceHeight, m_param->internalCsp, m_param->internalBitDepth); + int dstW = m_parent->m_passEncm_id - 1->m_param->sourceWidth; + int dstH = m_parent->m_passEncm_id - 1->m_param->sourceHeight; + src = new VideoDesc(dstW, dstH, m_param->internalCsp, m_param->internalBitDepth); + if (src != NULL && dst != NULL) + { + m_scaler = new Scaler(0, 1, m_id, src, dst, this); + if (!m_scaler) + { + x265_log(m_param, X265_LOG_ERROR, "\n MALLOC failure in Scaler"); + result = 4; + } + } + } + + if (m_cliopt.zoneFile) + { + if (!m_cliopt.parseZoneFile()) + { + x265_log(NULL, X265_LOG_ERROR, "Unable to parse zonefile in %s\n"); + fclose(m_cliopt.zoneFile); + m_cliopt.zoneFile = NULL; + } + } + + /* note: we could try to acquire a different libx265 API here based on + * the profile found during option parsing, but it must be done before + * opening an encoder */ + + if (m_param) + m_encoder = m_cliopt.api->encoder_open(m_param); + if (!m_encoder) + { + x265_log(NULL, X265_LOG_ERROR, "x265_encoder_open() failed for Enc, \n"); + m_ret = 2; + return -1; + } + + /* get the encoder parameters post-initialization */ + m_cliopt.api->encoder_parameters(m_encoder, m_param); + + return 1; + } + + void PassEncoder::setReuseLevel() + { + uint32_t r, padh = 0, padw = 0; + + m_param->confWinBottomOffset = m_param->confWinRightOffset = 0; + + m_param->analysisLoadReuseLevel = m_cliopt.loadLevel; + m_param->analysisSaveReuseLevel = m_cliopt.saveLevel; + m_param->analysisSave = m_cliopt.saveLevel ? "save.dat" : NULL; + m_param->analysisLoad = m_cliopt.loadLevel ? 
"load.dat" : NULL; + m_param->bUseAnalysisFile = 0; + + if (m_cliopt.loadLevel) + { + x265_param *refParam = m_parent->m_passEncm_cliopt.refId->m_param; + + if (m_param->sourceHeight == (refParam->sourceHeight - refParam->confWinBottomOffset) && + m_param->sourceWidth == (refParam->sourceWidth - refParam->confWinRightOffset)) + { + m_parent->m_passEncm_id->m_param->confWinBottomOffset = refParam->confWinBottomOffset; + m_parent->m_passEncm_id->m_param->confWinRightOffset = refParam->confWinRightOffset; + } + else + { + int srcH = refParam->sourceHeight - refParam->confWinBottomOffset; + int srcW = refParam->sourceWidth - refParam->confWinRightOffset; + + double scaleFactorH = double(m_param->sourceHeight / srcH); + double scaleFactorW = double(m_param->sourceWidth / srcW); + + int absScaleFactorH = (int)(10 * scaleFactorH + 0.5); + int absScaleFactorW = (int)(10 * scaleFactorW + 0.5); + + if (absScaleFactorH == 20 && absScaleFactorW == 20) + { + m_param->scaleFactor = 2; + + m_parent->m_passEncm_id->m_param->confWinBottomOffset = refParam->confWinBottomOffset * 2; + m_parent->m_passEncm_id->m_param->confWinRightOffset = refParam->confWinRightOffset * 2; + + } + } + } + + int h = m_param->sourceHeight + m_param->confWinBottomOffset; + int w = m_param->sourceWidth + m_param->confWinRightOffset; + if (h & (m_param->minCUSize - 1)) + { + r = h & (m_param->minCUSize - 1); + padh = m_param->minCUSize - r; + m_param->confWinBottomOffset += padh; + + } + + if (w & (m_param->minCUSize - 1)) + { + r = w & (m_param->minCUSize - 1); + padw = m_param->minCUSize - r; + m_param->confWinRightOffset += padw; + } + } + + void PassEncoder::startThreads() + { + /* Start slave worker threads */ + m_threadActive = true; + start(); + /* Start reader threads*/ + if (m_reader != NULL) + { + m_reader->m_threadActive = true; + m_reader->start(); + } + /* Start scaling worker threads */ + if (m_scaler != NULL) + { + m_scaler->m_threadActive = true; + m_scaler->start(); + } + } + + void PassEncoder::copyInfo(x265_analysis_data * src) + { + + uint32_t written = m_parent->m_analysisWriteCntm_id.get(); + + int index = written % m_parent->m_queueSize; + //If all streams have read analysis data, reuse that position in Queue + + int read = m_parent->m_analysisReadm_idindex.get(); + int write = m_parent->m_analysisWritem_idindex.get(); + + int overwrite = written / m_parent->m_queueSize; + bool emptyIdxFound = 0; + while (!emptyIdxFound && overwrite) + { + for (uint32_t i = 0; i < m_parent->m_queueSize; i++) + { + read = m_parent->m_analysisReadm_idi.get(); + write = m_parent->m_analysisWritem_idi.get(); + write *= m_cliopt.numRefs; + + if (read == write) + { + index = i; + emptyIdxFound = 1; + } + } + } + + x265_analysis_data *m_analysisInfo = &m_parent->m_analysisBufferm_idindex; + + x265_free_analysis_data(m_param, m_analysisInfo); + memcpy(m_analysisInfo, src, sizeof(x265_analysis_data)); + x265_alloc_analysis_data(m_param, m_analysisInfo); + + bool isVbv = m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate; + if (m_param->bDisableLookahead && isVbv) + { + memcpy(m_analysisInfo->lookahead.intraSatdForVbv, src->lookahead.intraSatdForVbv, src->numCuInHeight * sizeof(uint32_t)); + memcpy(m_analysisInfo->lookahead.satdForVbv, src->lookahead.satdForVbv, src->numCuInHeight * sizeof(uint32_t)); + memcpy(m_analysisInfo->lookahead.intraVbvCost, src->lookahead.intraVbvCost, src->numCUsInFrame * sizeof(uint32_t)); + memcpy(m_analysisInfo->lookahead.vbvCost, src->lookahead.vbvCost, src->numCUsInFrame * sizeof(uint32_t)); + } + + 
if (src->sliceType == X265_TYPE_IDR || src->sliceType == X265_TYPE_I) + { + if (m_param->analysisSaveReuseLevel < 2) + goto ret; + x265_analysis_intra_data *intraDst, *intraSrc; + intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData; + intraSrc = (x265_analysis_intra_data*)src->intraData; + memcpy(intraDst->depth, intraSrc->depth, sizeof(uint8_t) * src->depthBytes); + memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numCUsInFrame * src->numPartitions); + memcpy(intraDst->partSizes, intraSrc->partSizes, sizeof(char) * src->depthBytes); + memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes); + if (m_param->rc.cuTree) + memcpy(intraDst->cuQPOff, intraSrc->cuQPOff, sizeof(int8_t) * src->depthBytes); + } + else + { + bool bIntraInInter = (src->sliceType == X265_TYPE_P || m_param->bIntraInBFrames); + int numDir = src->sliceType == X265_TYPE_P ? 1 : 2; + memcpy(m_analysisInfo->wt, src->wt, sizeof(WeightParam) * 3 * numDir); + if (m_param->analysisSaveReuseLevel < 2) + goto ret; + x265_analysis_inter_data *interDst, *interSrc; + interDst = (x265_analysis_inter_data*)m_analysisInfo->interData; + interSrc = (x265_analysis_inter_data*)src->interData; + memcpy(interDst->depth, interSrc->depth, sizeof(uint8_t) * src->depthBytes); + memcpy(interDst->modes, interSrc->modes, sizeof(uint8_t) * src->depthBytes); + if (m_param->rc.cuTree) + memcpy(interDst->cuQPOff, interSrc->cuQPOff, sizeof(int8_t) * src->depthBytes); + if (m_param->analysisSaveReuseLevel > 4) + { + memcpy(interDst->partSize, interSrc->partSize, sizeof(uint8_t) * src->depthBytes); + memcpy(interDst->mergeFlag, interSrc->mergeFlag, sizeof(uint8_t) * src->depthBytes); + if (m_param->analysisSaveReuseLevel == 10) + { + memcpy(interDst->interDir, interSrc->interDir, sizeof(uint8_t) * src->depthBytes); + for (int dir = 0; dir < numDir; dir++) + { + memcpy(interDst->mvpIdxdir, interSrc->mvpIdxdir, sizeof(uint8_t) * src->depthBytes); + memcpy(interDst->refIdxdir, interSrc->refIdxdir, sizeof(int8_t) * src->depthBytes); + memcpy(interDst->mvdir, interSrc->mvdir, sizeof(MV) * src->depthBytes); + } + if (bIntraInInter) + { + x265_analysis_intra_data *intraDst = (x265_analysis_intra_data*)m_analysisInfo->intraData; + x265_analysis_intra_data *intraSrc = (x265_analysis_intra_data*)src->intraData; + memcpy(intraDst->modes, intraSrc->modes, sizeof(uint8_t) * src->numPartitions * src->numCUsInFrame); + memcpy(intraDst->chromaModes, intraSrc->chromaModes, sizeof(uint8_t) * src->depthBytes); + } + } + } + if (m_param->analysisSaveReuseLevel != 10) + memcpy(interDst->ref, interSrc->ref, sizeof(int32_t) * src->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); + } + +ret: + //increment analysis Write counter + m_parent->m_analysisWriteCntm_id.incr(); + m_parent->m_analysisWritem_idindex.incr(); + return; + } + + + bool PassEncoder::readPicture(x265_picture *dstPic) + { + /*Check and wait if there any input frames to read*/ + int ipread = m_parent->m_picReadCntm_id.get(); + int ipwrite = m_parent->m_picWriteCntm_id.get(); + + bool isAbrLoad = m_cliopt.loadLevel && (m_parent->m_numEncodes > 1); + while (!m_inputOver && (ipread == ipwrite)) + { + ipwrite = m_parent->m_picWriteCntm_id.waitForChange(ipwrite); + } + + if (m_threadActive && ipread < ipwrite) + { + /*Get input index to read from inputQueue. 
If doesn't need analysis info, it need not wait to fetch poc from analysisQueue*/ + int readPos = ipread % m_parent->m_queueSize; + x265_analysis_data* analysisData = 0; + + if (isAbrLoad) + { + /*If stream is master of each slave pass, then fetch analysis data from prev pass*/ + int analysisQId = m_cliopt.refId; + /*Check and wait if there any analysis Data to read*/ + int analysisWrite = m_parent->m_analysisWriteCntanalysisQId.get(); + int written = analysisWrite * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; + int analysisRead = m_parent->m_analysisReadCntanalysisQId.get(); + + while (m_threadActive && written == analysisRead) + { + analysisWrite = m_parent->m_analysisWriteCntanalysisQId.waitForChange(analysisWrite); + written = analysisWrite * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; + } + + if (analysisRead < written) + { + int analysisIdx = 0; + if (!m_param->bDisableLookahead) + { + bool analysisdRead = false; + while ((analysisRead < written) && !analysisdRead) + { + while (analysisWrite < ipread) + { + analysisWrite = m_parent->m_analysisWriteCntanalysisQId.waitForChange(analysisWrite); + written = analysisWrite * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; + } + for (uint32_t i = 0; i < m_parent->m_queueSize; i++) + { + analysisData = &m_parent->m_analysisBufferanalysisQIdi; + int read = m_parent->m_analysisReadanalysisQIdi.get(); + int write = m_parent->m_analysisWriteanalysisQIdi.get() * m_parent->m_passEncanalysisQId->m_cliopt.numRefs; + if ((analysisData->poc == (uint32_t)(ipread)) && (read < write)) + { + analysisIdx = i; + analysisdRead = true; + break; + } + } + } + } + else + { + analysisIdx = analysisRead % m_parent->m_queueSize; + analysisData = &m_parent->m_analysisBufferanalysisQIdanalysisIdx; + readPos = analysisData->poc % m_parent->m_queueSize; + while ((ipwrite < readPos) || ((ipwrite - 1) < (int)analysisData->poc)) + { + ipwrite = m_parent->m_picWriteCntm_id.waitForChange(ipwrite); + } + } + + m_lastIdx = analysisIdx; + } + else + return false; + } + + + x265_picture *srcPic = (x265_picture*)(m_parent->m_inputPicBufferm_idreadPos); + + x265_picture *pic = (x265_picture*)(dstPic); + pic->colorSpace = srcPic->colorSpace; + pic->bitDepth = srcPic->bitDepth; + pic->framesize = srcPic->framesize; + pic->height = srcPic->height; + pic->pts = srcPic->pts; + pic->dts = srcPic->dts; + pic->reorderedPts = srcPic->reorderedPts; + pic->width = srcPic->width; + pic->analysisData = srcPic->analysisData; + pic->userSEI = srcPic->userSEI; + pic->stride0 = srcPic->stride0; + pic->stride1 = srcPic->stride1; + pic->stride2 = srcPic->stride2; + pic->planes0 = srcPic->planes0; + pic->planes1 = srcPic->planes1; + pic->planes2 = srcPic->planes2; + if (isAbrLoad) + pic->analysisData = *analysisData; + return true; + } + else + return false; + } + + void PassEncoder::threadMain() + { THREAD_NAME("PassEncoder", m_id); while (m_threadActive) { - -#if ENABLE_LIBVMAF - x265_vmaf_data* vmafdata = m_cliopt.vmafData; -#endif - /* This allows muxers to modify bitstream format */ - m_cliopt.output->setParam(m_param); - const x265_api* api = m_cliopt.api; - ReconPlay* reconPlay = NULL; - if (m_cliopt.reconPlayCmd) - reconPlay = new ReconPlay(m_cliopt.reconPlayCmd, *m_param); - char* profileName = m_cliopt.encName ? 
m_cliopt.encName : (char *)"x265"; - - if (m_cliopt.zoneFile) - { - if (!m_cliopt.parseZoneFile()) - { - x265_log(NULL, X265_LOG_ERROR, "Unable to parse zonefile in %s\n", profileName); - fclose(m_cliopt.zoneFile); - m_cliopt.zoneFile = NULL; - } - } - - if (signal(SIGINT, sigint_handler) == SIG_ERR) - x265_log(m_param, X265_LOG_ERROR, "Unable to register CTRL+C handler: %s in %s\n", - strerror(errno), profileName); - - x265_picture pic_orig, pic_out; - x265_picture *pic_in = &pic_orig; - /* Allocate recon picture if analysis save/load is enabled */ - std::priority_queue<int64_t>* pts_queue = m_cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL; - x265_picture *pic_recon = (m_cliopt.recon || m_param->analysisSave || m_param->analysisLoad || pts_queue || reconPlay || m_param->csvLogLevel) ? &pic_out : NULL; - uint32_t inFrameCount = 0; - uint32_t outFrameCount = 0; - x265_nal *p_nal; - x265_stats stats; - uint32_t nal; - int16_t *errorBuf = NULL; - bool bDolbyVisionRPU = false; - uint8_t *rpuPayload = NULL; - int inputPicNum = 1; - x265_picture picField1, picField2; - x265_analysis_data* analysisInfo = (x265_analysis_data*)(&pic_out.analysisData); - bool isAbrSave = m_cliopt.saveLevel && (m_parent->m_numEncodes > 1); - - if (!m_param->bRepeatHeaders && !m_param->bEnableSvtHevc) - { - if (api->encoder_headers(m_encoder, &p_nal, &nal) < 0) - { - x265_log(m_param, X265_LOG_ERROR, "Failure generating stream headers in %s\n", profileName); - m_ret = 3; - goto fail; - } - else - m_cliopt.totalbytes += m_cliopt.output->writeHeaders(p_nal, nal); - } - - if (m_param->bField && m_param->interlaceMode) - { - api->picture_init(m_param, &picField1); - api->picture_init(m_param, &picField2); - // return back the original height of input - m_param->sourceHeight *= 2; - api->picture_init(m_param, &pic_orig); - } - else - api->picture_init(m_param, &pic_orig); - - if (m_param->dolbyProfile && m_cliopt.dolbyVisionRpu) - { - rpuPayload = X265_MALLOC(uint8_t, 1024); - pic_in->rpu.payload = rpuPayload; - if (pic_in->rpu.payload) - bDolbyVisionRPU = true; - } - - if (m_cliopt.bDither) - { - errorBuf = X265_MALLOC(int16_t, m_param->sourceWidth + 1); - if (errorBuf) - memset(errorBuf, 0, (m_param->sourceWidth + 1) * sizeof(int16_t)); - else - m_cliopt.bDither = false; - } - - // main encoder loop - while (pic_in && !b_ctrl_c) - { - pic_orig.poc = (m_param->bField && m_param->interlaceMode) ? 
inFrameCount * 2 : inFrameCount; - if (m_cliopt.qpfile) - { - if (!m_cliopt.parseQPFile(pic_orig)) - { - x265_log(NULL, X265_LOG_ERROR, "can't parse qpfile for frame %d in %s\n", - pic_in->poc, profileName); - fclose(m_cliopt.qpfile); - m_cliopt.qpfile = NULL; - } - } - - if (m_cliopt.framesToBeEncoded && inFrameCount >= m_cliopt.framesToBeEncoded) - pic_in = NULL; - else if (readPicture(pic_in)) - inFrameCount++; - else - pic_in = NULL; - - if (pic_in) - { - if (pic_in->bitDepth > m_param->internalBitDepth && m_cliopt.bDither) - { - x265_dither_image(pic_in, m_cliopt.input->getWidth(), m_cliopt.input->getHeight(), errorBuf, m_param->internalBitDepth); - pic_in->bitDepth = m_param->internalBitDepth; - } - /* Overwrite PTS */ - pic_in->pts = pic_in->poc; - - // convert to field - if (m_param->bField && m_param->interlaceMode) - { - int height = pic_in->height >> 1; - - int static bCreated = 0; - if (bCreated == 0) - { - bCreated = 1; - inputPicNum = 2; - picField1.fieldNum = 1; - picField2.fieldNum = 2; - - picField1.bitDepth = picField2.bitDepth = pic_in->bitDepth; - picField1.colorSpace = picField2.colorSpace = pic_in->colorSpace; - picField1.height = picField2.height = pic_in->height >> 1; - picField1.framesize = picField2.framesize = pic_in->framesize >> 1; - - size_t fieldFrameSize = (size_t)pic_in->framesize >> 1; - char* field1Buf = X265_MALLOC(char, fieldFrameSize); - char* field2Buf = X265_MALLOC(char, fieldFrameSize); - - int stride = picField1.stride0 = picField2.stride0 = pic_in->stride0; - uint64_t framesize = stride * (height >> x265_cli_cspspic_in->colorSpace.height0); - picField1.planes0 = field1Buf; - picField2.planes0 = field2Buf; - for (int i = 1; i < x265_cli_cspspic_in->colorSpace.planes; i++) - { - picField1.planesi = field1Buf + framesize; - picField2.planesi = field2Buf + framesize; - - stride = picField1.stridei = picField2.stridei = pic_in->stridei; - framesize += (stride * (height >> x265_cli_cspspic_in->colorSpace.heighti)); - } - assert(framesize == picField1.framesize); - } - - picField1.pts = picField1.poc = pic_in->poc; - picField2.pts = picField2.poc = pic_in->poc + 1; - - picField1.userSEI = picField2.userSEI = pic_in->userSEI; - - //if (pic_in->userData) - //{ - // // Have to handle userData here - //} - - if (pic_in->framesize) - { - for (int i = 0; i < x265_cli_cspspic_in->colorSpace.planes; i++) - { - char* srcP1 = (char*)pic_in->planesi; - char* srcP2 = (char*)pic_in->planesi + pic_in->stridei; - char* p1 = (char*)picField1.planesi; - char* p2 = (char*)picField2.planesi; - - int stride = picField1.stridei; - - for (int y = 0; y < (height >> x265_cli_cspspic_in->colorSpace.heighti); y++) - { - memcpy(p1, srcP1, stride); - memcpy(p2, srcP2, stride); - srcP1 += 2 * stride; - srcP2 += 2 * stride; - p1 += stride; - p2 += stride; - } - } - } - } - - if (bDolbyVisionRPU) - { - if (m_param->bField && m_param->interlaceMode) - { - if (m_cliopt.rpuParser(&picField1) > 0) - goto fail; - if (m_cliopt.rpuParser(&picField2) > 0) - goto fail; - } - else - { - if (m_cliopt.rpuParser(pic_in) > 0) - goto fail; - } - } - } - - for (int inputNum = 0; inputNum < inputPicNum; inputNum++) - { - x265_picture *picInput = NULL; - if (inputPicNum == 2) - picInput = pic_in ? (inputNum ? 
&picField2 : &picField1) : NULL; - else - picInput = pic_in; - - int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, picInput, pic_recon); - - int idx = (inFrameCount - 1) % m_parent->m_queueSize; - m_parent->m_picIdxReadCntm_ididx.incr(); - m_parent->m_picReadCntm_id.incr(); - if (m_cliopt.loadLevel && picInput) - { - m_parent->m_analysisReadCntm_cliopt.refId.incr(); - m_parent->m_analysisReadm_cliopt.refIdm_lastIdx.incr(); - } - - if (numEncoded < 0) - { - b_ctrl_c = 1; - m_ret = 4; - break; - } - - if (reconPlay && numEncoded) - reconPlay->writePicture(*pic_recon); - - outFrameCount += numEncoded; - - if (isAbrSave && numEncoded) - { - copyInfo(analysisInfo); - } - - if (numEncoded && pic_recon && m_cliopt.recon) - m_cliopt.recon->writePicture(pic_out); - if (nal) - { - m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out); - if (pts_queue) - { - pts_queue->push(-pic_out.pts); - if (pts_queue->size() > 2) - pts_queue->pop(); - } - } - m_cliopt.printStatus(outFrameCount); - } - } - - /* Flush the encoder */ - while (!b_ctrl_c) - { - int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, NULL, pic_recon); - if (numEncoded < 0) - { - m_ret = 4; - break; - } - - if (reconPlay && numEncoded) - reconPlay->writePicture(*pic_recon); - - outFrameCount += numEncoded; - if (isAbrSave && numEncoded) - { - copyInfo(analysisInfo); - } - - if (numEncoded && pic_recon && m_cliopt.recon) - m_cliopt.recon->writePicture(pic_out); - if (nal) - { - m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out); - if (pts_queue) - { - pts_queue->push(-pic_out.pts); - if (pts_queue->size() > 2) - pts_queue->pop(); - } - } - - m_cliopt.printStatus(outFrameCount); - - if (!numEncoded) - break; - } - - if (bDolbyVisionRPU) - { - if (fgetc(m_cliopt.dolbyVisionRpu) != EOF) - x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU count is greater than frame count in %s\n", - profileName); - x265_log(NULL, X265_LOG_INFO, "VES muxing with Dolby Vision RPU file successful in %s\n", - profileName); - } - - /* clear progress report */ - if (m_cliopt.bProgress) - fprintf(stderr, "%*s\r", 80, " "); - - fail: - - delete reconPlay; - - api->encoder_get_stats(m_encoder, &stats, sizeof(stats)); - if (m_param->csvfn && !b_ctrl_c) -#if ENABLE_LIBVMAF - api->vmaf_encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString, m_cliopt.param, vmafdata); -#else - api->encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString); -#endif - api->encoder_close(m_encoder); - - int64_t second_largest_pts = 0; - int64_t largest_pts = 0; - if (pts_queue && pts_queue->size() >= 2) - { - second_largest_pts = -pts_queue->top(); - pts_queue->pop(); - largest_pts = -pts_queue->top(); - pts_queue->pop(); - delete pts_queue; - pts_queue = NULL; - } - m_cliopt.output->closeFile(largest_pts, second_largest_pts); - - if (b_ctrl_c) - general_log(m_param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d in %s\n", - m_cliopt.seek + inFrameCount, stats.encodedPictureCount, profileName); - - api->param_free(m_param); - - X265_FREE(errorBuf); - X265_FREE(rpuPayload); - - m_threadActive = false; - m_parent->m_numActiveEncodes.decr(); - } - } - - void PassEncoder::destroy() - { - stop(); - if (m_reader) - { - m_reader->stop(); - delete m_reader; - } - else - { - m_scaler->stop(); - m_scaler->destroy(); - delete m_scaler; - } - } - - Scaler::Scaler(int threadId, int threadNum, int id, VideoDesc *src, VideoDesc *dst, PassEncoder *parentEnc) - { - m_parentEnc = parentEnc; - m_id = id; - m_srcFormat = src; - 
m_dstFormat = dst; - m_threadActive = false; - m_scaleFrameSize = 0; - m_filterManager = NULL; - m_threadId = threadId; - m_threadTotal = threadNum; - - int csp = dst->m_csp; - uint32_t pixelbytes = dst->m_inputDepth > 8 ? 2 : 1; - for (int i = 0; i < x265_cli_cspscsp.planes; i++) - { - int w = dst->m_width >> x265_cli_cspscsp.widthi; - int h = dst->m_height >> x265_cli_cspscsp.heighti; - m_scalePlanesi = w * h * pixelbytes; - m_scaleFrameSize += m_scalePlanesi; - } - - if (src->m_height != dst->m_height || src->m_width != dst->m_width) - { - m_filterManager = new ScalerFilterManager; - m_filterManager->init(4, m_srcFormat, m_dstFormat); - } - } - - bool Scaler::scalePic(x265_picture * destination, x265_picture * source) - { - if (!destination || !source) - return false; - x265_param* param = m_parentEnc->m_param; - int pixelBytes = m_dstFormat->m_inputDepth > 8 ? 2 : 1; - if (m_srcFormat->m_height != m_dstFormat->m_height || m_srcFormat->m_width != m_dstFormat->m_width) - { - void **srcPlane = NULL, **dstPlane = NULL; - int srcStride3, dstStride3; - destination->bitDepth = source->bitDepth; - destination->colorSpace = source->colorSpace; - destination->pts = source->pts; - destination->dts = source->dts; - destination->reorderedPts = source->reorderedPts; - destination->poc = source->poc; - destination->userSEI = source->userSEI; - srcPlane = source->planes; - dstPlane = destination->planes; - srcStride0 = source->stride0; - destination->stride0 = m_dstFormat->m_width * pixelBytes; - dstStride0 = destination->stride0; - if (param->internalCsp != X265_CSP_I400) - { - srcStride1 = source->stride1; - srcStride2 = source->stride2; - destination->stride1 = destination->stride0 >> x265_cli_cspsparam->internalCsp.width1; - destination->stride2 = destination->stride0 >> x265_cli_cspsparam->internalCsp.width2; - dstStride1 = destination->stride1; - dstStride2 = destination->stride2; - } - if (m_scaleFrameSize) - { - m_filterManager->scale_pic(srcPlane, dstPlane, srcStride, dstStride); - return true; - } - else - x265_log(param, X265_LOG_INFO, "Empty frame received\n"); - } - return false; - } - - void Scaler::threadMain() - { - THREAD_NAME("Scaler", m_id); - - /* unscaled picture is stored in the last index */ - uint32_t srcId = m_id - 1; - int QDepth = m_parentEnc->m_parent->m_queueSize; - while (!m_parentEnc->m_inputOver) - { - - uint32_t scaledWritten = m_parentEnc->m_parent->m_picWriteCntm_id.get(); - - if (m_parentEnc->m_cliopt.framesToBeEncoded && scaledWritten >= m_parentEnc->m_cliopt.framesToBeEncoded) - break; - - if (m_threadTotal > 1 && (m_threadId != scaledWritten % m_threadTotal)) - { - continue; - } - uint32_t written = m_parentEnc->m_parent->m_picWriteCntsrcId.get(); - - /*If all the input pictures are scaled by the current scale worker thread wait for input pictures*/ - while (m_threadActive && (scaledWritten == written)) { - written = m_parentEnc->m_parent->m_picWriteCntsrcId.waitForChange(written); - } - - if (m_threadActive && scaledWritten < written) - { - - int scaledWriteIdx = scaledWritten % QDepth; - int overWritePicBuffer = scaledWritten / QDepth; - int read = m_parentEnc->m_parent->m_picIdxReadCntm_idscaledWriteIdx.get(); - - while (overWritePicBuffer && read < overWritePicBuffer) - { - read = m_parentEnc->m_parent->m_picIdxReadCntm_idscaledWriteIdx.waitForChange(read); - } - - if (!m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx) - { - int framesize = 0; - int planesize3; - int csp = m_dstFormat->m_csp; - int stride3; - stride0 = m_dstFormat->m_width; - stride1 
= stride0 >> x265_cli_cspscsp.width1; - stride2 = stride0 >> x265_cli_cspscsp.width2; - for (int i = 0; i < x265_cli_cspscsp.planes; i++) - { - uint32_t h = m_dstFormat->m_height >> x265_cli_cspscsp.heighti; - planesizei = h * stridei; - framesize += planesizei; - } - - m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx = x265_picture_alloc(); - x265_picture_init(m_parentEnc->m_param, m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx); - - ((x265_picture*)m_parentEnc->m_parent->m_inputPicBufferm_idscaledWritten % QDepth)->framesize = framesize; - for (int32_t j = 0; j < x265_cli_cspscsp.planes; j++) - { - m_parentEnc->m_parent->m_inputPicBufferm_idscaledWritten % QDepth->planesj = X265_MALLOC(char, planesizej); - } - } - - x265_picture *srcPic = m_parentEnc->m_parent->m_inputPicBuffersrcIdscaledWritten % QDepth; - x265_picture* destPic = m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx; - - // Enqueue this picture up with the current encoder so that it will asynchronously encode - if (!scalePic(destPic, srcPic)) - x265_log(NULL, X265_LOG_ERROR, "Unable to copy scaled input picture to input queue \n"); - else - m_parentEnc->m_parent->m_picWriteCntm_id.incr(); - m_scaledWriteCnt.incr(); - m_parentEnc->m_parent->m_picIdxReadCntsrcIdscaledWriteIdx.incr(); - } - if (m_threadTotal > 1) - { - written = m_parentEnc->m_parent->m_picWriteCntsrcId.get(); - int totalWrite = written / m_threadTotal; - if (written % m_threadTotal > m_threadId) - totalWrite++; - if (totalWrite == m_scaledWriteCnt.get()) - { - m_parentEnc->m_parent->m_picWriteCntsrcId.poke(); - m_parentEnc->m_parent->m_picWriteCntm_id.poke(); - break; - } - } - else - { - /* Once end of video is reached and all frames are scaled, release wait on picwritecount */ - scaledWritten = m_parentEnc->m_parent->m_picWriteCntm_id.get(); - written = m_parentEnc->m_parent->m_picWriteCntsrcId.get(); - if (written == scaledWritten) - { - m_parentEnc->m_parent->m_picWriteCntsrcId.poke(); - m_parentEnc->m_parent->m_picWriteCntm_id.poke(); - break; - } - } - - } - m_threadActive = false; - destroy(); - } - - Reader::Reader(int id, PassEncoder *parentEnc) - { - m_parentEnc = parentEnc; - m_id = id; - m_input = parentEnc->m_input; - } - - void Reader::threadMain() - { - THREAD_NAME("Reader", m_id); - - int QDepth = m_parentEnc->m_parent->m_queueSize; - x265_picture* src = x265_picture_alloc(); - x265_picture_init(m_parentEnc->m_param, src); - - while (m_threadActive) - { - uint32_t written = m_parentEnc->m_parent->m_picWriteCntm_id.get(); - uint32_t writeIdx = written % QDepth; - uint32_t read = m_parentEnc->m_parent->m_picIdxReadCntm_idwriteIdx.get(); - uint32_t overWritePicBuffer = written / QDepth; - - if (m_parentEnc->m_cliopt.framesToBeEncoded && written >= m_parentEnc->m_cliopt.framesToBeEncoded) - break; - - while (overWritePicBuffer && read < overWritePicBuffer) - { - read = m_parentEnc->m_parent->m_picIdxReadCntm_idwriteIdx.waitForChange(read); - } - - x265_picture* dest = m_parentEnc->m_parent->m_inputPicBufferm_idwriteIdx; - if (m_input->readPicture(*src)) - { - dest->poc = src->poc; - dest->pts = src->pts; - dest->userSEI = src->userSEI; - dest->bitDepth = src->bitDepth; - dest->framesize = src->framesize; - dest->height = src->height; - dest->width = src->width; - dest->colorSpace = src->colorSpace; - dest->userSEI = src->userSEI; - dest->rpu.payload = src->rpu.payload; - dest->picStruct = src->picStruct; - dest->stride0 = src->stride0; - dest->stride1 = src->stride1; - dest->stride2 = src->stride2; - - if 
(!dest->planes0) - dest->planes0 = X265_MALLOC(char, dest->framesize); - - memcpy(dest->planes0, src->planes0, src->framesize * sizeof(char)); - dest->planes1 = (char*)dest->planes0 + src->stride0 * src->height; - dest->planes2 = (char*)dest->planes1 + src->stride1 * (src->height >> x265_cli_cspssrc->colorSpace.height1); - m_parentEnc->m_parent->m_picWriteCntm_id.incr(); - } - else - { - m_threadActive = false; - m_parentEnc->m_inputOver = true; - m_parentEnc->m_parent->m_picWriteCntm_id.poke(); - } - } - x265_picture_free(src); - } -} + +#if ENABLE_LIBVMAF + x265_vmaf_data* vmafdata = m_cliopt.vmafData; +#endif + /* This allows muxers to modify bitstream format */ + m_cliopt.output->setParam(m_param); + const x265_api* api = m_cliopt.api; + ReconPlay* reconPlay = NULL; + if (m_cliopt.reconPlayCmd) + reconPlay = new ReconPlay(m_cliopt.reconPlayCmd, *m_param); + char* profileName = m_cliopt.encName ? m_cliopt.encName : (char *)"x265"; + + if (signal(SIGINT, sigint_handler) == SIG_ERR) + x265_log(m_param, X265_LOG_ERROR, "Unable to register CTRL+C handler: %s in %s\n", + strerror(errno), profileName); + + x265_picture pic_orig, pic_out; + x265_picture *pic_in = &pic_orig; + /* Allocate recon picture if analysis save/load is enabled */ + std::priority_queue<int64_t>* pts_queue = m_cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL; + x265_picture *pic_recon = (m_cliopt.recon || m_param->analysisSave || m_param->analysisLoad || pts_queue || reconPlay || m_param->csvLogLevel) ? &pic_out : NULL; + uint32_t inFrameCount = 0; + uint32_t outFrameCount = 0; + x265_nal *p_nal; + x265_stats stats; + uint32_t nal; + int16_t *errorBuf = NULL; + bool bDolbyVisionRPU = false; + uint8_t *rpuPayload = NULL; + int inputPicNum = 1; + x265_picture picField1, picField2; + x265_analysis_data* analysisInfo = (x265_analysis_data*)(&pic_out.analysisData); + bool isAbrSave = m_cliopt.saveLevel && (m_parent->m_numEncodes > 1); + + if (!m_param->bRepeatHeaders && !m_param->bEnableSvtHevc) + { + if (api->encoder_headers(m_encoder, &p_nal, &nal) < 0) + { + x265_log(m_param, X265_LOG_ERROR, "Failure generating stream headers in %s\n", profileName); + m_ret = 3; + goto fail; + } + else + m_cliopt.totalbytes += m_cliopt.output->writeHeaders(p_nal, nal); + } + + if (m_param->bField && m_param->interlaceMode) + { + api->picture_init(m_param, &picField1); + api->picture_init(m_param, &picField2); + // return back the original height of input + m_param->sourceHeight *= 2; + api->picture_init(m_param, &pic_orig); + } + else + api->picture_init(m_param, &pic_orig); + + if (m_param->dolbyProfile && m_cliopt.dolbyVisionRpu) + { + rpuPayload = X265_MALLOC(uint8_t, 1024); + pic_in->rpu.payload = rpuPayload; + if (pic_in->rpu.payload) + bDolbyVisionRPU = true; + } + + if (m_cliopt.bDither) + { + errorBuf = X265_MALLOC(int16_t, m_param->sourceWidth + 1); + if (errorBuf) + memset(errorBuf, 0, (m_param->sourceWidth + 1) * sizeof(int16_t)); + else + m_cliopt.bDither = false; + } + + // main encoder loop + while (pic_in && !b_ctrl_c) + { + pic_orig.poc = (m_param->bField && m_param->interlaceMode) ? 
inFrameCount * 2 : inFrameCount; + if (m_cliopt.qpfile) + { + if (!m_cliopt.parseQPFile(pic_orig)) + { + x265_log(NULL, X265_LOG_ERROR, "can't parse qpfile for frame %d in %s\n", + pic_in->poc, profileName); + fclose(m_cliopt.qpfile); + m_cliopt.qpfile = NULL; + } + } + + if (m_cliopt.framesToBeEncoded && inFrameCount >= m_cliopt.framesToBeEncoded) + pic_in = NULL; + else if (readPicture(pic_in)) + inFrameCount++; + else + pic_in = NULL; + + if (pic_in) + { + if (pic_in->bitDepth > m_param->internalBitDepth && m_cliopt.bDither) + { + x265_dither_image(pic_in, m_cliopt.input->getWidth(), m_cliopt.input->getHeight(), errorBuf, m_param->internalBitDepth); + pic_in->bitDepth = m_param->internalBitDepth; + } + /* Overwrite PTS */ + pic_in->pts = pic_in->poc; + + // convert to field + if (m_param->bField && m_param->interlaceMode) + { + int height = pic_in->height >> 1; + + int static bCreated = 0; + if (bCreated == 0) + { + bCreated = 1; + inputPicNum = 2; + picField1.fieldNum = 1; + picField2.fieldNum = 2; + + picField1.bitDepth = picField2.bitDepth = pic_in->bitDepth; + picField1.colorSpace = picField2.colorSpace = pic_in->colorSpace; + picField1.height = picField2.height = pic_in->height >> 1; + picField1.framesize = picField2.framesize = pic_in->framesize >> 1; + + size_t fieldFrameSize = (size_t)pic_in->framesize >> 1; + char* field1Buf = X265_MALLOC(char, fieldFrameSize); + char* field2Buf = X265_MALLOC(char, fieldFrameSize); + + int stride = picField1.stride0 = picField2.stride0 = pic_in->stride0; + uint64_t framesize = stride * (height >> x265_cli_cspspic_in->colorSpace.height0); + picField1.planes0 = field1Buf; + picField2.planes0 = field2Buf; + for (int i = 1; i < x265_cli_cspspic_in->colorSpace.planes; i++) + { + picField1.planesi = field1Buf + framesize; + picField2.planesi = field2Buf + framesize; + + stride = picField1.stridei = picField2.stridei = pic_in->stridei; + framesize += (stride * (height >> x265_cli_cspspic_in->colorSpace.heighti)); + } + assert(framesize == picField1.framesize); + } + + picField1.pts = picField1.poc = pic_in->poc; + picField2.pts = picField2.poc = pic_in->poc + 1; + + picField1.userSEI = picField2.userSEI = pic_in->userSEI; + + //if (pic_in->userData) + //{ + // // Have to handle userData here + //} + + if (pic_in->framesize) + { + for (int i = 0; i < x265_cli_cspspic_in->colorSpace.planes; i++) + { + char* srcP1 = (char*)pic_in->planesi; + char* srcP2 = (char*)pic_in->planesi + pic_in->stridei; + char* p1 = (char*)picField1.planesi; + char* p2 = (char*)picField2.planesi; + + int stride = picField1.stridei; + + for (int y = 0; y < (height >> x265_cli_cspspic_in->colorSpace.heighti); y++) + { + memcpy(p1, srcP1, stride); + memcpy(p2, srcP2, stride); + srcP1 += 2 * stride; + srcP2 += 2 * stride; + p1 += stride; + p2 += stride; + } + } + } + } + + if (bDolbyVisionRPU) + { + if (m_param->bField && m_param->interlaceMode) + { + if (m_cliopt.rpuParser(&picField1) > 0) + goto fail; + if (m_cliopt.rpuParser(&picField2) > 0) + goto fail; + } + else + { + if (m_cliopt.rpuParser(pic_in) > 0) + goto fail; + } + } + } + + for (int inputNum = 0; inputNum < inputPicNum; inputNum++) + { + x265_picture *picInput = NULL; + if (inputPicNum == 2) + picInput = pic_in ? (inputNum ? 
&picField2 : &picField1) : NULL; + else + picInput = pic_in; + + int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, picInput, pic_recon); + + int idx = (inFrameCount - 1) % m_parent->m_queueSize; + m_parent->m_picIdxReadCntm_ididx.incr(); + m_parent->m_picReadCntm_id.incr(); + if (m_cliopt.loadLevel && picInput) + { + m_parent->m_analysisReadCntm_cliopt.refId.incr(); + m_parent->m_analysisReadm_cliopt.refIdm_lastIdx.incr(); + } + + if (numEncoded < 0) + { + b_ctrl_c = 1; + m_ret = 4; + break; + } + + if (reconPlay && numEncoded) + reconPlay->writePicture(*pic_recon); + + outFrameCount += numEncoded; + + if (isAbrSave && numEncoded) + { + copyInfo(analysisInfo); + } + + if (numEncoded && pic_recon && m_cliopt.recon) + m_cliopt.recon->writePicture(pic_out); + if (nal) + { + m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out); + if (pts_queue) + { + pts_queue->push(-pic_out.pts); + if (pts_queue->size() > 2) + pts_queue->pop(); + } + } + m_cliopt.printStatus(outFrameCount); + } + } + + /* Flush the encoder */ + while (!b_ctrl_c) + { + int numEncoded = api->encoder_encode(m_encoder, &p_nal, &nal, NULL, pic_recon); + if (numEncoded < 0) + { + m_ret = 4; + break; + } + + if (reconPlay && numEncoded) + reconPlay->writePicture(*pic_recon); + + outFrameCount += numEncoded; + if (isAbrSave && numEncoded) + { + copyInfo(analysisInfo); + } + + if (numEncoded && pic_recon && m_cliopt.recon) + m_cliopt.recon->writePicture(pic_out); + if (nal) + { + m_cliopt.totalbytes += m_cliopt.output->writeFrame(p_nal, nal, pic_out); + if (pts_queue) + { + pts_queue->push(-pic_out.pts); + if (pts_queue->size() > 2) + pts_queue->pop(); + } + } + + m_cliopt.printStatus(outFrameCount); + + if (!numEncoded) + break; + } + + if (bDolbyVisionRPU) + { + if (fgetc(m_cliopt.dolbyVisionRpu) != EOF) + x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU count is greater than frame count in %s\n", + profileName); + x265_log(NULL, X265_LOG_INFO, "VES muxing with Dolby Vision RPU file successful in %s\n", + profileName); + } + + /* clear progress report */ + if (m_cliopt.bProgress) + fprintf(stderr, "%*s\r", 80, " "); + + fail: + + delete reconPlay; + + api->encoder_get_stats(m_encoder, &stats, sizeof(stats)); + if (m_param->csvfn && !b_ctrl_c) +#if ENABLE_LIBVMAF + api->vmaf_encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString, m_cliopt.param, vmafdata); +#else + api->encoder_log(m_encoder, m_cliopt.argCnt, m_cliopt.argString); +#endif + api->encoder_close(m_encoder); + + int64_t second_largest_pts = 0; + int64_t largest_pts = 0; + if (pts_queue && pts_queue->size() >= 2) + { + second_largest_pts = -pts_queue->top(); + pts_queue->pop(); + largest_pts = -pts_queue->top(); + pts_queue->pop(); + delete pts_queue; + pts_queue = NULL; + } + m_cliopt.output->closeFile(largest_pts, second_largest_pts); + + if (b_ctrl_c) + general_log(m_param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d in %s\n", + m_cliopt.seek + inFrameCount, stats.encodedPictureCount, profileName); + + api->param_free(m_param); + + X265_FREE(errorBuf); + X265_FREE(rpuPayload); + + m_threadActive = false; + m_parent->m_numActiveEncodes.decr(); + } + } + + void PassEncoder::destroy() + { + stop(); + if (m_reader) + { + m_reader->stop(); + delete m_reader; + } + else + { + m_scaler->stop(); + m_scaler->destroy(); + delete m_scaler; + } + } + + Scaler::Scaler(int threadId, int threadNum, int id, VideoDesc *src, VideoDesc *dst, PassEncoder *parentEnc) + { + m_parentEnc = parentEnc; + m_id = id; + m_srcFormat = src; + 
m_dstFormat = dst; + m_threadActive = false; + m_scaleFrameSize = 0; + m_filterManager = NULL; + m_threadId = threadId; + m_threadTotal = threadNum; + + int csp = dst->m_csp; + uint32_t pixelbytes = dst->m_inputDepth > 8 ? 2 : 1; + for (int i = 0; i < x265_cli_cspscsp.planes; i++) + { + int w = dst->m_width >> x265_cli_cspscsp.widthi; + int h = dst->m_height >> x265_cli_cspscsp.heighti; + m_scalePlanesi = w * h * pixelbytes; + m_scaleFrameSize += m_scalePlanesi; + } + + if (src->m_height != dst->m_height || src->m_width != dst->m_width) + { + m_filterManager = new ScalerFilterManager; + m_filterManager->init(4, m_srcFormat, m_dstFormat); + } + } + + bool Scaler::scalePic(x265_picture * destination, x265_picture * source) + { + if (!destination || !source) + return false; + x265_param* param = m_parentEnc->m_param; + int pixelBytes = m_dstFormat->m_inputDepth > 8 ? 2 : 1; + if (m_srcFormat->m_height != m_dstFormat->m_height || m_srcFormat->m_width != m_dstFormat->m_width) + { + void **srcPlane = NULL, **dstPlane = NULL; + int srcStride3, dstStride3; + destination->bitDepth = source->bitDepth; + destination->colorSpace = source->colorSpace; + destination->pts = source->pts; + destination->dts = source->dts; + destination->reorderedPts = source->reorderedPts; + destination->poc = source->poc; + destination->userSEI = source->userSEI; + srcPlane = source->planes; + dstPlane = destination->planes; + srcStride0 = source->stride0; + destination->stride0 = m_dstFormat->m_width * pixelBytes; + dstStride0 = destination->stride0; + if (param->internalCsp != X265_CSP_I400) + { + srcStride1 = source->stride1; + srcStride2 = source->stride2; + destination->stride1 = destination->stride0 >> x265_cli_cspsparam->internalCsp.width1; + destination->stride2 = destination->stride0 >> x265_cli_cspsparam->internalCsp.width2; + dstStride1 = destination->stride1; + dstStride2 = destination->stride2; + } + if (m_scaleFrameSize) + { + m_filterManager->scale_pic(srcPlane, dstPlane, srcStride, dstStride); + return true; + } + else + x265_log(param, X265_LOG_INFO, "Empty frame received\n"); + } + return false; + } + + void Scaler::threadMain() + { + THREAD_NAME("Scaler", m_id); + + /* unscaled picture is stored in the last index */ + uint32_t srcId = m_id - 1; + int QDepth = m_parentEnc->m_parent->m_queueSize; + while (!m_parentEnc->m_inputOver) + { + + uint32_t scaledWritten = m_parentEnc->m_parent->m_picWriteCntm_id.get(); + + if (m_parentEnc->m_cliopt.framesToBeEncoded && scaledWritten >= m_parentEnc->m_cliopt.framesToBeEncoded) + break; + + if (m_threadTotal > 1 && (m_threadId != scaledWritten % m_threadTotal)) + { + continue; + } + uint32_t written = m_parentEnc->m_parent->m_picWriteCntsrcId.get(); + + /*If all the input pictures are scaled by the current scale worker thread wait for input pictures*/ + while (m_threadActive && (scaledWritten == written)) { + written = m_parentEnc->m_parent->m_picWriteCntsrcId.waitForChange(written); + } + + if (m_threadActive && scaledWritten < written) + { + + int scaledWriteIdx = scaledWritten % QDepth; + int overWritePicBuffer = scaledWritten / QDepth; + int read = m_parentEnc->m_parent->m_picIdxReadCntm_idscaledWriteIdx.get(); + + while (overWritePicBuffer && read < overWritePicBuffer) + { + read = m_parentEnc->m_parent->m_picIdxReadCntm_idscaledWriteIdx.waitForChange(read); + } + + if (!m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx) + { + int framesize = 0; + int planesize3; + int csp = m_dstFormat->m_csp; + int stride3; + stride0 = m_dstFormat->m_width; + stride1 
= stride0 >> x265_cli_cspscsp.width1; + stride2 = stride0 >> x265_cli_cspscsp.width2; + for (int i = 0; i < x265_cli_cspscsp.planes; i++) + { + uint32_t h = m_dstFormat->m_height >> x265_cli_cspscsp.heighti; + planesizei = h * stridei; + framesize += planesizei; + } + + m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx = x265_picture_alloc(); + x265_picture_init(m_parentEnc->m_param, m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx); + + ((x265_picture*)m_parentEnc->m_parent->m_inputPicBufferm_idscaledWritten % QDepth)->framesize = framesize; + for (int32_t j = 0; j < x265_cli_cspscsp.planes; j++) + { + m_parentEnc->m_parent->m_inputPicBufferm_idscaledWritten % QDepth->planesj = X265_MALLOC(char, planesizej); + } + } + + x265_picture *srcPic = m_parentEnc->m_parent->m_inputPicBuffersrcIdscaledWritten % QDepth; + x265_picture* destPic = m_parentEnc->m_parent->m_inputPicBufferm_idscaledWriteIdx; + + // Enqueue this picture up with the current encoder so that it will asynchronously encode + if (!scalePic(destPic, srcPic)) + x265_log(NULL, X265_LOG_ERROR, "Unable to copy scaled input picture to input queue \n"); + else + m_parentEnc->m_parent->m_picWriteCntm_id.incr(); + m_scaledWriteCnt.incr(); + m_parentEnc->m_parent->m_picIdxReadCntsrcIdscaledWriteIdx.incr(); + } + if (m_threadTotal > 1) + { + written = m_parentEnc->m_parent->m_picWriteCntsrcId.get(); + int totalWrite = written / m_threadTotal; + if (written % m_threadTotal > m_threadId) + totalWrite++; + if (totalWrite == m_scaledWriteCnt.get()) + { + m_parentEnc->m_parent->m_picWriteCntsrcId.poke(); + m_parentEnc->m_parent->m_picWriteCntm_id.poke(); + break; + } + } + else + { + /* Once end of video is reached and all frames are scaled, release wait on picwritecount */ + scaledWritten = m_parentEnc->m_parent->m_picWriteCntm_id.get(); + written = m_parentEnc->m_parent->m_picWriteCntsrcId.get(); + if (written == scaledWritten) + { + m_parentEnc->m_parent->m_picWriteCntsrcId.poke(); + m_parentEnc->m_parent->m_picWriteCntm_id.poke(); + break; + } + } + + } + m_threadActive = false; + destroy(); + } + + Reader::Reader(int id, PassEncoder *parentEnc) + { + m_parentEnc = parentEnc; + m_id = id; + m_input = parentEnc->m_input; + } + + void Reader::threadMain() + { + THREAD_NAME("Reader", m_id); + + int QDepth = m_parentEnc->m_parent->m_queueSize; + x265_picture* src = x265_picture_alloc(); + x265_picture_init(m_parentEnc->m_param, src); + + while (m_threadActive) + { + uint32_t written = m_parentEnc->m_parent->m_picWriteCntm_id.get(); + uint32_t writeIdx = written % QDepth; + uint32_t read = m_parentEnc->m_parent->m_picIdxReadCntm_idwriteIdx.get(); + uint32_t overWritePicBuffer = written / QDepth; + + if (m_parentEnc->m_cliopt.framesToBeEncoded && written >= m_parentEnc->m_cliopt.framesToBeEncoded) + break; + + while (overWritePicBuffer && read < overWritePicBuffer) + { + read = m_parentEnc->m_parent->m_picIdxReadCntm_idwriteIdx.waitForChange(read); + } + + x265_picture* dest = m_parentEnc->m_parent->m_inputPicBufferm_idwriteIdx; + if (m_input->readPicture(*src)) + { + dest->poc = src->poc; + dest->pts = src->pts; + dest->userSEI = src->userSEI; + dest->bitDepth = src->bitDepth; + dest->framesize = src->framesize; + dest->height = src->height; + dest->width = src->width; + dest->colorSpace = src->colorSpace; + dest->userSEI = src->userSEI; + dest->rpu.payload = src->rpu.payload; + dest->picStruct = src->picStruct; + dest->stride0 = src->stride0; + dest->stride1 = src->stride1; + dest->stride2 = src->stride2; + + if 
(!dest->planes0) + dest->planes0 = X265_MALLOC(char, dest->framesize); + + memcpy(dest->planes0, src->planes0, src->framesize * sizeof(char)); + dest->planes1 = (char*)dest->planes0 + src->stride0 * src->height; + dest->planes2 = (char*)dest->planes1 + src->stride1 * (src->height >> x265_cli_cspssrc->colorSpace.height1); + m_parentEnc->m_parent->m_picWriteCntm_id.incr(); + } + else + { + m_threadActive = false; + m_parentEnc->m_inputOver = true; + m_parentEnc->m_parent->m_picWriteCntm_id.poke(); + } + } + x265_picture_free(src); + } +}
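The Reader, Scaler and PassEncoder threads in the abrEncApp.cpp diff above hand pictures to each other through a fixed-depth ring of input buffers, coordinated by per-queue write and read counters (m_picWriteCnt, m_picIdxReadCnt) that expose get/incr/waitForChange/poke. The sketch below is a deliberately simplified, self-contained illustration of that handshake only; the Counter class, buffer layout and names here are invented for the example and are not x265's actual ThreadSafeInteger or queue implementation.

// Hypothetical, simplified sketch of the write-count / read-count handshake -- not x265 code.
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class Counter
{
    std::mutex m;
    std::condition_variable cv;
    uint32_t val = 0;
public:
    uint32_t get() { std::lock_guard<std::mutex> lk(m); return val; }
    void incr() { { std::lock_guard<std::mutex> lk(m); ++val; } cv.notify_all(); }
    void poke() { cv.notify_all(); }                 // wake waiters without changing the value
    uint32_t waitForChange(uint32_t prev)            // block until the value differs from 'prev'
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return val != prev; });
        return val;
    }
};

int main()
{
    const int QDepth = 4, frames = 10;
    std::vector<int> ring(QDepth);
    Counter writeCnt, readCnt;

    std::thread producer([&] {                       // plays the role of the Reader/Scaler
        for (int f = 0; f < frames; f++)
        {
            // do not overwrite a slot the consumer has not released yet
            while ((int)(writeCnt.get() - readCnt.get()) >= QDepth)
                readCnt.waitForChange(readCnt.get());
            ring[f % QDepth] = f;                    // "read a picture" into the slot
            writeCnt.incr();                         // publish it
        }
    });

    std::thread consumer([&] {                       // plays the role of the encoder loop
        for (int f = 0; f < frames; f++)
        {
            uint32_t w = writeCnt.get();
            while (w == readCnt.get())               // wait until a new picture is published
                w = writeCnt.waitForChange(readCnt.get());
            std::printf("encode frame %d\n", ring[readCnt.get() % QDepth]);
            readCnt.incr();                          // release the slot
        }
    });

    producer.join();
    consumer.join();
    return 0;
}

The poke() call mirrors how the real code releases waiters at end of stream without incrementing the counters.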
View file
x265_3.5.tar.gz/source/abrEncApp.h -> x265_3.6.tar.gz/source/abrEncApp.h
Changed
@@ -91,6 +91,7 @@
     FILE* m_qpfile;
     FILE* m_zoneFile;
     FILE* m_dolbyVisionRpu;/* File containing Dolby Vision BL RPU metadata */
+    FILE* m_scenecutAwareQpConfig;
     int m_ret;
View file
x265_3.5.tar.gz/source/cmake/FindNeon.cmake -> x265_3.6.tar.gz/source/cmake/FindNeon.cmake
Changed
@@ -1,10 +1,21 @@
 include(FindPackageHandleStandardArgs)

 # Check the version of neon supported by the ARM CPU
-execute_process(COMMAND cat /proc/cpuinfo | grep Features | grep neon
-                OUTPUT_VARIABLE neon_version
-                ERROR_QUIET
-                OUTPUT_STRIP_TRAILING_WHITESPACE)
+if(APPLE)
+    execute_process(COMMAND sysctl -a
+                    COMMAND grep "hw.optional.neon: 1"
+                    OUTPUT_VARIABLE neon_version
+                    ERROR_QUIET
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
+else()
+    execute_process(COMMAND cat /proc/cpuinfo
+                    COMMAND grep Features
+                    COMMAND grep neon
+                    OUTPUT_VARIABLE neon_version
+                    ERROR_QUIET
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
+endif()
+
 if(neon_version)
     set(CPU_HAS_NEON 1)
 endif()
View file
x265_3.6.tar.gz/source/cmake/FindSVE.cmake
Added
@@ -0,0 +1,21 @@
+include(FindPackageHandleStandardArgs)
+
+# Check the version of SVE supported by the ARM CPU
+if(APPLE)
+    execute_process(COMMAND sysctl -a
+                    COMMAND grep "hw.optional.sve: 1"
+                    OUTPUT_VARIABLE sve_version
+                    ERROR_QUIET
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
+else()
+    execute_process(COMMAND cat /proc/cpuinfo
+                    COMMAND grep Features
+                    COMMAND grep -e "sve$" -e "sve[[:space:]]"
+                    OUTPUT_VARIABLE sve_version
+                    ERROR_QUIET
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
+endif()
+
+if(sve_version)
+    set(CPU_HAS_SVE 1)
+endif()
View file
x265_3.6.tar.gz/source/cmake/FindSVE2.cmake
Added
@@ -0,0 +1,22 @@
+include(FindPackageHandleStandardArgs)
+
+# Check the version of SVE2 supported by the ARM CPU
+if(APPLE)
+    execute_process(COMMAND sysctl -a
+                    COMMAND grep "hw.optional.sve2: 1"
+                    OUTPUT_VARIABLE sve2_version
+                    ERROR_QUIET
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
+else()
+    execute_process(COMMAND cat /proc/cpuinfo
+                    COMMAND grep Features
+                    COMMAND grep sve2
+                    OUTPUT_VARIABLE sve2_version
+                    ERROR_QUIET
+                    OUTPUT_STRIP_TRAILING_WHITESPACE)
+endif()
+
+if(sve2_version)
+    set(CPU_HAS_SVE 1)
+    set(CPU_HAS_SVE2 1)
+endif()
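The two new CMake modules above decide at build time whether the SVE/SVE2 assembly sources get compiled, by grepping /proc/cpuinfo (or sysctl output on Apple silicon). At run time a binary can make a comparable decision on Linux/aarch64 through getauxval(). The snippet below is only a generic illustration of that approach; it is not the detection code that x265 itself ships (x265 has its own logic in cpu.cpp), and the HWCAP macros are guarded because older kernel headers may not define all of them.

// Illustrative only: generic Linux/aarch64 runtime feature probe, not x265's cpu.cpp.
#include <cstdio>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

int main()
{
#if defined(__linux__) && defined(__aarch64__)
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
#ifdef HWCAP_ASIMD
    std::printf("NEON/ASIMD: %s\n", (hwcap & HWCAP_ASIMD) ? "yes" : "no");
#endif
#ifdef HWCAP_SVE
    std::printf("SVE:        %s\n", (hwcap & HWCAP_SVE) ? "yes" : "no");
#endif
#ifdef HWCAP2_SVE2
    std::printf("SVE2:       %s\n", (hwcap2 & HWCAP2_SVE2) ? "yes" : "no");
#endif
#else
    std::printf("HWCAP probing as shown here applies to Linux/aarch64 only\n");
#endif
    return 0;
}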
View file
x265_3.5.tar.gz/source/common/CMakeLists.txt -> x265_3.6.tar.gz/source/common/CMakeLists.txt
Changed
@@ -84,35 +84,42 @@ endif(ENABLE_ASSEMBLY AND X86) if(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM)) - if(ARM64) - if(GCC AND (CMAKE_CXX_FLAGS_RELEASE MATCHES "-O3")) - message(STATUS "Detected CXX compiler using -O3 optimization level") - add_definitions(-DAUTO_VECTORIZE=1) - endif() - set(C_SRCS asm-primitives.cpp pixel.h ipfilter8.h) - - # add ARM assembly/intrinsic files here - set(A_SRCS asm.S mc-a.S sad-a.S pixel-util.S ipfilter8.S) - set(VEC_PRIMITIVES) + set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h) - set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources") - foreach(SRC ${C_SRCS}) - set(ASM_PRIMITIVES ${ASM_PRIMITIVES} aarch64/${SRC}) - endforeach() - else() - set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h) + # add ARM assembly/intrinsic files here + set(A_SRCS asm.S cpu-a.S mc-a.S sad-a.S pixel-util.S ssd-a.S blockcopy8.S ipfilter8.S dct-a.S) + set(VEC_PRIMITIVES) - # add ARM assembly/intrinsic files here - set(A_SRCS asm.S cpu-a.S mc-a.S sad-a.S pixel-util.S ssd-a.S blockcopy8.S ipfilter8.S dct-a.S) - set(VEC_PRIMITIVES) + set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources") + foreach(SRC ${C_SRCS}) + set(ASM_PRIMITIVES ${ASM_PRIMITIVES} arm/${SRC}) + endforeach() + source_group(Assembly FILES ${ASM_PRIMITIVES}) +endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM)) - set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources") - foreach(SRC ${C_SRCS}) - set(ASM_PRIMITIVES ${ASM_PRIMITIVES} arm/${SRC}) - endforeach() +if(ENABLE_ASSEMBLY AND (ARM64 OR CROSS_COMPILE_ARM64)) + if(GCC AND (CMAKE_CXX_FLAGS_RELEASE MATCHES "-O3")) + message(STATUS "Detected CXX compiler using -O3 optimization level") + add_definitions(-DAUTO_VECTORIZE=1) endif() + + set(C_SRCS asm-primitives.cpp pixel-prim.h pixel-prim.cpp filter-prim.h filter-prim.cpp dct-prim.h dct-prim.cpp loopfilter-prim.cpp loopfilter-prim.h intrapred-prim.cpp arm64-utils.cpp arm64-utils.h fun-decls.h) + enable_language(ASM) + + # add ARM assembly/intrinsic files here + set(A_SRCS asm.S mc-a.S mc-a-common.S sad-a.S sad-a-common.S pixel-util.S pixel-util-common.S p2s.S p2s-common.S ipfilter.S ipfilter-common.S blockcopy8.S blockcopy8-common.S ssd-a.S ssd-a-common.S) + set(A_SRCS_SVE asm-sve.S blockcopy8-sve.S p2s-sve.S pixel-util-sve.S ssd-a-sve.S) + set(A_SRCS_SVE2 mc-a-sve2.S sad-a-sve2.S pixel-util-sve2.S ipfilter-sve2.S ssd-a-sve2.S) + set(VEC_PRIMITIVES) + + set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources") + set(ARM_ASMS_SVE "${A_SRCS_SVE}" CACHE INTERNAL "ARM Assembly Sources that use SVE instruction set") + set(ARM_ASMS_SVE2 "${A_SRCS_SVE2}" CACHE INTERNAL "ARM Assembly Sources that use SVE2 instruction set") + foreach(SRC ${C_SRCS}) + set(ASM_PRIMITIVES ${ASM_PRIMITIVES} aarch64/${SRC}) + endforeach() source_group(Assembly FILES ${ASM_PRIMITIVES}) -endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM)) +endif(ENABLE_ASSEMBLY AND (ARM64 OR CROSS_COMPILE_ARM64)) if(POWER) set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS -DX265_VERSION=${X265_VERSION}) @@ -169,4 +176,6 @@ scalinglist.cpp scalinglist.h quant.cpp quant.h contexts.h deblock.cpp deblock.h - scaler.cpp scaler.h) + scaler.cpp scaler.h + ringmem.cpp ringmem.h + temporalfilter.cpp temporalfilter.h)
View file
x265_3.6.tar.gz/source/common/aarch64/arm64-utils.cpp
Added
@@ -0,0 +1,300 @@ +#include "common.h" +#include "x265.h" +#include "arm64-utils.h" +#include <arm_neon.h> + +#define COPY_16(d,s) *(uint8x16_t *)(d) = *(uint8x16_t *)(s) +namespace X265_NS +{ + + + +void transpose8x8(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride) +{ + uint8x8_t a0, a1, a2, a3, a4, a5, a6, a7; + uint8x8_t b0, b1, b2, b3, b4, b5, b6, b7; + + a0 = *(uint8x8_t *)(src + 0 * sstride); + a1 = *(uint8x8_t *)(src + 1 * sstride); + a2 = *(uint8x8_t *)(src + 2 * sstride); + a3 = *(uint8x8_t *)(src + 3 * sstride); + a4 = *(uint8x8_t *)(src + 4 * sstride); + a5 = *(uint8x8_t *)(src + 5 * sstride); + a6 = *(uint8x8_t *)(src + 6 * sstride); + a7 = *(uint8x8_t *)(src + 7 * sstride); + + b0 = vtrn1_u32(a0, a4); + b1 = vtrn1_u32(a1, a5); + b2 = vtrn1_u32(a2, a6); + b3 = vtrn1_u32(a3, a7); + b4 = vtrn2_u32(a0, a4); + b5 = vtrn2_u32(a1, a5); + b6 = vtrn2_u32(a2, a6); + b7 = vtrn2_u32(a3, a7); + + a0 = vtrn1_u16(b0, b2); + a1 = vtrn1_u16(b1, b3); + a2 = vtrn2_u16(b0, b2); + a3 = vtrn2_u16(b1, b3); + a4 = vtrn1_u16(b4, b6); + a5 = vtrn1_u16(b5, b7); + a6 = vtrn2_u16(b4, b6); + a7 = vtrn2_u16(b5, b7); + + b0 = vtrn1_u8(a0, a1); + b1 = vtrn2_u8(a0, a1); + b2 = vtrn1_u8(a2, a3); + b3 = vtrn2_u8(a2, a3); + b4 = vtrn1_u8(a4, a5); + b5 = vtrn2_u8(a4, a5); + b6 = vtrn1_u8(a6, a7); + b7 = vtrn2_u8(a6, a7); + + *(uint8x8_t *)(dst + 0 * dstride) = b0; + *(uint8x8_t *)(dst + 1 * dstride) = b1; + *(uint8x8_t *)(dst + 2 * dstride) = b2; + *(uint8x8_t *)(dst + 3 * dstride) = b3; + *(uint8x8_t *)(dst + 4 * dstride) = b4; + *(uint8x8_t *)(dst + 5 * dstride) = b5; + *(uint8x8_t *)(dst + 6 * dstride) = b6; + *(uint8x8_t *)(dst + 7 * dstride) = b7; +} + + + + + + +void transpose16x16(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride) +{ + uint16x8_t a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aA, aB, aC, aD, aE, aF; + uint16x8_t b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, bA, bB, bC, bD, bE, bF; + uint16x8_t c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, cA, cB, cC, cD, cE, cF; + uint16x8_t d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, dA, dB, dC, dD, dE, dF; + + a0 = *(uint16x8_t *)(src + 0 * sstride); + a1 = *(uint16x8_t *)(src + 1 * sstride); + a2 = *(uint16x8_t *)(src + 2 * sstride); + a3 = *(uint16x8_t *)(src + 3 * sstride); + a4 = *(uint16x8_t *)(src + 4 * sstride); + a5 = *(uint16x8_t *)(src + 5 * sstride); + a6 = *(uint16x8_t *)(src + 6 * sstride); + a7 = *(uint16x8_t *)(src + 7 * sstride); + a8 = *(uint16x8_t *)(src + 8 * sstride); + a9 = *(uint16x8_t *)(src + 9 * sstride); + aA = *(uint16x8_t *)(src + 10 * sstride); + aB = *(uint16x8_t *)(src + 11 * sstride); + aC = *(uint16x8_t *)(src + 12 * sstride); + aD = *(uint16x8_t *)(src + 13 * sstride); + aE = *(uint16x8_t *)(src + 14 * sstride); + aF = *(uint16x8_t *)(src + 15 * sstride); + + b0 = vtrn1q_u64(a0, a8); + b1 = vtrn1q_u64(a1, a9); + b2 = vtrn1q_u64(a2, aA); + b3 = vtrn1q_u64(a3, aB); + b4 = vtrn1q_u64(a4, aC); + b5 = vtrn1q_u64(a5, aD); + b6 = vtrn1q_u64(a6, aE); + b7 = vtrn1q_u64(a7, aF); + b8 = vtrn2q_u64(a0, a8); + b9 = vtrn2q_u64(a1, a9); + bA = vtrn2q_u64(a2, aA); + bB = vtrn2q_u64(a3, aB); + bC = vtrn2q_u64(a4, aC); + bD = vtrn2q_u64(a5, aD); + bE = vtrn2q_u64(a6, aE); + bF = vtrn2q_u64(a7, aF); + + c0 = vtrn1q_u32(b0, b4); + c1 = vtrn1q_u32(b1, b5); + c2 = vtrn1q_u32(b2, b6); + c3 = vtrn1q_u32(b3, b7); + c4 = vtrn2q_u32(b0, b4); + c5 = vtrn2q_u32(b1, b5); + c6 = vtrn2q_u32(b2, b6); + c7 = vtrn2q_u32(b3, b7); + c8 = vtrn1q_u32(b8, bC); + c9 = vtrn1q_u32(b9, bD); + cA = vtrn1q_u32(bA, bE); + cB = vtrn1q_u32(bB, bF); + cC 
= vtrn2q_u32(b8, bC); + cD = vtrn2q_u32(b9, bD); + cE = vtrn2q_u32(bA, bE); + cF = vtrn2q_u32(bB, bF); + + d0 = vtrn1q_u16(c0, c2); + d1 = vtrn1q_u16(c1, c3); + d2 = vtrn2q_u16(c0, c2); + d3 = vtrn2q_u16(c1, c3); + d4 = vtrn1q_u16(c4, c6); + d5 = vtrn1q_u16(c5, c7); + d6 = vtrn2q_u16(c4, c6); + d7 = vtrn2q_u16(c5, c7); + d8 = vtrn1q_u16(c8, cA); + d9 = vtrn1q_u16(c9, cB); + dA = vtrn2q_u16(c8, cA); + dB = vtrn2q_u16(c9, cB); + dC = vtrn1q_u16(cC, cE); + dD = vtrn1q_u16(cD, cF); + dE = vtrn2q_u16(cC, cE); + dF = vtrn2q_u16(cD, cF); + + *(uint16x8_t *)(dst + 0 * dstride) = vtrn1q_u8(d0, d1); + *(uint16x8_t *)(dst + 1 * dstride) = vtrn2q_u8(d0, d1); + *(uint16x8_t *)(dst + 2 * dstride) = vtrn1q_u8(d2, d3); + *(uint16x8_t *)(dst + 3 * dstride) = vtrn2q_u8(d2, d3); + *(uint16x8_t *)(dst + 4 * dstride) = vtrn1q_u8(d4, d5); + *(uint16x8_t *)(dst + 5 * dstride) = vtrn2q_u8(d4, d5); + *(uint16x8_t *)(dst + 6 * dstride) = vtrn1q_u8(d6, d7); + *(uint16x8_t *)(dst + 7 * dstride) = vtrn2q_u8(d6, d7); + *(uint16x8_t *)(dst + 8 * dstride) = vtrn1q_u8(d8, d9); + *(uint16x8_t *)(dst + 9 * dstride) = vtrn2q_u8(d8, d9); + *(uint16x8_t *)(dst + 10 * dstride) = vtrn1q_u8(dA, dB); + *(uint16x8_t *)(dst + 11 * dstride) = vtrn2q_u8(dA, dB); + *(uint16x8_t *)(dst + 12 * dstride) = vtrn1q_u8(dC, dD); + *(uint16x8_t *)(dst + 13 * dstride) = vtrn2q_u8(dC, dD); + *(uint16x8_t *)(dst + 14 * dstride) = vtrn1q_u8(dE, dF); + *(uint16x8_t *)(dst + 15 * dstride) = vtrn2q_u8(dE, dF); + + +} + + +void transpose32x32(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride) +{ + //assumption: there is no partial overlap + transpose16x16(dst, src, dstride, sstride); + transpose16x16(dst + 16 * dstride + 16, src + 16 * sstride + 16, dstride, sstride); + if (dst == src) + { + uint8_t tmp16 * 16 __attribute__((aligned(64))); + transpose16x16(tmp, src + 16, 16, sstride); + transpose16x16(dst + 16, src + 16 * sstride, dstride, sstride); + for (int i = 0; i < 16; i++) + { + COPY_16(dst + (16 + i)*dstride, tmp + 16 * i); + } + } + else + { + transpose16x16(dst + 16 * dstride, src + 16, dstride, sstride); + transpose16x16(dst + 16, src + 16 * sstride, dstride, sstride); + } + +} + + + +void transpose8x8(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride) +{ + uint16x8_t a0, a1, a2, a3, a4, a5, a6, a7; + uint16x8_t b0, b1, b2, b3, b4, b5, b6, b7; + + a0 = *(uint16x8_t *)(src + 0 * sstride); + a1 = *(uint16x8_t *)(src + 1 * sstride); + a2 = *(uint16x8_t *)(src + 2 * sstride); + a3 = *(uint16x8_t *)(src + 3 * sstride); + a4 = *(uint16x8_t *)(src + 4 * sstride); + a5 = *(uint16x8_t *)(src + 5 * sstride); + a6 = *(uint16x8_t *)(src + 6 * sstride); + a7 = *(uint16x8_t *)(src + 7 * sstride); + + b0 = vtrn1q_u64(a0, a4); + b1 = vtrn1q_u64(a1, a5); + b2 = vtrn1q_u64(a2, a6); + b3 = vtrn1q_u64(a3, a7); + b4 = vtrn2q_u64(a0, a4); + b5 = vtrn2q_u64(a1, a5); + b6 = vtrn2q_u64(a2, a6); + b7 = vtrn2q_u64(a3, a7); + + a0 = vtrn1q_u32(b0, b2); + a1 = vtrn1q_u32(b1, b3); + a2 = vtrn2q_u32(b0, b2); + a3 = vtrn2q_u32(b1, b3); + a4 = vtrn1q_u32(b4, b6); + a5 = vtrn1q_u32(b5, b7); + a6 = vtrn2q_u32(b4, b6); + a7 = vtrn2q_u32(b5, b7); + + b0 = vtrn1q_u16(a0, a1); + b1 = vtrn2q_u16(a0, a1); + b2 = vtrn1q_u16(a2, a3); + b3 = vtrn2q_u16(a2, a3); + b4 = vtrn1q_u16(a4, a5); + b5 = vtrn2q_u16(a4, a5); + b6 = vtrn1q_u16(a6, a7); + b7 = vtrn2q_u16(a6, a7); + + *(uint16x8_t *)(dst + 0 * dstride) = b0; + *(uint16x8_t *)(dst + 1 * dstride) = b1; + *(uint16x8_t *)(dst + 2 * dstride) = b2; + *(uint16x8_t *)(dst + 3 * dstride) = b3; + 
*(uint16x8_t *)(dst + 4 * dstride) = b4; + *(uint16x8_t *)(dst + 5 * dstride) = b5; + *(uint16x8_t *)(dst + 6 * dstride) = b6; + *(uint16x8_t *)(dst + 7 * dstride) = b7; +} + +void transpose16x16(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride) +{ + //assumption: there is no partial overlap + transpose8x8(dst, src, dstride, sstride); + transpose8x8(dst + 8 * dstride + 8, src + 8 * sstride + 8, dstride, sstride); + + if (dst == src) + { + uint16_t tmp8 * 8; + transpose8x8(tmp, src + 8, 8, sstride); + transpose8x8(dst + 8, src + 8 * sstride, dstride, sstride); + for (int i = 0; i < 8; i++) + { + COPY_16(dst + (8 + i)*dstride, tmp + 8 * i); + } + } + else + { + transpose8x8(dst + 8 * dstride, src + 8, dstride, sstride); + transpose8x8(dst + 8, src + 8 * sstride, dstride, sstride); + } + +} + + + +void transpose32x32(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride) +{ + //assumption: there is no partial overlap + for (int i = 0; i < 4; i++) + { + transpose8x8(dst + i * 8 * (1 + dstride), src + i * 8 * (1 + sstride), dstride, sstride); + for (int j = i + 1; j < 4; j++) + { + if (dst == src) + { + uint16_t tmp8 * 8 __attribute__((aligned(64))); + transpose8x8(tmp, src + 8 * i + 8 * j * sstride, 8, sstride); + transpose8x8(dst + 8 * i + 8 * j * dstride, src + 8 * j + 8 * i * sstride, dstride, sstride); + for (int k = 0; k < 8; k++) + { + COPY_16(dst + 8 * j + (8 * i + k)*dstride, tmp + 8 * k); + } + } + else + { + transpose8x8(dst + 8 * (j + i * dstride), src + 8 * (i + j * sstride), dstride, sstride); + transpose8x8(dst + 8 * (i + j * dstride), src + 8 * (j + i * sstride), dstride, sstride); + } + + } + } +} + + + + +} + + +
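The NEON routines above assemble 8x8 and 16x16 byte/halfword transposes from vtrn1/vtrn2 interleaves and then tile the 32x32 case out of smaller blocks, using a temporary when dst and src alias. A plain scalar reference such as the one below, which is not part of x265 and is shown only as a cross-checking aid, is a convenient way to validate such intrinsic code in a unit test (compare its output against the intrinsic version on the same input).

// Plain scalar reference transpose for cross-checking the intrinsic versions (illustrative only).
#include <cassert>
#include <cstdint>
#include <cstring>

template <typename T>
static void transpose_ref(T *dst, const T *src, intptr_t dstride, intptr_t sstride, int n)
{
    assert(dst != src); // in-place use (as in the 32x32 routines) needs a temporary block
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            dst[x * dstride + y] = src[y * sstride + x];
}

int main()
{
    uint8_t src[16 * 16], a[16 * 16], b[16 * 16];
    for (int i = 0; i < 16 * 16; i++)
        src[i] = (uint8_t)i;
    transpose_ref(a, src, 16, 16, 16);   // reference transpose
    transpose_ref(b, a, 16, 16, 16);     // transposing twice must reproduce the input
    assert(std::memcmp(b, src, sizeof(src)) == 0);
    return 0;
}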
View file
x265_3.6.tar.gz/source/common/aarch64/arm64-utils.h
Added
@@ -0,0 +1,15 @@
+#ifndef __ARM64_UTILS_H__
+#define __ARM64_UTILS_H__
+
+
+namespace X265_NS
+{
+void transpose8x8(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride);
+void transpose16x16(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride);
+void transpose32x32(uint8_t *dst, const uint8_t *src, intptr_t dstride, intptr_t sstride);
+void transpose8x8(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride);
+void transpose16x16(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride);
+void transpose32x32(uint16_t *dst, const uint16_t *src, intptr_t dstride, intptr_t sstride);
+}
+
+#endif
View file
x265_3.5.tar.gz/source/common/aarch64/asm-primitives.cpp -> x265_3.6.tar.gz/source/common/aarch64/asm-primitives.cpp
Changed
@@ -3,6 +3,7 @@ * * Authors: Hongbin Liu <liuhongbin1@huawei.com> * Yimeng Su <yimeng.su@huawei.com> + * Sebastian Pop <spop@amazon.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -22,11 +23,659 @@ * For more information, contact us at license @ x265.com. *****************************************************************************/ + #include "common.h" #include "primitives.h" #include "x265.h" #include "cpu.h" +extern "C" { +#include "fun-decls.h" +} + +#define ALL_LUMA_TU_TYPED(prim, fncdef, fname, cpu) \ + p.cuBLOCK_4x4.prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.cuBLOCK_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.cuBLOCK_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.cuBLOCK_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.cuBLOCK_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu) +#define LUMA_TU_TYPED_NEON(prim, fncdef, fname) \ + p.cuBLOCK_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.cuBLOCK_8x8.prim = fncdef PFX(fname ## _8x8_ ## neon); \ + p.cuBLOCK_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \ + p.cuBLOCK_64x64.prim = fncdef PFX(fname ## _64x64_ ## neon) +#define LUMA_TU_TYPED_CAN_USE_SVE(prim, fncdef, fname) \ + p.cuBLOCK_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve) +#define ALL_LUMA_TU(prim, fname, cpu) ALL_LUMA_TU_TYPED(prim, , fname, cpu) +#define LUMA_TU_NEON(prim, fname) LUMA_TU_TYPED_NEON(prim, , fname) +#define LUMA_TU_CAN_USE_SVE(prim, fname) LUMA_TU_TYPED_CAN_USE_SVE(prim, , fname) + +#define ALL_LUMA_PU_TYPED(prim, fncdef, fname, cpu) \ + p.puLUMA_4x4.prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.puLUMA_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \ + p.puLUMA_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.puLUMA_4x8.prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.puLUMA_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.puLUMA_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \ + p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.puLUMA_16x4.prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.puLUMA_4x16.prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.puLUMA_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.puLUMA_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \ + p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \ + p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu); \ + p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu) +#define LUMA_PU_TYPED_MULTIPLE_ARCHS_1(prim, fncdef, fname, cpu) \ + p.puLUMA_4x4.prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.puLUMA_4x8.prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.puLUMA_4x16.prim = fncdef PFX(fname ## _4x16_ ## cpu) +#define LUMA_PU_TYPED_MULTIPLE_ARCHS_2(prim, fncdef, fname, cpu) \ + p.puLUMA_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + 
p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \ + p.puLUMA_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.puLUMA_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.puLUMA_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \ + p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.puLUMA_16x4.prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.puLUMA_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.puLUMA_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \ + p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \ + p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu); \ + p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu) +#define LUMA_PU_TYPED_NEON_1(prim, fncdef, fname) \ + p.puLUMA_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.puLUMA_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.puLUMA_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon); \ + p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## neon); \ + p.puLUMA_8x8.prim = fncdef PFX(fname ## _8x8_ ## neon); \ + p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \ + p.puLUMA_8x4.prim = fncdef PFX(fname ## _8x4_ ## neon); \ + p.puLUMA_16x8.prim = fncdef PFX(fname ## _16x8_ ## neon); \ + p.puLUMA_8x16.prim = fncdef PFX(fname ## _8x16_ ## neon); \ + p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## neon); \ + p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \ + p.puLUMA_16x4.prim = fncdef PFX(fname ## _16x4_ ## neon); \ + p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## neon); \ + p.puLUMA_8x32.prim = fncdef PFX(fname ## _8x32_ ## neon); \ + p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## neon); \ + p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## neon) +#define LUMA_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fncdef, fname) \ + p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve); \ + p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## sve); \ + p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve); \ + p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## sve); \ + p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve); \ + p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## sve); \ + p.puLUMA_32x8.prim = fncdef PFX(fname ## _32x8_ ## sve); \ + p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## sve); \ + p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## sve) +#define LUMA_PU_TYPED_NEON_2(prim, fncdef, fname) \ + p.puLUMA_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.puLUMA_8x4.prim = fncdef PFX(fname ## _8x4_ ## neon); \ + p.puLUMA_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.puLUMA_8x8.prim = fncdef PFX(fname ## _8x8_ ## neon); \ + p.puLUMA_16x8.prim = fncdef PFX(fname ## _16x8_ ## neon); \ + p.puLUMA_8x16.prim = fncdef PFX(fname ## _8x16_ ## neon); \ + p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \ + p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \ + 
p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## neon); \ + p.puLUMA_16x4.prim = fncdef PFX(fname ## _16x4_ ## neon); \ + p.puLUMA_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon); \ + p.puLUMA_8x32.prim = fncdef PFX(fname ## _8x32_ ## neon); \ + p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## neon) +#define LUMA_PU_TYPED_MULTIPLE_ARCHS_3(prim, fncdef, fname, cpu) \ + p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \ + p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \ + p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.puLUMA_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \ + p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \ + p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu) +#define LUMA_PU_TYPED_NEON_3(prim, fncdef, fname) \ + p.puLUMA_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.puLUMA_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.puLUMA_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon) +#define LUMA_PU_TYPED_CAN_USE_SVE2(prim, fncdef, fname) \ + p.puLUMA_8x8.prim = fncdef PFX(fname ## _8x8_ ## sve2); \ + p.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## sve2); \ + p.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve2); \ + p.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## sve2); \ + p.puLUMA_8x4.prim = fncdef PFX(fname ## _8x4_ ## sve2); \ + p.puLUMA_16x8.prim = fncdef PFX(fname ## _16x8_ ## sve2); \ + p.puLUMA_8x16.prim = fncdef PFX(fname ## _8x16_ ## sve2); \ + p.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## sve2); \ + p.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve2); \ + p.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## sve2); \ + p.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve2); \ + p.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## sve2); \ + p.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## sve2); \ + p.puLUMA_16x4.prim = fncdef PFX(fname ## _16x4_ ## sve2); \ + p.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## sve2); \ + p.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## sve2); \ + p.puLUMA_32x8.prim = fncdef PFX(fname ## _32x8_ ## sve2); \ + p.puLUMA_8x32.prim = fncdef PFX(fname ## _8x32_ ## sve2); \ + p.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## sve2); \ + p.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## sve2); \ + p.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## sve2); \ + p.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## sve2) +#define LUMA_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.puLUMA_4x4.prim = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \ + p.puLUMA_8x8.prim = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \ + p.puLUMA_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \ + p.puLUMA_8x4.prim = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \ + p.puLUMA_4x8.prim = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \ + p.puLUMA_16x8.prim = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \ + p.puLUMA_8x16.prim = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \ + p.puLUMA_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \ + p.puLUMA_16x12.prim = fncdef PFX(filterPixelToShort ## _16x12_ ## neon); \ + 
p.puLUMA_12x16.prim = fncdef PFX(filterPixelToShort ## _12x16_ ## neon); \ + p.puLUMA_16x4.prim = fncdef PFX(filterPixelToShort ## _16x4_ ## neon); \ + p.puLUMA_4x16.prim = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \ + p.puLUMA_24x32.prim = fncdef PFX(filterPixelToShort ## _24x32_ ## neon); \ + p.puLUMA_8x32.prim = fncdef PFX(filterPixelToShort ## _8x32_ ## neon); \ + p.puLUMA_16x64.prim = fncdef PFX(filterPixelToShort ## _16x64_ ## neon) +#define LUMA_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.puLUMA_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \ + p.puLUMA_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve); \ + p.puLUMA_32x64.prim = fncdef PFX(filterPixelToShort ## _32x64_ ## sve); \ + p.puLUMA_32x24.prim = fncdef PFX(filterPixelToShort ## _32x24_ ## sve); \ + p.puLUMA_32x8.prim = fncdef PFX(filterPixelToShort ## _32x8_ ## sve); \ + p.puLUMA_64x64.prim = fncdef PFX(filterPixelToShort ## _64x64_ ## sve); \ + p.puLUMA_64x32.prim = fncdef PFX(filterPixelToShort ## _64x32_ ## sve); \ + p.puLUMA_64x48.prim = fncdef PFX(filterPixelToShort ## _64x48_ ## sve); \ + p.puLUMA_64x16.prim = fncdef PFX(filterPixelToShort ## _64x16_ ## sve); \ + p.puLUMA_48x64.prim = fncdef PFX(filterPixelToShort ## _48x64_ ## sve) +#define ALL_LUMA_PU(prim, fname, cpu) ALL_LUMA_PU_TYPED(prim, , fname, cpu) +#define LUMA_PU_MULTIPLE_ARCHS_1(prim, fname, cpu) LUMA_PU_TYPED_MULTIPLE_ARCHS_1(prim, , fname, cpu) +#define LUMA_PU_MULTIPLE_ARCHS_2(prim, fname, cpu) LUMA_PU_TYPED_MULTIPLE_ARCHS_2(prim, , fname, cpu) +#define LUMA_PU_NEON_1(prim, fname) LUMA_PU_TYPED_NEON_1(prim, , fname) +#define LUMA_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fname) LUMA_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, , fname) +#define LUMA_PU_NEON_2(prim, fname) LUMA_PU_TYPED_NEON_2(prim, , fname) +#define LUMA_PU_MULTIPLE_ARCHS_3(prim, fname, cpu) LUMA_PU_TYPED_MULTIPLE_ARCHS_3(prim, , fname, cpu) +#define LUMA_PU_NEON_3(prim, fname) LUMA_PU_TYPED_NEON_3(prim, , fname) +#define LUMA_PU_CAN_USE_SVE2(prim, fname) LUMA_PU_TYPED_CAN_USE_SVE2(prim, , fname) +#define LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(prim) LUMA_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, ) +#define LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) LUMA_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, ) + + +#define ALL_LUMA_PU_T(prim, fname) \ + p.puLUMA_4x4.prim = fname<LUMA_4x4>; \ + p.puLUMA_8x8.prim = fname<LUMA_8x8>; \ + p.puLUMA_16x16.prim = fname<LUMA_16x16>; \ + p.puLUMA_32x32.prim = fname<LUMA_32x32>; \ + p.puLUMA_64x64.prim = fname<LUMA_64x64>; \ + p.puLUMA_8x4.prim = fname<LUMA_8x4>; \ + p.puLUMA_4x8.prim = fname<LUMA_4x8>; \ + p.puLUMA_16x8.prim = fname<LUMA_16x8>; \ + p.puLUMA_8x16.prim = fname<LUMA_8x16>; \ + p.puLUMA_16x32.prim = fname<LUMA_16x32>; \ + p.puLUMA_32x16.prim = fname<LUMA_32x16>; \ + p.puLUMA_64x32.prim = fname<LUMA_64x32>; \ + p.puLUMA_32x64.prim = fname<LUMA_32x64>; \ + p.puLUMA_16x12.prim = fname<LUMA_16x12>; \ + p.puLUMA_12x16.prim = fname<LUMA_12x16>; \ + p.puLUMA_16x4.prim = fname<LUMA_16x4>; \ + p.puLUMA_4x16.prim = fname<LUMA_4x16>; \ + p.puLUMA_32x24.prim = fname<LUMA_32x24>; \ + p.puLUMA_24x32.prim = fname<LUMA_24x32>; \ + p.puLUMA_32x8.prim = fname<LUMA_32x8>; \ + p.puLUMA_8x32.prim = fname<LUMA_8x32>; \ + p.puLUMA_64x48.prim = fname<LUMA_64x48>; \ + p.puLUMA_48x64.prim = fname<LUMA_48x64>; \ + p.puLUMA_64x16.prim = fname<LUMA_64x16>; \ + p.puLUMA_16x64.prim = fname<LUMA_16x64> + +#define ALL_CHROMA_420_PU_TYPED(prim, fncdef, fname, cpu) \ + p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim = fncdef 
PFX(fname ## _4x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim = fncdef PFX(fname ## _4x2_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim = fncdef PFX(fname ## _2x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim = fncdef PFX(fname ## _8x6_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim = fncdef PFX(fname ## _6x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim = fncdef PFX(fname ## _8x2_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim = fncdef PFX(fname ## _2x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu) +#define CHROMA_420_PU_TYPED_NEON_1(prim, fncdef, fname) \ + p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim = fncdef PFX(fname ## _4x2_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim = fncdef PFX(fname ## _6x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim = fncdef PFX(fname ## _32x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim = fncdef PFX(fname ## _8x32_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim = fncdef PFX(fname ## _8x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim = fncdef PFX(fname ## _2x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim = fncdef PFX(fname ## _8x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim = fncdef PFX(fname ## _16x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim = fncdef PFX(fname ## _8x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname 
## _16x32_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim = fncdef PFX(fname ## _8x6_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim = fncdef PFX(fname ## _8x2_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim = fncdef PFX(fname ## _2x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim = fncdef PFX(fname ## _16x4_ ## neon) +#define CHROMA_420_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fncdef, fname) \ + p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve) +#define CHROMA_420_PU_TYPED_NEON_2(prim, fncdef, fname) \ + p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim = fncdef PFX(fname ## _4x2_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon) +#define CHROMA_420_PU_TYPED_MULTIPLE_ARCHS(prim, fncdef, fname, cpu) \ + p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim = fncdef PFX(fname ## _2x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim = fncdef PFX(fname ## _8x6_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim = fncdef PFX(fname ## _6x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim = fncdef PFX(fname ## _8x2_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim = fncdef PFX(fname ## _2x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu) +#define CHROMA_420_PU_TYPED_FILTER_PIXEL_TO_SHORT_NEON(prim, fncdef) \ + p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim 
= fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim = fncdef PFX(filterPixelToShort ## _8x6_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim = fncdef PFX(filterPixelToShort ## _8x2_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(filterPixelToShort ## _16x12_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(filterPixelToShort ## _12x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim = fncdef PFX(filterPixelToShort ## _16x4_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(filterPixelToShort ## _24x32_ ## neon); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim = fncdef PFX(filterPixelToShort ## _8x32_ ## neon) +#define CHROMA_420_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.chromaX265_CSP_I420.puCHROMA_420_2x4.prim = fncdef PFX(filterPixelToShort ## _2x4_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_2x8.prim = fncdef PFX(filterPixelToShort ## _2x8_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_6x8.prim = fncdef PFX(filterPixelToShort ## _6x8_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x2.prim = fncdef PFX(filterPixelToShort ## _4x2_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(filterPixelToShort ## _32x24_ ## sve); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim = fncdef PFX(filterPixelToShort ## _32x8_ ## sve) +#define ALL_CHROMA_420_PU(prim, fname, cpu) ALL_CHROMA_420_PU_TYPED(prim, , fname, cpu) +#define CHROMA_420_PU_NEON_1(prim, fname) CHROMA_420_PU_TYPED_NEON_1(prim, , fname) +#define CHROMA_420_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fname) CHROMA_420_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, , fname) +#define CHROMA_420_PU_NEON_2(prim, fname) CHROMA_420_PU_TYPED_NEON_2(prim, , fname) +#define CHROMA_420_PU_MULTIPLE_ARCHS(prim, fname, cpu) CHROMA_420_PU_TYPED_MULTIPLE_ARCHS(prim, , fname, cpu) +#define CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(prim) CHROMA_420_PU_TYPED_FILTER_PIXEL_TO_SHORT_NEON(prim, ) +#define CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) CHROMA_420_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, ) + + +#define ALL_CHROMA_420_4x4_PU_TYPED(prim, fncdef, fname, cpu) \ + p.chromaX265_CSP_I420.puCHROMA_420_4x4.prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x2.prim = fncdef PFX(fname ## _8x2_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x6.prim = fncdef PFX(fname ## _8x6_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x8.prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + 
p.chromaX265_CSP_I420.puCHROMA_420_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_16x4.prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_4x16.prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chromaX265_CSP_I420.puCHROMA_420_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu) +#define ALL_CHROMA_420_4x4_PU(prim, fname, cpu) ALL_CHROMA_420_4x4_PU_TYPED(prim, , fname, cpu) + +#define ALL_CHROMA_422_PU_TYPED(prim, fncdef, fname, cpu) \ + p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim = fncdef PFX(fname ## _2x8_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim = fncdef PFX(fname ## _8x12_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim = fncdef PFX(fname ## _6x16_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim = fncdef PFX(fname ## _2x16_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(fname ## _16x24_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(fname ## _12x32_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim = fncdef PFX(fname ## _4x32_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(fname ## _32x48_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(fname ## _24x64_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim = fncdef PFX(fname ## _8x64_ ## cpu) +#define CHROMA_422_PU_TYPED_NEON_1(prim, fncdef, fname) \ + p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon); \ + 
p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim = fncdef PFX(fname ## _6x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(fname ## _12x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim = fncdef PFX(fname ## _4x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim = fncdef PFX(fname ## _8x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(fname ## _16x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim = fncdef PFX(fname ## _2x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim = fncdef PFX(fname ## _8x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(fname ## _16x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim = fncdef PFX(fname ## _8x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(fname ## _16x64_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim = fncdef PFX(fname ## _8x12_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim = fncdef PFX(fname ## _8x4_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim = fncdef PFX(fname ## _2x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(fname ## _16x24_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim = fncdef PFX(fname ## _16x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(fname ## _24x64_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim = fncdef PFX(fname ## _8x64_ ## neon) +#define CHROMA_422_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fncdef, fname) \ + p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(fname ## _32x48_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve) +#define CHROMA_422_PU_TYPED_NEON_2(prim, fncdef, fname) \ + p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim = fncdef PFX(fname ## _4x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim = fncdef PFX(fname ## _4x4_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim = fncdef PFX(fname ## _4x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim = fncdef PFX(fname ## _4x32_ ## neon) +#define CHROMA_422_PU_TYPED_CAN_USE_SVE2(prim, fncdef, fname) \ + p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim = fncdef PFX(fname ## _8x16_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(fname ## _16x32_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(fname ## _32x64_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim = fncdef PFX(fname ## _2x8_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim = fncdef PFX(fname ## _8x8_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(fname ## _16x16_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim = fncdef PFX(fname ## _8x32_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(fname ## _32x32_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(fname ## _16x64_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim = fncdef PFX(fname ## _8x12_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim = fncdef PFX(fname ## _6x16_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim = fncdef PFX(fname ## _8x4_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim = fncdef PFX(fname ## _2x16_ ## 
sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(fname ## _16x24_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(fname ## _12x32_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim = fncdef PFX(fname ## _16x8_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(fname ## _32x48_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(fname ## _24x64_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(fname ## _32x16_ ## sve2); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim = fncdef PFX(fname ## _8x64_ ## sve2) +#define CHROMA_422_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.chromaX265_CSP_I422.puCHROMA_422_4x8.prim = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x16.prim = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x4.prim = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x8.prim = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x16.prim = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x32.prim = fncdef PFX(filterPixelToShort ## _8x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x64.prim = fncdef PFX(filterPixelToShort ## _16x64_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x12.prim = fncdef PFX(filterPixelToShort ## _8x12_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x4.prim = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x24.prim = fncdef PFX(filterPixelToShort ## _16x24_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_12x32.prim = fncdef PFX(filterPixelToShort ## _12x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_16x8.prim = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_4x32.prim = fncdef PFX(filterPixelToShort ## _4x32_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_24x64.prim = fncdef PFX(filterPixelToShort ## _24x64_ ## neon); \ + p.chromaX265_CSP_I422.puCHROMA_422_8x64.prim = fncdef PFX(filterPixelToShort ## _8x64_ ## neon) +#define CHROMA_422_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.chromaX265_CSP_I422.puCHROMA_422_2x8.prim = fncdef PFX(filterPixelToShort ## _2x8_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_2x16.prim = fncdef PFX(filterPixelToShort ## _2x16_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_6x16.prim = fncdef PFX(filterPixelToShort ## _6x16_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x64.prim = fncdef PFX(filterPixelToShort ## _32x64_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x48.prim = fncdef PFX(filterPixelToShort ## _32x48_ ## sve); \ + p.chromaX265_CSP_I422.puCHROMA_422_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve) +#define ALL_CHROMA_422_PU(prim, fname, cpu) ALL_CHROMA_422_PU_TYPED(prim, , fname, cpu) +#define CHROMA_422_PU_NEON_1(prim, fname) CHROMA_422_PU_TYPED_NEON_1(prim, , fname) +#define CHROMA_422_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, fname) CHROMA_422_PU_TYPED_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(prim, , fname) +#define 
CHROMA_422_PU_NEON_2(prim, fname) CHROMA_422_PU_TYPED_NEON_2(prim, , fname) +#define CHROMA_422_PU_CAN_USE_SVE2(prim, fname) CHROMA_422_PU_TYPED_CAN_USE_SVE2(prim, , fname) +#define CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(prim) CHROMA_422_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, ) +#define CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) CHROMA_422_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, ) + +#define ALL_CHROMA_444_PU_TYPED(prim, fncdef, fname, cpu) \ + p.chromaX265_CSP_I444.puLUMA_4x4.prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_8x8.prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_16x16.prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_32x32.prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_64x64.prim = fncdef PFX(fname ## _64x64_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_8x4.prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_4x8.prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_16x8.prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_8x16.prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_16x32.prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_32x16.prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_64x32.prim = fncdef PFX(fname ## _64x32_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_32x64.prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_16x12.prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_12x16.prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_16x4.prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_4x16.prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_32x24.prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_24x32.prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_32x8.prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_8x32.prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_64x48.prim = fncdef PFX(fname ## _64x48_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_48x64.prim = fncdef PFX(fname ## _48x64_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_64x16.prim = fncdef PFX(fname ## _64x16_ ## cpu); \ + p.chromaX265_CSP_I444.puLUMA_16x64.prim = fncdef PFX(fname ## _16x64_ ## cpu) +#define CHROMA_444_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.chromaX265_CSP_I444.puLUMA_4x4.prim = fncdef PFX(filterPixelToShort ## _4x4_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_8x8.prim = fncdef PFX(filterPixelToShort ## _8x8_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_16x16.prim = fncdef PFX(filterPixelToShort ## _16x16_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_8x4.prim = fncdef PFX(filterPixelToShort ## _8x4_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_4x8.prim = fncdef PFX(filterPixelToShort ## _4x8_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_16x8.prim = fncdef PFX(filterPixelToShort ## _16x8_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_8x16.prim = fncdef PFX(filterPixelToShort ## _8x16_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_16x32.prim = fncdef PFX(filterPixelToShort ## _16x32_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_16x12.prim = fncdef PFX(filterPixelToShort ## _16x12_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_12x16.prim = fncdef PFX(filterPixelToShort ## _12x16_ ## neon); \ + 
p.chromaX265_CSP_I444.puLUMA_16x4.prim = fncdef PFX(filterPixelToShort ## _16x4_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_4x16.prim = fncdef PFX(filterPixelToShort ## _4x16_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_24x32.prim = fncdef PFX(filterPixelToShort ## _24x32_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_8x32.prim = fncdef PFX(filterPixelToShort ## _8x32_ ## neon); \ + p.chromaX265_CSP_I444.puLUMA_16x64.prim = fncdef PFX(filterPixelToShort ## _16x64_ ## neon) +#define CHROMA_444_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, fncdef) \ + p.chromaX265_CSP_I444.puLUMA_32x32.prim = fncdef PFX(filterPixelToShort ## _32x32_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_32x16.prim = fncdef PFX(filterPixelToShort ## _32x16_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_32x64.prim = fncdef PFX(filterPixelToShort ## _32x64_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_32x24.prim = fncdef PFX(filterPixelToShort ## _32x24_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_32x8.prim = fncdef PFX(filterPixelToShort ## _32x8_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_64x64.prim = fncdef PFX(filterPixelToShort ## _64x64_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_64x32.prim = fncdef PFX(filterPixelToShort ## _64x32_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_64x48.prim = fncdef PFX(filterPixelToShort ## _64x48_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_64x16.prim = fncdef PFX(filterPixelToShort ## _64x16_ ## sve); \ + p.chromaX265_CSP_I444.puLUMA_48x64.prim = fncdef PFX(filterPixelToShort ## _48x64_ ## sve) +#define ALL_CHROMA_444_PU(prim, fname, cpu) ALL_CHROMA_444_PU_TYPED(prim, , fname, cpu) +#define CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(prim) CHROMA_444_PU_TYPED_NEON_FILTER_PIXEL_TO_SHORT(prim, ) +#define CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(prim) CHROMA_444_PU_TYPED_SVE_FILTER_PIXEL_TO_SHORT(prim, ) + +#define ALL_CHROMA_420_VERT_FILTERS(cpu) \ + ALL_CHROMA_420_4x4_PU(filter_vpp, interp_4tap_vert_pp, cpu); \ + ALL_CHROMA_420_4x4_PU(filter_vps, interp_4tap_vert_ps, cpu); \ + ALL_CHROMA_420_4x4_PU(filter_vsp, interp_4tap_vert_sp, cpu); \ + ALL_CHROMA_420_4x4_PU(filter_vss, interp_4tap_vert_ss, cpu) + +#define CHROMA_420_VERT_FILTERS_NEON() \ + ALL_CHROMA_420_4x4_PU(filter_vsp, interp_4tap_vert_sp, neon) + +#define CHROMA_420_VERT_FILTERS_CAN_USE_SVE2() \ + ALL_CHROMA_420_4x4_PU(filter_vpp, interp_4tap_vert_pp, sve2); \ + ALL_CHROMA_420_4x4_PU(filter_vps, interp_4tap_vert_ps, sve2); \ + ALL_CHROMA_420_4x4_PU(filter_vss, interp_4tap_vert_ss, sve2) + +#define SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(W, H) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## _ ## neon) + +#define SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(W, H, cpu) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vpp = PFX(interp_4tap_vert_pp_ ## W ## x ## H ## _ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vps = PFX(interp_4tap_vert_ps_ ## W ## x ## H ## _ ## cpu); \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vss = PFX(interp_4tap_vert_ss_ ## W ## x ## H ## _ ## cpu) + +#define CHROMA_422_VERT_FILTERS_NEON() \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(4, 8); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 16); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 8); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(4, 16); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 12); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 4); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 32); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 16); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 32); \ + 
SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 24); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(12, 32); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 8); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(4, 32); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 64); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 32); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(16, 64); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 48); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(24, 64); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(32, 16); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_NEON(8, 64) + +#define CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(cpu) \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(4, 8, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 16, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 8, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(4, 16, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 12, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 4, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 32, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 16, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 32, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 24, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(12, 32, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 8, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(4, 32, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 64, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 32, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(16, 64, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 48, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(24, 64, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(32, 16, cpu); \ + SETUP_CHROMA_422_VERT_FUNC_DEF_CAN_USE_SVE2(8, 64, cpu) + +#define ALL_CHROMA_444_VERT_FILTERS(cpu) \ + ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, cpu); \ + ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, cpu); \ + ALL_CHROMA_444_PU(filter_vsp, interp_4tap_vert_sp, cpu); \ + ALL_CHROMA_444_PU(filter_vss, interp_4tap_vert_ss, cpu) + +#define CHROMA_444_VERT_FILTERS_NEON() \ + ALL_CHROMA_444_PU(filter_vsp, interp_4tap_vert_sp, neon) + +#define CHROMA_444_VERT_FILTERS_CAN_USE_SVE2() \ + ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, sve2); \ + ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, sve2); \ + ALL_CHROMA_444_PU(filter_vss, interp_4tap_vert_ss, sve2) + +#define ALL_CHROMA_420_FILTERS(cpu) \ + ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \ + ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, cpu); \ + ALL_CHROMA_420_PU(filter_vpp, interp_4tap_vert_pp, cpu); \ + ALL_CHROMA_420_PU(filter_vps, interp_4tap_vert_ps, cpu) + +#define CHROMA_420_FILTERS_NEON() \ + ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, neon); \ + ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, neon) + +#define CHROMA_420_FILTERS_CAN_USE_SVE2() \ + ALL_CHROMA_420_PU(filter_vpp, interp_4tap_vert_pp, sve2); \ + ALL_CHROMA_420_PU(filter_vps, interp_4tap_vert_ps, sve2) + +#define ALL_CHROMA_422_FILTERS(cpu) \ + ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \ + ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, cpu); \ + ALL_CHROMA_422_PU(filter_vpp, interp_4tap_vert_pp, cpu); \ + ALL_CHROMA_422_PU(filter_vps, interp_4tap_vert_ps, cpu) + +#define CHROMA_422_FILTERS_NEON() \ + ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, neon); \ + ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, neon) + +#define CHROMA_422_FILTERS_CAN_USE_SVE2() \ + 
    ALL_CHROMA_422_PU(filter_vpp, interp_4tap_vert_pp, sve2); \
+    ALL_CHROMA_422_PU(filter_vps, interp_4tap_vert_ps, sve2)
+
+#define ALL_CHROMA_444_FILTERS(cpu) \
+    ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \
+    ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, cpu); \
+    ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, cpu); \
+    ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, cpu)
+
+#define CHROMA_444_FILTERS_NEON() \
+    ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, neon); \
+    ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, neon)
+
+#define CHROMA_444_FILTERS_CAN_USE_SVE2() \
+    ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, sve2); \
+    ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, sve2)
+
 #if defined(__GNUC__)
 #define GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
@@ -35,18 +684,19 @@
 #define GCC_4_9_0 40900
 #define GCC_5_1_0 50100
-extern "C" {
-#include "pixel.h"
-#include "pixel-util.h"
-#include "ipfilter8.h"
-}
+#include "pixel-prim.h"
+#include "filter-prim.h"
+#include "dct-prim.h"
+#include "loopfilter-prim.h"
+#include "intrapred-prim.h"
-namespace X265_NS {
+namespace X265_NS
+{
 // private x265 namespace
 template<int size>
-void interp_8tap_hv_pp_cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY)
+void interp_8tap_hv_pp_cpu(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY)
 {
     ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_LUMA - 1)]);
     const int halfFilterSize = NTAPS_LUMA >> 1;
@@ -56,164 +706,1259 @@
     primitives.pu[size].luma_vsp(immed + (halfFilterSize - 1) * immedStride, immedStride, dst, dstStride, idxY);
 }
-
-/* Temporary workaround because luma_vsp assembly primitive has not been completed
- * but interp_8tap_hv_pp_cpu uses mixed C primitive and assembly primitive.
- * Otherwise, segment fault occurs.
- */
-void setupAliasCPrimitives(EncoderPrimitives &cp, EncoderPrimitives &asmp, int cpuMask)
+void setupNeonPrimitives(EncoderPrimitives &p)
 {
-    if (cpuMask & X265_CPU_NEON)
-    {
-        asmp.pu[LUMA_8x4].luma_vsp = cp.pu[LUMA_8x4].luma_vsp;
-        asmp.pu[LUMA_8x8].luma_vsp = cp.pu[LUMA_8x8].luma_vsp;
-        asmp.pu[LUMA_8x16].luma_vsp = cp.pu[LUMA_8x16].luma_vsp;
-        asmp.pu[LUMA_8x32].luma_vsp = cp.pu[LUMA_8x32].luma_vsp;
-        asmp.pu[LUMA_12x16].luma_vsp = cp.pu[LUMA_12x16].luma_vsp;
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_5_1_0 /* gcc_version < gcc-5.1.0 */
-        asmp.pu[LUMA_16x4].luma_vsp = cp.pu[LUMA_16x4].luma_vsp;
-        asmp.pu[LUMA_16x8].luma_vsp = cp.pu[LUMA_16x8].luma_vsp;
-        asmp.pu[LUMA_16x12].luma_vsp = cp.pu[LUMA_16x12].luma_vsp;
-        asmp.pu[LUMA_16x16].luma_vsp = cp.pu[LUMA_16x16].luma_vsp;
-        asmp.pu[LUMA_16x32].luma_vsp = cp.pu[LUMA_16x32].luma_vsp;
-        asmp.pu[LUMA_16x64].luma_vsp = cp.pu[LUMA_16x64].luma_vsp;
-        asmp.pu[LUMA_32x16].luma_vsp = cp.pu[LUMA_32x16].luma_vsp;
-        asmp.pu[LUMA_32x24].luma_vsp = cp.pu[LUMA_32x24].luma_vsp;
-        asmp.pu[LUMA_32x32].luma_vsp = cp.pu[LUMA_32x32].luma_vsp;
-        asmp.pu[LUMA_32x64].luma_vsp = cp.pu[LUMA_32x64].luma_vsp;
-        asmp.pu[LUMA_48x64].luma_vsp = cp.pu[LUMA_48x64].luma_vsp;
-        asmp.pu[LUMA_64x16].luma_vsp = cp.pu[LUMA_64x16].luma_vsp;
-        asmp.pu[LUMA_64x32].luma_vsp = cp.pu[LUMA_64x32].luma_vsp;
-        asmp.pu[LUMA_64x48].luma_vsp = cp.pu[LUMA_64x48].luma_vsp;
-        asmp.pu[LUMA_64x64].luma_vsp = cp.pu[LUMA_64x64].luma_vsp;
-#if !AUTO_VECTORIZE || GCC_VERSION < GCC_4_9_0 /* gcc_version < gcc-4.9.0 */
-        asmp.pu[LUMA_4x4].luma_vsp = cp.pu[LUMA_4x4].luma_vsp;
-        asmp.pu[LUMA_4x8].luma_vsp = cp.pu[LUMA_4x8].luma_vsp;
-        asmp.pu[LUMA_4x16].luma_vsp = cp.pu[LUMA_4x16].luma_vsp;
-        asmp.pu[LUMA_24x32].luma_vsp = cp.pu[LUMA_24x32].luma_vsp;
-        asmp.pu[LUMA_32x8].luma_vsp = cp.pu[LUMA_32x8].luma_vsp;
+    setupPixelPrimitives_neon(p);
+    setupFilterPrimitives_neon(p);
+    setupDCTPrimitives_neon(p);
+    setupLoopFilterPrimitives_neon(p);
+    setupIntraPrimitives_neon(p);
+
+    ALL_CHROMA_420_PU(p2s[NONALIGNED], filterPixelToShort, neon);
+    ALL_CHROMA_422_PU(p2s[ALIGNED], filterPixelToShort, neon);
+    ALL_CHROMA_444_PU(p2s[ALIGNED], filterPixelToShort, neon);
+    ALL_LUMA_PU(convert_p2s[ALIGNED], filterPixelToShort, neon);
+    ALL_CHROMA_420_PU(p2s[ALIGNED], filterPixelToShort, neon);
+    ALL_CHROMA_422_PU(p2s[NONALIGNED], filterPixelToShort, neon);
+    ALL_CHROMA_444_PU(p2s[NONALIGNED], filterPixelToShort, neon);
+    ALL_LUMA_PU(convert_p2s[NONALIGNED], filterPixelToShort, neon);
+
+#if !HIGH_BIT_DEPTH
+    ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, neon);
+    ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, neon);
+    ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, neon);
+    ALL_LUMA_PU(luma_hpp, interp_horiz_pp, neon);
+    ALL_LUMA_PU(luma_hps, interp_horiz_ps, neon);
+    ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, neon);
+    ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu);
+    ALL_CHROMA_420_VERT_FILTERS(neon);
+    CHROMA_422_VERT_FILTERS_NEON();
+    CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(neon);
+    ALL_CHROMA_444_VERT_FILTERS(neon);
+    ALL_CHROMA_420_FILTERS(neon);
+    ALL_CHROMA_422_FILTERS(neon);
+    ALL_CHROMA_444_FILTERS(neon);
+
+    // Blockcopy_pp
+    ALL_LUMA_PU(copy_pp, blockcopy_pp, neon);
+    ALL_CHROMA_420_PU(copy_pp, blockcopy_pp, neon);
+    ALL_CHROMA_422_PU(copy_pp, blockcopy_pp, neon);
+    p.cu[BLOCK_4x4].copy_pp = PFX(blockcopy_pp_4x4_neon);
+    p.cu[BLOCK_8x8].copy_pp = PFX(blockcopy_pp_8x8_neon);
+    p.cu[BLOCK_16x16].copy_pp = PFX(blockcopy_pp_16x16_neon);
+    p.cu[BLOCK_32x32].copy_pp = PFX(blockcopy_pp_32x32_neon);
+    p.cu[BLOCK_64x64].copy_pp = PFX(blockcopy_pp_64x64_neon);
+    p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].copy_pp = PFX(blockcopy_pp_4x4_neon);
+
p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_pp = PFX(blockcopy_pp_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_pp = PFX(blockcopy_pp_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_pp = PFX(blockcopy_pp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_pp = PFX(blockcopy_pp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_pp = PFX(blockcopy_pp_32x64_neon); + +#endif // !HIGH_BIT_DEPTH + + // Blockcopy_ss + p.cuBLOCK_4x4.copy_ss = PFX(blockcopy_ss_4x4_neon); + p.cuBLOCK_8x8.copy_ss = PFX(blockcopy_ss_8x8_neon); + p.cuBLOCK_16x16.copy_ss = PFX(blockcopy_ss_16x16_neon); + p.cuBLOCK_32x32.copy_ss = PFX(blockcopy_ss_32x32_neon); + p.cuBLOCK_64x64.copy_ss = PFX(blockcopy_ss_64x64_neon); + + // Blockcopy_ps + p.cuBLOCK_4x4.copy_ps = PFX(blockcopy_ps_4x4_neon); + p.cuBLOCK_8x8.copy_ps = PFX(blockcopy_ps_8x8_neon); + p.cuBLOCK_16x16.copy_ps = PFX(blockcopy_ps_16x16_neon); + p.cuBLOCK_32x32.copy_ps = PFX(blockcopy_ps_32x32_neon); + p.cuBLOCK_64x64.copy_ps = PFX(blockcopy_ps_64x64_neon); + + // Blockcopy_sp + p.cuBLOCK_4x4.copy_sp = PFX(blockcopy_sp_4x4_neon); + p.cuBLOCK_8x8.copy_sp = PFX(blockcopy_sp_8x8_neon); + p.cuBLOCK_16x16.copy_sp = PFX(blockcopy_sp_16x16_neon); + p.cuBLOCK_32x32.copy_sp = PFX(blockcopy_sp_32x32_neon); + p.cuBLOCK_64x64.copy_sp = PFX(blockcopy_sp_64x64_neon); + + // chroma blockcopy_ss + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ss = PFX(blockcopy_ss_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ss = PFX(blockcopy_ss_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ss = PFX(blockcopy_ss_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ss = PFX(blockcopy_ss_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ss = PFX(blockcopy_ss_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ss = PFX(blockcopy_ss_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ss = PFX(blockcopy_ss_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ss = PFX(blockcopy_ss_32x64_neon); + + // chroma blockcopy_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ps = PFX(blockcopy_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ps = PFX(blockcopy_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ps = PFX(blockcopy_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ps = PFX(blockcopy_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ps = PFX(blockcopy_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ps = PFX(blockcopy_ps_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ps = PFX(blockcopy_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ps = PFX(blockcopy_ps_32x64_neon); + + // chroma blockcopy_sp + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_sp = PFX(blockcopy_sp_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_sp = PFX(blockcopy_sp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_sp = PFX(blockcopy_sp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_sp = PFX(blockcopy_sp_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_sp = PFX(blockcopy_sp_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_sp = PFX(blockcopy_sp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_sp = PFX(blockcopy_sp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_sp = PFX(blockcopy_sp_32x64_neon); + + // Block_fill + 
ALL_LUMA_TU(blockfill_sALIGNED, blockfill_s, neon); + ALL_LUMA_TU(blockfill_sNONALIGNED, blockfill_s, neon); + + // copy_count + p.cuBLOCK_4x4.copy_cnt = PFX(copy_cnt_4_neon); + p.cuBLOCK_8x8.copy_cnt = PFX(copy_cnt_8_neon); + p.cuBLOCK_16x16.copy_cnt = PFX(copy_cnt_16_neon); + p.cuBLOCK_32x32.copy_cnt = PFX(copy_cnt_32_neon); + + // count nonzero + p.cuBLOCK_4x4.count_nonzero = PFX(count_nonzero_4_neon); + p.cuBLOCK_8x8.count_nonzero = PFX(count_nonzero_8_neon); + p.cuBLOCK_16x16.count_nonzero = PFX(count_nonzero_16_neon); + p.cuBLOCK_32x32.count_nonzero = PFX(count_nonzero_32_neon); + + // cpy2Dto1D_shl + p.cuBLOCK_4x4.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16x16_neon); + p.cuBLOCK_32x32.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32x32_neon); + p.cuBLOCK_64x64.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_64x64_neon); + + // cpy2Dto1D_shr + p.cuBLOCK_4x4.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_4x4_neon); + p.cuBLOCK_8x8.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_8x8_neon); + p.cuBLOCK_16x16.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_neon); + p.cuBLOCK_32x32.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_neon); + + // cpy1Dto2D_shl + p.cuBLOCK_4x4.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_16x16_neon); + p.cuBLOCK_32x32.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_32x32_neon); + p.cuBLOCK_64x64.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_64x64_neon); + + p.cuBLOCK_4x4.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_16x16_neon); + p.cuBLOCK_32x32.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_32x32_neon); + p.cuBLOCK_64x64.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_64x64_neon); + + // cpy1Dto2D_shr + p.cuBLOCK_4x4.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16x16_neon); + p.cuBLOCK_32x32.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32x32_neon); + p.cuBLOCK_64x64.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_64x64_neon); + +#if !HIGH_BIT_DEPTH + // pixel_avg_pp + ALL_LUMA_PU(pixelavg_ppNONALIGNED, pixel_avg_pp, neon); + ALL_LUMA_PU(pixelavg_ppALIGNED, pixel_avg_pp, neon); + + // addAvg + ALL_LUMA_PU(addAvgNONALIGNED, addAvg, neon); + ALL_LUMA_PU(addAvgALIGNED, addAvg, neon); + ALL_CHROMA_420_PU(addAvgNONALIGNED, addAvg, neon); + ALL_CHROMA_422_PU(addAvgNONALIGNED, addAvg, neon); + ALL_CHROMA_420_PU(addAvgALIGNED, addAvg, neon); + ALL_CHROMA_422_PU(addAvgALIGNED, addAvg, neon); + + // sad + ALL_LUMA_PU(sad, pixel_sad, neon); + ALL_LUMA_PU(sad_x3, sad_x3, neon); + ALL_LUMA_PU(sad_x4, sad_x4, neon); + + // sse_pp + p.cuBLOCK_4x4.sse_pp = PFX(pixel_sse_pp_4x4_neon); + p.cuBLOCK_8x8.sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.cuBLOCK_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.cuBLOCK_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon); + p.cuBLOCK_64x64.sse_pp = PFX(pixel_sse_pp_64x64_neon); + + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sse_pp = PFX(pixel_sse_pp_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sse_pp = 
PFX(pixel_sse_pp_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sse_pp = PFX(pixel_sse_pp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sse_pp = PFX(pixel_sse_pp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sse_pp = PFX(pixel_sse_pp_32x64_neon); + + // sse_ss + p.cuBLOCK_4x4.sse_ss = PFX(pixel_sse_ss_4x4_neon); + p.cuBLOCK_8x8.sse_ss = PFX(pixel_sse_ss_8x8_neon); + p.cuBLOCK_16x16.sse_ss = PFX(pixel_sse_ss_16x16_neon); + p.cuBLOCK_32x32.sse_ss = PFX(pixel_sse_ss_32x32_neon); + p.cuBLOCK_64x64.sse_ss = PFX(pixel_sse_ss_64x64_neon); + + // ssd_s + p.cuBLOCK_4x4.ssd_sNONALIGNED = PFX(pixel_ssd_s_4x4_neon); + p.cuBLOCK_8x8.ssd_sNONALIGNED = PFX(pixel_ssd_s_8x8_neon); + p.cuBLOCK_16x16.ssd_sNONALIGNED = PFX(pixel_ssd_s_16x16_neon); + p.cuBLOCK_32x32.ssd_sNONALIGNED = PFX(pixel_ssd_s_32x32_neon); + + p.cuBLOCK_4x4.ssd_sALIGNED = PFX(pixel_ssd_s_4x4_neon); + p.cuBLOCK_8x8.ssd_sALIGNED = PFX(pixel_ssd_s_8x8_neon); + p.cuBLOCK_16x16.ssd_sALIGNED = PFX(pixel_ssd_s_16x16_neon); + p.cuBLOCK_32x32.ssd_sALIGNED = PFX(pixel_ssd_s_32x32_neon); + + // pixel_var + p.cuBLOCK_8x8.var = PFX(pixel_var_8x8_neon); + p.cuBLOCK_16x16.var = PFX(pixel_var_16x16_neon); + p.cuBLOCK_32x32.var = PFX(pixel_var_32x32_neon); + p.cuBLOCK_64x64.var = PFX(pixel_var_64x64_neon); + + // calc_Residual + p.cuBLOCK_4x4.calcresidualNONALIGNED = PFX(getResidual4_neon); + p.cuBLOCK_8x8.calcresidualNONALIGNED = PFX(getResidual8_neon); + p.cuBLOCK_16x16.calcresidualNONALIGNED = PFX(getResidual16_neon); + p.cuBLOCK_32x32.calcresidualNONALIGNED = PFX(getResidual32_neon); + + p.cuBLOCK_4x4.calcresidualALIGNED = PFX(getResidual4_neon); + p.cuBLOCK_8x8.calcresidualALIGNED = PFX(getResidual8_neon); + p.cuBLOCK_16x16.calcresidualALIGNED = PFX(getResidual16_neon); + p.cuBLOCK_32x32.calcresidualALIGNED = PFX(getResidual32_neon); + + // pixel_sub_ps + p.cuBLOCK_4x4.sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.cuBLOCK_8x8.sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.cuBLOCK_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.cuBLOCK_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon); + p.cuBLOCK_64x64.sub_ps = PFX(pixel_sub_ps_64x64_neon); + + // chroma sub_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sub_ps = PFX(pixel_sub_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sub_ps = PFX(pixel_sub_ps_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sub_ps = PFX(pixel_sub_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sub_ps = PFX(pixel_sub_ps_32x64_neon); + + // pixel_add_ps + p.cuBLOCK_4x4.add_psNONALIGNED = PFX(pixel_add_ps_4x4_neon); + p.cuBLOCK_8x8.add_psNONALIGNED = PFX(pixel_add_ps_8x8_neon); + p.cuBLOCK_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon); + p.cuBLOCK_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon); + p.cuBLOCK_64x64.add_psNONALIGNED = PFX(pixel_add_ps_64x64_neon); + + p.cuBLOCK_4x4.add_psALIGNED = PFX(pixel_add_ps_4x4_neon); + p.cuBLOCK_8x8.add_psALIGNED = PFX(pixel_add_ps_8x8_neon); + p.cuBLOCK_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon); + p.cuBLOCK_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon); + p.cuBLOCK_64x64.add_psALIGNED = PFX(pixel_add_ps_64x64_neon); + + // chroma add_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psNONALIGNED = PFX(pixel_add_ps_4x4_neon); + 
p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psNONALIGNED = PFX(pixel_add_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psNONALIGNED = PFX(pixel_add_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psNONALIGNED = PFX(pixel_add_ps_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psNONALIGNED = PFX(pixel_add_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psNONALIGNED = PFX(pixel_add_ps_32x64_neon); + + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psALIGNED = PFX(pixel_add_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psALIGNED = PFX(pixel_add_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psALIGNED = PFX(pixel_add_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psALIGNED = PFX(pixel_add_ps_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psALIGNED = PFX(pixel_add_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psALIGNED = PFX(pixel_add_ps_32x64_neon); + + //scale2D_64to32 + p.scale2D_64to32 = PFX(scale2D_64to32_neon); + + // scale1D_128to64 + p.scale1D_128to64NONALIGNED = PFX(scale1D_128to64_neon); + p.scale1D_128to64ALIGNED = PFX(scale1D_128to64_neon); + + // planecopy + p.planecopy_cp = PFX(pixel_planecopy_cp_neon); + + // satd + ALL_LUMA_PU(satd, pixel_satd, neon); + + p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd = PFX(pixel_satd_4x4_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd = PFX(pixel_satd_8x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = PFX(pixel_satd_16x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = PFX(pixel_satd_32x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd = PFX(pixel_satd_8x4_neon); + p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd = PFX(pixel_satd_4x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd = PFX(pixel_satd_16x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd = PFX(pixel_satd_8x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = PFX(pixel_satd_32x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = PFX(pixel_satd_16x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = PFX(pixel_satd_16x12_neon); + p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = PFX(pixel_satd_12x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd = PFX(pixel_satd_16x4_neon); + p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd = PFX(pixel_satd_4x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = PFX(pixel_satd_32x24_neon); + p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = PFX(pixel_satd_24x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd = PFX(pixel_satd_32x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd = PFX(pixel_satd_8x32_neon); + + p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd = PFX(pixel_satd_4x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd = PFX(pixel_satd_8x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = PFX(pixel_satd_16x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = PFX(pixel_satd_32x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd = PFX(pixel_satd_4x4_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd = PFX(pixel_satd_8x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd = PFX(pixel_satd_4x16_neon); + 
p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = PFX(pixel_satd_16x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd = PFX(pixel_satd_8x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = PFX(pixel_satd_32x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = PFX(pixel_satd_16x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd = PFX(pixel_satd_8x12_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd = PFX(pixel_satd_8x4_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = PFX(pixel_satd_16x24_neon); + p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = PFX(pixel_satd_12x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd = PFX(pixel_satd_16x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd = PFX(pixel_satd_4x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = PFX(pixel_satd_32x48_neon); + p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = PFX(pixel_satd_24x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = PFX(pixel_satd_32x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd = PFX(pixel_satd_8x64_neon); + + // sa8d + p.cuBLOCK_4x4.sa8d = PFX(pixel_satd_4x4_neon); + p.cuBLOCK_8x8.sa8d = PFX(pixel_sa8d_8x8_neon); + p.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon); + p.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon); + p.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon); + p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d = PFX(pixel_satd_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon); + p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sa8d = PFX(pixel_sa8d_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sa8d = PFX(pixel_sa8d_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sa8d = PFX(pixel_sa8d_32x64_neon); + + // dequant_scaling + p.dequant_scaling = PFX(dequant_scaling_neon); + p.dequant_normal = PFX(dequant_normal_neon); + + // ssim_4x4x2_core + p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_neon); + + // ssimDist + p.cuBLOCK_4x4.ssimDist = PFX(ssimDist4_neon); + p.cuBLOCK_8x8.ssimDist = PFX(ssimDist8_neon); + p.cuBLOCK_16x16.ssimDist = PFX(ssimDist16_neon); + p.cuBLOCK_32x32.ssimDist = PFX(ssimDist32_neon); + p.cuBLOCK_64x64.ssimDist = PFX(ssimDist64_neon); + + // normFact + p.cuBLOCK_8x8.normFact = PFX(normFact8_neon); + p.cuBLOCK_16x16.normFact = PFX(normFact16_neon); + p.cuBLOCK_32x32.normFact = PFX(normFact32_neon); + p.cuBLOCK_64x64.normFact = PFX(normFact64_neon); + + // psy_cost_pp + p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon); + + p.weight_pp = PFX(weight_pp_neon); +#if !defined(__APPLE__) + p.scanPosLast = PFX(scanPosLast_neon); #endif + p.costCoeffNxN = PFX(costCoeffNxN_neon); #endif - } -} + // quant + p.quant = PFX(quant_neon); + p.nquant = PFX(nquant_neon); +} -void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) +#if defined(HAVE_SVE2) || defined(HAVE_SVE) +void setupSvePrimitives(EncoderPrimitives &p) { - if (cpuMask & X265_CPU_NEON) - { - p.puLUMA_4x4.satd = PFX(pixel_satd_4x4_neon); - p.puLUMA_4x8.satd = PFX(pixel_satd_4x8_neon); - p.puLUMA_4x16.satd = PFX(pixel_satd_4x16_neon); - p.puLUMA_8x4.satd = PFX(pixel_satd_8x4_neon); - p.puLUMA_8x8.satd = PFX(pixel_satd_8x8_neon); - p.puLUMA_12x16.satd = PFX(pixel_satd_12x16_neon); - - p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd = PFX(pixel_satd_4x4_neon); - p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd = PFX(pixel_satd_4x8_neon); - p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd = 
PFX(pixel_satd_4x16_neon); - p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd = PFX(pixel_satd_8x4_neon); - p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd = PFX(pixel_satd_8x8_neon); - p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = PFX(pixel_satd_12x16_neon); - - p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd = PFX(pixel_satd_4x4_neon); - p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd = PFX(pixel_satd_4x8_neon); - p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd = PFX(pixel_satd_4x16_neon); - p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd = PFX(pixel_satd_4x32_neon); - p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd = PFX(pixel_satd_8x4_neon); - p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd = PFX(pixel_satd_8x8_neon); - p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = PFX(pixel_satd_12x32_neon); - - p.puLUMA_4x4.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_4x4_neon); - p.puLUMA_4x8.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_4x8_neon); - p.puLUMA_4x16.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_4x16_neon); - p.puLUMA_8x4.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_8x4_neon); - p.puLUMA_8x8.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_8x8_neon); - p.puLUMA_8x16.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_8x16_neon); - p.puLUMA_8x32.pixelavg_ppNONALIGNED = PFX(pixel_avg_pp_8x32_neon); - - p.puLUMA_4x4.pixelavg_ppALIGNED = PFX(pixel_avg_pp_4x4_neon); - p.puLUMA_4x8.pixelavg_ppALIGNED = PFX(pixel_avg_pp_4x8_neon); - p.puLUMA_4x16.pixelavg_ppALIGNED = PFX(pixel_avg_pp_4x16_neon); - p.puLUMA_8x4.pixelavg_ppALIGNED = PFX(pixel_avg_pp_8x4_neon); - p.puLUMA_8x8.pixelavg_ppALIGNED = PFX(pixel_avg_pp_8x8_neon); - p.puLUMA_8x16.pixelavg_ppALIGNED = PFX(pixel_avg_pp_8x16_neon); - p.puLUMA_8x32.pixelavg_ppALIGNED = PFX(pixel_avg_pp_8x32_neon); - - p.puLUMA_8x4.sad_x3 = PFX(sad_x3_8x4_neon); - p.puLUMA_8x8.sad_x3 = PFX(sad_x3_8x8_neon); - p.puLUMA_8x16.sad_x3 = PFX(sad_x3_8x16_neon); - p.puLUMA_8x32.sad_x3 = PFX(sad_x3_8x32_neon); - - p.puLUMA_8x4.sad_x4 = PFX(sad_x4_8x4_neon); - p.puLUMA_8x8.sad_x4 = PFX(sad_x4_8x8_neon); - p.puLUMA_8x16.sad_x4 = PFX(sad_x4_8x16_neon); - p.puLUMA_8x32.sad_x4 = PFX(sad_x4_8x32_neon); - - // quant - p.quant = PFX(quant_neon); - // luma_hps - p.puLUMA_4x4.luma_hps = PFX(interp_8tap_horiz_ps_4x4_neon); - p.puLUMA_4x8.luma_hps = PFX(interp_8tap_horiz_ps_4x8_neon); - p.puLUMA_4x16.luma_hps = PFX(interp_8tap_horiz_ps_4x16_neon); - p.puLUMA_8x4.luma_hps = PFX(interp_8tap_horiz_ps_8x4_neon); - p.puLUMA_8x8.luma_hps = PFX(interp_8tap_horiz_ps_8x8_neon); - p.puLUMA_8x16.luma_hps = PFX(interp_8tap_horiz_ps_8x16_neon); - p.puLUMA_8x32.luma_hps = PFX(interp_8tap_horiz_ps_8x32_neon); - p.puLUMA_12x16.luma_hps = PFX(interp_8tap_horiz_ps_12x16_neon); - p.puLUMA_24x32.luma_hps = PFX(interp_8tap_horiz_ps_24x32_neon); -#if !AUTO_VECTORIZE || GCC_VERSION < GCC_5_1_0 /* gcc_version < gcc-5.1.0 */ - p.puLUMA_16x4.luma_hps = PFX(interp_8tap_horiz_ps_16x4_neon); - p.puLUMA_16x8.luma_hps = PFX(interp_8tap_horiz_ps_16x8_neon); - p.puLUMA_16x12.luma_hps = PFX(interp_8tap_horiz_ps_16x12_neon); - p.puLUMA_16x16.luma_hps = PFX(interp_8tap_horiz_ps_16x16_neon); - p.puLUMA_16x32.luma_hps = PFX(interp_8tap_horiz_ps_16x32_neon); - p.puLUMA_16x64.luma_hps = PFX(interp_8tap_horiz_ps_16x64_neon); - p.puLUMA_32x8.luma_hps = PFX(interp_8tap_horiz_ps_32x8_neon); - p.puLUMA_32x16.luma_hps = PFX(interp_8tap_horiz_ps_32x16_neon); - p.puLUMA_32x24.luma_hps = PFX(interp_8tap_horiz_ps_32x24_neon); - p.puLUMA_32x32.luma_hps = PFX(interp_8tap_horiz_ps_32x32_neon); - p.puLUMA_32x64.luma_hps = PFX(interp_8tap_horiz_ps_32x64_neon); - p.puLUMA_48x64.luma_hps = 
PFX(interp_8tap_horiz_ps_48x64_neon); - p.puLUMA_64x16.luma_hps = PFX(interp_8tap_horiz_ps_64x16_neon); - p.puLUMA_64x32.luma_hps = PFX(interp_8tap_horiz_ps_64x32_neon); - p.puLUMA_64x48.luma_hps = PFX(interp_8tap_horiz_ps_64x48_neon); - p.puLUMA_64x64.luma_hps = PFX(interp_8tap_horiz_ps_64x64_neon); -#endif - - p.puLUMA_8x4.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_8x4>; - p.puLUMA_8x8.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_8x8>; - p.puLUMA_8x16.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_8x16>; - p.puLUMA_8x32.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_8x32>; - p.puLUMA_12x16.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_12x16>; -#if !AUTO_VECTORIZE || GCC_VERSION < GCC_5_1_0 /* gcc_version < gcc-5.1.0 */ - p.puLUMA_16x4.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x4>; - p.puLUMA_16x8.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x8>; - p.puLUMA_16x12.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x12>; - p.puLUMA_16x16.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x16>; - p.puLUMA_16x32.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x32>; - p.puLUMA_16x64.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x64>; - p.puLUMA_32x16.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x16>; - p.puLUMA_32x24.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x24>; - p.puLUMA_32x32.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x32>; - p.puLUMA_32x64.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x64>; - p.puLUMA_48x64.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_48x64>; - p.puLUMA_64x16.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x16>; - p.puLUMA_64x32.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x32>; - p.puLUMA_64x48.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x48>; - p.puLUMA_64x64.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x64>; -#if !AUTO_VECTORIZE || GCC_VERSION < GCC_4_9_0 /* gcc_version < gcc-4.9.0 */ - p.puLUMA_4x4.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; - p.puLUMA_4x8.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x8>; - p.puLUMA_4x16.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x16>; - p.puLUMA_24x32.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_24x32>; - p.puLUMA_32x8.luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x8>; + // When these primitives will use SVE/SVE2 instructions set, + // change the following definitions to point to the SVE/SVE2 implementation + setupPixelPrimitives_neon(p); + setupFilterPrimitives_neon(p); + setupDCTPrimitives_neon(p); + setupLoopFilterPrimitives_neon(p); + setupIntraPrimitives_neon(p); + + CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sNONALIGNED); + CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED); + LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED); + CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sALIGNED); + CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED); + LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED); + +#if !HIGH_BIT_DEPTH + ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, neon); + ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, neon); + ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, neon); + ALL_LUMA_PU(luma_hpp, interp_horiz_pp, neon); + ALL_LUMA_PU(luma_hps, interp_horiz_ps, neon); + 
ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, neon); + ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); + ALL_CHROMA_420_VERT_FILTERS(neon); + CHROMA_422_VERT_FILTERS_NEON(); + CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(neon); + ALL_CHROMA_444_VERT_FILTERS(neon); + ALL_CHROMA_420_FILTERS(neon); + ALL_CHROMA_422_FILTERS(neon); + ALL_CHROMA_444_FILTERS(neon); + + + // Blockcopy_pp + LUMA_PU_NEON_1(copy_pp, blockcopy_pp); + LUMA_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp); + CHROMA_420_PU_NEON_1(copy_pp, blockcopy_pp); + CHROMA_420_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp); + CHROMA_422_PU_NEON_1(copy_pp, blockcopy_pp); + CHROMA_422_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp); + p.cuBLOCK_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon); + p.cuBLOCK_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon); + p.cuBLOCK_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon); + p.cuBLOCK_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve); + p.cuBLOCK_64x64.copy_pp = PFX(blockcopy_pp_64x64_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_pp = PFX(blockcopy_pp_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_pp = PFX(blockcopy_pp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_pp = PFX(blockcopy_pp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_pp = PFX(blockcopy_pp_32x64_sve); + +#endif // !HIGH_BIT_DEPTH + + // Blockcopy_ss + p.cuBLOCK_4x4.copy_ss = PFX(blockcopy_ss_4x4_neon); + p.cuBLOCK_8x8.copy_ss = PFX(blockcopy_ss_8x8_neon); + p.cuBLOCK_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve); + p.cuBLOCK_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve); + p.cuBLOCK_64x64.copy_ss = PFX(blockcopy_ss_64x64_sve); + + // Blockcopy_ps + p.cuBLOCK_4x4.copy_ps = PFX(blockcopy_ps_4x4_neon); + p.cuBLOCK_8x8.copy_ps = PFX(blockcopy_ps_8x8_neon); + p.cuBLOCK_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve); + p.cuBLOCK_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve); + p.cuBLOCK_64x64.copy_ps = PFX(blockcopy_ps_64x64_sve); + + // Blockcopy_sp + p.cuBLOCK_4x4.copy_sp = PFX(blockcopy_sp_4x4_sve); + p.cuBLOCK_8x8.copy_sp = PFX(blockcopy_sp_8x8_sve); + p.cuBLOCK_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve); + p.cuBLOCK_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve); + p.cuBLOCK_64x64.copy_sp = PFX(blockcopy_sp_64x64_neon); + + // chroma blockcopy_ss + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ss = PFX(blockcopy_ss_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ss = PFX(blockcopy_ss_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ss = PFX(blockcopy_ss_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ss = PFX(blockcopy_ss_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ss = PFX(blockcopy_ss_16x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ss = PFX(blockcopy_ss_32x64_sve); + + // chroma blockcopy_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ps = PFX(blockcopy_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ps = PFX(blockcopy_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve); + 
p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ps = PFX(blockcopy_ps_4x8_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ps = PFX(blockcopy_ps_8x16_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ps = PFX(blockcopy_ps_16x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ps = PFX(blockcopy_ps_32x64_sve); + + // chroma blockcopy_sp + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_sp = PFX(blockcopy_sp_4x4_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_sp = PFX(blockcopy_sp_8x8_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_sp = PFX(blockcopy_sp_4x8_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_sp = PFX(blockcopy_sp_8x16_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_sp = PFX(blockcopy_sp_16x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_sp = PFX(blockcopy_sp_32x64_sve); + + // Block_fill + LUMA_TU_NEON(blockfill_sALIGNED, blockfill_s); + LUMA_TU_CAN_USE_SVE(blockfill_sALIGNED, blockfill_s); + LUMA_TU_NEON(blockfill_sNONALIGNED, blockfill_s); + LUMA_TU_CAN_USE_SVE(blockfill_sNONALIGNED, blockfill_s); + + // copy_count + p.cuBLOCK_4x4.copy_cnt = PFX(copy_cnt_4_neon); + p.cuBLOCK_8x8.copy_cnt = PFX(copy_cnt_8_neon); + p.cuBLOCK_16x16.copy_cnt = PFX(copy_cnt_16_neon); + p.cuBLOCK_32x32.copy_cnt = PFX(copy_cnt_32_neon); + + // count nonzero + p.cuBLOCK_4x4.count_nonzero = PFX(count_nonzero_4_neon); + p.cuBLOCK_8x8.count_nonzero = PFX(count_nonzero_8_neon); + p.cuBLOCK_16x16.count_nonzero = PFX(count_nonzero_16_neon); + p.cuBLOCK_32x32.count_nonzero = PFX(count_nonzero_32_neon); + + // cpy2Dto1D_shl + p.cuBLOCK_4x4.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16x16_sve); + p.cuBLOCK_32x32.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32x32_sve); + p.cuBLOCK_64x64.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_64x64_sve); + + // cpy2Dto1D_shr + p.cuBLOCK_4x4.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_4x4_neon); + p.cuBLOCK_8x8.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_8x8_neon); + p.cuBLOCK_16x16.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_sve); + p.cuBLOCK_32x32.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_sve); + + // cpy1Dto2D_shl + p.cuBLOCK_4x4.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_16x16_sve); + p.cuBLOCK_32x32.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_32x32_sve); + p.cuBLOCK_64x64.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_64x64_sve); + + p.cuBLOCK_4x4.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_16x16_sve); + p.cuBLOCK_32x32.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_32x32_sve); + p.cuBLOCK_64x64.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_64x64_sve); + + // cpy1Dto2D_shr + p.cuBLOCK_4x4.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16x16_sve); + p.cuBLOCK_32x32.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32x32_sve); + p.cuBLOCK_64x64.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_64x64_sve); + +#if !HIGH_BIT_DEPTH + // pixel_avg_pp + ALL_LUMA_PU(pixelavg_ppNONALIGNED, pixel_avg_pp, neon); + 
ALL_LUMA_PU(pixelavg_ppALIGNED, pixel_avg_pp, neon); + + // addAvg + ALL_LUMA_PU(addAvgNONALIGNED, addAvg, neon); + ALL_LUMA_PU(addAvgALIGNED, addAvg, neon); + ALL_CHROMA_420_PU(addAvgNONALIGNED, addAvg, neon); + ALL_CHROMA_422_PU(addAvgNONALIGNED, addAvg, neon); + ALL_CHROMA_420_PU(addAvgALIGNED, addAvg, neon); + ALL_CHROMA_422_PU(addAvgALIGNED, addAvg, neon); + + // sad + ALL_LUMA_PU(sad, pixel_sad, neon); + ALL_LUMA_PU(sad_x3, sad_x3, neon); + ALL_LUMA_PU(sad_x4, sad_x4, neon); + + // sse_pp + p.cuBLOCK_4x4.sse_pp = PFX(pixel_sse_pp_4x4_sve); + p.cuBLOCK_8x8.sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.cuBLOCK_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.cuBLOCK_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon); + p.cuBLOCK_64x64.sse_pp = PFX(pixel_sse_pp_64x64_neon); + + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sse_pp = PFX(pixel_sse_pp_4x4_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sse_pp = PFX(pixel_sse_pp_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sse_pp = PFX(pixel_sse_pp_4x8_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sse_pp = PFX(pixel_sse_pp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sse_pp = PFX(pixel_sse_pp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sse_pp = PFX(pixel_sse_pp_32x64_neon); + + // sse_ss + p.cuBLOCK_4x4.sse_ss = PFX(pixel_sse_ss_4x4_neon); + p.cuBLOCK_8x8.sse_ss = PFX(pixel_sse_ss_8x8_neon); + p.cuBLOCK_16x16.sse_ss = PFX(pixel_sse_ss_16x16_neon); + p.cuBLOCK_32x32.sse_ss = PFX(pixel_sse_ss_32x32_neon); + p.cuBLOCK_64x64.sse_ss = PFX(pixel_sse_ss_64x64_neon); + + // ssd_s + p.cuBLOCK_4x4.ssd_sNONALIGNED = PFX(pixel_ssd_s_4x4_neon); + p.cuBLOCK_8x8.ssd_sNONALIGNED = PFX(pixel_ssd_s_8x8_neon); + p.cuBLOCK_16x16.ssd_sNONALIGNED = PFX(pixel_ssd_s_16x16_neon); + p.cuBLOCK_32x32.ssd_sNONALIGNED = PFX(pixel_ssd_s_32x32_neon); + + p.cuBLOCK_4x4.ssd_sALIGNED = PFX(pixel_ssd_s_4x4_neon); + p.cuBLOCK_8x8.ssd_sALIGNED = PFX(pixel_ssd_s_8x8_neon); + p.cuBLOCK_16x16.ssd_sALIGNED = PFX(pixel_ssd_s_16x16_neon); + p.cuBLOCK_32x32.ssd_sALIGNED = PFX(pixel_ssd_s_32x32_neon); + + // pixel_var + p.cuBLOCK_8x8.var = PFX(pixel_var_8x8_neon); + p.cuBLOCK_16x16.var = PFX(pixel_var_16x16_neon); + p.cuBLOCK_32x32.var = PFX(pixel_var_32x32_neon); + p.cuBLOCK_64x64.var = PFX(pixel_var_64x64_neon); + + // calc_Residual + p.cuBLOCK_4x4.calcresidualNONALIGNED = PFX(getResidual4_neon); + p.cuBLOCK_8x8.calcresidualNONALIGNED = PFX(getResidual8_neon); + p.cuBLOCK_16x16.calcresidualNONALIGNED = PFX(getResidual16_neon); + p.cuBLOCK_32x32.calcresidualNONALIGNED = PFX(getResidual32_neon); + + p.cuBLOCK_4x4.calcresidualALIGNED = PFX(getResidual4_neon); + p.cuBLOCK_8x8.calcresidualALIGNED = PFX(getResidual8_neon); + p.cuBLOCK_16x16.calcresidualALIGNED = PFX(getResidual16_neon); + p.cuBLOCK_32x32.calcresidualALIGNED = PFX(getResidual32_neon); + + // pixel_sub_ps + p.cuBLOCK_4x4.sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.cuBLOCK_8x8.sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.cuBLOCK_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.cuBLOCK_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon); + p.cuBLOCK_64x64.sub_ps = PFX(pixel_sub_ps_64x64_neon); + + // chroma sub_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon); + 
p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sub_ps = PFX(pixel_sub_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sub_ps = PFX(pixel_sub_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sub_ps = PFX(pixel_sub_ps_8x16_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sub_ps = PFX(pixel_sub_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sub_ps = PFX(pixel_sub_ps_32x64_neon); + + // pixel_add_ps + p.cuBLOCK_4x4.add_psNONALIGNED = PFX(pixel_add_ps_4x4_neon); + p.cuBLOCK_8x8.add_psNONALIGNED = PFX(pixel_add_ps_8x8_neon); + p.cuBLOCK_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon); + p.cuBLOCK_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon); + p.cuBLOCK_64x64.add_psNONALIGNED = PFX(pixel_add_ps_64x64_neon); + + p.cuBLOCK_4x4.add_psALIGNED = PFX(pixel_add_ps_4x4_neon); + p.cuBLOCK_8x8.add_psALIGNED = PFX(pixel_add_ps_8x8_neon); + p.cuBLOCK_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon); + p.cuBLOCK_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon); + p.cuBLOCK_64x64.add_psALIGNED = PFX(pixel_add_ps_64x64_neon); + + // chroma add_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psNONALIGNED = PFX(pixel_add_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psNONALIGNED = PFX(pixel_add_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psNONALIGNED = PFX(pixel_add_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psNONALIGNED = PFX(pixel_add_ps_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psNONALIGNED = PFX(pixel_add_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psNONALIGNED = PFX(pixel_add_ps_32x64_neon); + + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psALIGNED = PFX(pixel_add_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psALIGNED = PFX(pixel_add_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psALIGNED = PFX(pixel_add_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psALIGNED = PFX(pixel_add_ps_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psALIGNED = PFX(pixel_add_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psALIGNED = PFX(pixel_add_ps_32x64_neon); + + //scale2D_64to32 + p.scale2D_64to32 = PFX(scale2D_64to32_neon); + + // scale1D_128to64 + p.scale1D_128to64NONALIGNED = PFX(scale1D_128to64_neon); + p.scale1D_128to64ALIGNED = PFX(scale1D_128to64_neon); + + // planecopy + p.planecopy_cp = PFX(pixel_planecopy_cp_neon); + + // satd + p.puLUMA_4x4.satd = PFX(pixel_satd_4x4_sve); + p.puLUMA_8x8.satd = PFX(pixel_satd_8x8_neon); + p.puLUMA_16x16.satd = PFX(pixel_satd_16x16_neon); + p.puLUMA_32x32.satd = PFX(pixel_satd_32x32_sve); + p.puLUMA_64x64.satd = PFX(pixel_satd_64x64_neon); + p.puLUMA_8x4.satd = PFX(pixel_satd_8x4_sve); + p.puLUMA_4x8.satd = PFX(pixel_satd_4x8_neon); + p.puLUMA_16x8.satd = PFX(pixel_satd_16x8_neon); + p.puLUMA_8x16.satd = PFX(pixel_satd_8x16_neon); + p.puLUMA_16x32.satd = PFX(pixel_satd_16x32_neon); + p.puLUMA_32x16.satd = PFX(pixel_satd_32x16_sve); + p.puLUMA_64x32.satd = PFX(pixel_satd_64x32_neon); + p.puLUMA_32x64.satd = PFX(pixel_satd_32x64_neon); + p.puLUMA_16x12.satd = PFX(pixel_satd_16x12_neon); + p.puLUMA_12x16.satd = PFX(pixel_satd_12x16_neon); + p.puLUMA_16x4.satd = 
PFX(pixel_satd_16x4_neon); + p.puLUMA_4x16.satd = PFX(pixel_satd_4x16_neon); + p.puLUMA_32x24.satd = PFX(pixel_satd_32x24_neon); + p.puLUMA_24x32.satd = PFX(pixel_satd_24x32_neon); + p.puLUMA_32x8.satd = PFX(pixel_satd_32x8_neon); + p.puLUMA_8x32.satd = PFX(pixel_satd_8x32_neon); + p.puLUMA_64x48.satd = PFX(pixel_satd_64x48_sve); + p.puLUMA_48x64.satd = PFX(pixel_satd_48x64_neon); + p.puLUMA_64x16.satd = PFX(pixel_satd_64x16_neon); + p.puLUMA_16x64.satd = PFX(pixel_satd_16x64_neon); + + p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd = PFX(pixel_satd_4x4_sve); + p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd = PFX(pixel_satd_8x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = PFX(pixel_satd_16x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = PFX(pixel_satd_32x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd = PFX(pixel_satd_8x4_sve); + p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd = PFX(pixel_satd_4x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd = PFX(pixel_satd_16x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd = PFX(pixel_satd_8x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = PFX(pixel_satd_32x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = PFX(pixel_satd_16x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = PFX(pixel_satd_16x12_neon); + p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = PFX(pixel_satd_12x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd = PFX(pixel_satd_16x4_neon); + p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd = PFX(pixel_satd_4x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = PFX(pixel_satd_32x24_neon); + p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = PFX(pixel_satd_24x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd = PFX(pixel_satd_32x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd = PFX(pixel_satd_8x32_neon); + + p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd = PFX(pixel_satd_4x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd = PFX(pixel_satd_8x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = PFX(pixel_satd_16x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = PFX(pixel_satd_32x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd = PFX(pixel_satd_4x4_sve); + p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd = PFX(pixel_satd_8x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd = PFX(pixel_satd_4x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = PFX(pixel_satd_16x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd = PFX(pixel_satd_8x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = PFX(pixel_satd_32x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = PFX(pixel_satd_16x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd = PFX(pixel_satd_8x12_sve); + p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd = PFX(pixel_satd_8x4_sve); + p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = PFX(pixel_satd_16x24_neon); + p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = PFX(pixel_satd_12x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd = PFX(pixel_satd_16x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd = PFX(pixel_satd_4x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = PFX(pixel_satd_32x48_neon); + p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = PFX(pixel_satd_24x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = PFX(pixel_satd_32x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd = PFX(pixel_satd_8x64_neon); + + // sa8d + p.cuBLOCK_4x4.sa8d = PFX(pixel_satd_4x4_sve); + p.cuBLOCK_8x8.sa8d = 
PFX(pixel_sa8d_8x8_neon); + p.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon); + p.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon); + p.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon); + p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d = PFX(pixel_satd_4x4_sve); + p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon); + p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sa8d = PFX(pixel_sa8d_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sa8d = PFX(pixel_sa8d_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sa8d = PFX(pixel_sa8d_32x64_neon); + + // dequant_scaling + p.dequant_scaling = PFX(dequant_scaling_neon); + p.dequant_normal = PFX(dequant_normal_neon); + + // ssim_4x4x2_core + p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_neon); + + // ssimDist + p.cuBLOCK_4x4.ssimDist = PFX(ssimDist4_neon); + p.cuBLOCK_8x8.ssimDist = PFX(ssimDist8_neon); + p.cuBLOCK_16x16.ssimDist = PFX(ssimDist16_neon); + p.cuBLOCK_32x32.ssimDist = PFX(ssimDist32_neon); + p.cuBLOCK_64x64.ssimDist = PFX(ssimDist64_neon); + + // normFact + p.cuBLOCK_8x8.normFact = PFX(normFact8_neon); + p.cuBLOCK_16x16.normFact = PFX(normFact16_neon); + p.cuBLOCK_32x32.normFact = PFX(normFact32_neon); + p.cuBLOCK_64x64.normFact = PFX(normFact64_neon); + + // psy_cost_pp + p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon); + + p.weight_pp = PFX(weight_pp_neon); +#if !defined(__APPLE__) + p.scanPosLast = PFX(scanPosLast_neon); +#endif + p.costCoeffNxN = PFX(costCoeffNxN_neon); #endif + + // quant + p.quant = PFX(quant_sve); + p.nquant = PFX(nquant_neon); +} #endif +#if defined(HAVE_SVE2) +void setupSve2Primitives(EncoderPrimitives &p) +{ + // When these primitives will use SVE/SVE2 instructions set, + // change the following definitions to point to the SVE/SVE2 implementation + setupPixelPrimitives_neon(p); + setupFilterPrimitives_neon(p); + setupDCTPrimitives_neon(p); + setupLoopFilterPrimitives_neon(p); + setupIntraPrimitives_neon(p); + + CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sNONALIGNED); + CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED); + LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sALIGNED); + CHROMA_420_PU_FILTER_PIXEL_TO_SHORT_NEON(p2sALIGNED); + CHROMA_420_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sALIGNED); + CHROMA_422_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_422_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_444_PU_NEON_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + CHROMA_444_PU_SVE_FILTER_PIXEL_TO_SHORT(p2sNONALIGNED); + LUMA_PU_NEON_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED); + LUMA_PU_SVE_FILTER_PIXEL_TO_SHORT(convert_p2sNONALIGNED); + #if !HIGH_BIT_DEPTH - p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon); + LUMA_PU_MULTIPLE_ARCHS_1(luma_vpp, interp_8tap_vert_pp, neon); + LUMA_PU_MULTIPLE_ARCHS_2(luma_vpp, interp_8tap_vert_pp, sve2); + LUMA_PU_MULTIPLE_ARCHS_1(luma_vsp, interp_8tap_vert_sp, sve2); + LUMA_PU_MULTIPLE_ARCHS_2(luma_vsp, interp_8tap_vert_sp, neon); + ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sve2); + ALL_LUMA_PU(luma_hpp, interp_horiz_pp, neon); + ALL_LUMA_PU(luma_hps, interp_horiz_ps, neon); + ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, sve2); + ALL_LUMA_PU_T(luma_hvpp, 
interp_8tap_hv_pp_cpu); + CHROMA_420_VERT_FILTERS_NEON(); + CHROMA_420_VERT_FILTERS_CAN_USE_SVE2(); + CHROMA_422_VERT_FILTERS_NEON(); + CHROMA_422_VERT_FILTERS_CAN_USE_SVE2(sve2); + CHROMA_444_VERT_FILTERS_NEON(); + CHROMA_444_VERT_FILTERS_CAN_USE_SVE2(); + CHROMA_420_FILTERS_NEON(); + CHROMA_420_FILTERS_CAN_USE_SVE2(); + CHROMA_422_FILTERS_NEON(); + CHROMA_422_FILTERS_CAN_USE_SVE2(); + CHROMA_444_FILTERS_NEON(); + CHROMA_444_FILTERS_CAN_USE_SVE2(); + + // Blockcopy_pp + LUMA_PU_NEON_1(copy_pp, blockcopy_pp); + LUMA_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp); + CHROMA_420_PU_NEON_1(copy_pp, blockcopy_pp); + CHROMA_420_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp); + CHROMA_422_PU_NEON_1(copy_pp, blockcopy_pp); + CHROMA_422_PU_CAN_USE_SVE_EXCEPT_FILTER_PIXEL_TO_SHORT(copy_pp, blockcopy_pp); + p.cuBLOCK_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon); + p.cuBLOCK_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon); + p.cuBLOCK_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon); + p.cuBLOCK_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve); + p.cuBLOCK_64x64.copy_pp = PFX(blockcopy_pp_64x64_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_pp = PFX(blockcopy_pp_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_pp = PFX(blockcopy_pp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_pp = PFX(blockcopy_pp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_pp = PFX(blockcopy_pp_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_pp = PFX(blockcopy_pp_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_pp = PFX(blockcopy_pp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_pp = PFX(blockcopy_pp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_pp = PFX(blockcopy_pp_32x64_sve); + #endif // !HIGH_BIT_DEPTH + // Blockcopy_ss + p.cuBLOCK_4x4.copy_ss = PFX(blockcopy_ss_4x4_neon); + p.cuBLOCK_8x8.copy_ss = PFX(blockcopy_ss_8x8_neon); + p.cuBLOCK_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve); + p.cuBLOCK_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve); + p.cuBLOCK_64x64.copy_ss = PFX(blockcopy_ss_64x64_sve); + + // Blockcopy_ps + p.cuBLOCK_4x4.copy_ps = PFX(blockcopy_ps_4x4_neon); + p.cuBLOCK_8x8.copy_ps = PFX(blockcopy_ps_8x8_neon); + p.cuBLOCK_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve); + p.cuBLOCK_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve); + p.cuBLOCK_64x64.copy_ps = PFX(blockcopy_ps_64x64_sve); + + // Blockcopy_sp + p.cuBLOCK_4x4.copy_sp = PFX(blockcopy_sp_4x4_sve); + p.cuBLOCK_8x8.copy_sp = PFX(blockcopy_sp_8x8_sve); + p.cuBLOCK_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve); + p.cuBLOCK_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve); + p.cuBLOCK_64x64.copy_sp = PFX(blockcopy_sp_64x64_neon); + + // chroma blockcopy_ss + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ss = PFX(blockcopy_ss_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ss = PFX(blockcopy_ss_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ss = PFX(blockcopy_ss_16x16_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ss = PFX(blockcopy_ss_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ss = PFX(blockcopy_ss_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ss = PFX(blockcopy_ss_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ss = PFX(blockcopy_ss_16x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ss = PFX(blockcopy_ss_32x64_sve); + + // chroma blockcopy_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_ps = PFX(blockcopy_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_ps = PFX(blockcopy_ps_8x8_neon); + 
p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_ps = PFX(blockcopy_ps_16x16_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_ps = PFX(blockcopy_ps_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_ps = PFX(blockcopy_ps_4x8_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_ps = PFX(blockcopy_ps_8x16_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_ps = PFX(blockcopy_ps_16x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_ps = PFX(blockcopy_ps_32x64_sve); + + // chroma blockcopy_sp + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.copy_sp = PFX(blockcopy_sp_4x4_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.copy_sp = PFX(blockcopy_sp_8x8_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.copy_sp = PFX(blockcopy_sp_16x16_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.copy_sp = PFX(blockcopy_sp_32x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.copy_sp = PFX(blockcopy_sp_4x8_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.copy_sp = PFX(blockcopy_sp_8x16_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.copy_sp = PFX(blockcopy_sp_16x32_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.copy_sp = PFX(blockcopy_sp_32x64_sve); + + // Block_fill + LUMA_TU_NEON(blockfill_sALIGNED, blockfill_s); + LUMA_TU_CAN_USE_SVE(blockfill_sALIGNED, blockfill_s); + LUMA_TU_NEON(blockfill_sNONALIGNED, blockfill_s); + LUMA_TU_CAN_USE_SVE(blockfill_sNONALIGNED, blockfill_s); + + // copy_count + p.cuBLOCK_4x4.copy_cnt = PFX(copy_cnt_4_neon); + p.cuBLOCK_8x8.copy_cnt = PFX(copy_cnt_8_neon); + p.cuBLOCK_16x16.copy_cnt = PFX(copy_cnt_16_neon); + p.cuBLOCK_32x32.copy_cnt = PFX(copy_cnt_32_neon); + + // count nonzero + p.cuBLOCK_4x4.count_nonzero = PFX(count_nonzero_4_neon); + p.cuBLOCK_8x8.count_nonzero = PFX(count_nonzero_8_neon); + p.cuBLOCK_16x16.count_nonzero = PFX(count_nonzero_16_neon); + p.cuBLOCK_32x32.count_nonzero = PFX(count_nonzero_32_neon); + + // cpy2Dto1D_shl + p.cuBLOCK_4x4.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16x16_sve); + p.cuBLOCK_32x32.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32x32_sve); + p.cuBLOCK_64x64.cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_64x64_sve); + + // cpy2Dto1D_shr + p.cuBLOCK_4x4.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_4x4_neon); + p.cuBLOCK_8x8.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_8x8_neon); + p.cuBLOCK_16x16.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_sve); + p.cuBLOCK_32x32.cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_sve); + + // cpy1Dto2D_shl + p.cuBLOCK_4x4.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_16x16_sve); + p.cuBLOCK_32x32.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_32x32_sve); + p.cuBLOCK_64x64.cpy1Dto2D_shlALIGNED = PFX(cpy1Dto2D_shl_64x64_sve); + + p.cuBLOCK_4x4.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_16x16_sve); + p.cuBLOCK_32x32.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_32x32_sve); + p.cuBLOCK_64x64.cpy1Dto2D_shlNONALIGNED = PFX(cpy1Dto2D_shl_64x64_sve); + + // cpy1Dto2D_shr + p.cuBLOCK_4x4.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_4x4_neon); + p.cuBLOCK_8x8.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_8x8_neon); + p.cuBLOCK_16x16.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16x16_sve); + p.cuBLOCK_32x32.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32x32_sve); + p.cuBLOCK_64x64.cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_64x64_sve); + +#if 
!HIGH_BIT_DEPTH + // pixel_avg_pp + LUMA_PU_NEON_2(pixelavg_ppNONALIGNED, pixel_avg_pp); + LUMA_PU_MULTIPLE_ARCHS_3(pixelavg_ppNONALIGNED, pixel_avg_pp, sve2); + LUMA_PU_NEON_2(pixelavg_ppALIGNED, pixel_avg_pp); + LUMA_PU_MULTIPLE_ARCHS_3(pixelavg_ppALIGNED, pixel_avg_pp, sve2); + + // addAvg + LUMA_PU_NEON_3(addAvgNONALIGNED, addAvg); + LUMA_PU_CAN_USE_SVE2(addAvgNONALIGNED, addAvg); + LUMA_PU_NEON_3(addAvgALIGNED, addAvg); + LUMA_PU_CAN_USE_SVE2(addAvgALIGNED, addAvg); + CHROMA_420_PU_NEON_2(addAvgNONALIGNED, addAvg); + CHROMA_420_PU_MULTIPLE_ARCHS(addAvgNONALIGNED, addAvg, sve2); + CHROMA_420_PU_NEON_2(addAvgALIGNED, addAvg); + CHROMA_420_PU_MULTIPLE_ARCHS(addAvgALIGNED, addAvg, sve2); + CHROMA_422_PU_NEON_2(addAvgNONALIGNED, addAvg); + CHROMA_422_PU_CAN_USE_SVE2(addAvgNONALIGNED, addAvg); + CHROMA_422_PU_NEON_2(addAvgALIGNED, addAvg); + CHROMA_422_PU_CAN_USE_SVE2(addAvgALIGNED, addAvg); + + // sad + ALL_LUMA_PU(sad, pixel_sad, sve2); + ALL_LUMA_PU(sad_x3, sad_x3, sve2); + ALL_LUMA_PU(sad_x4, sad_x4, sve2); + + // sse_pp + p.cuBLOCK_4x4.sse_pp = PFX(pixel_sse_pp_4x4_sve); + p.cuBLOCK_8x8.sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.cuBLOCK_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.cuBLOCK_32x32.sse_pp = PFX(pixel_sse_pp_32x32_sve2); + p.cuBLOCK_64x64.sse_pp = PFX(pixel_sse_pp_64x64_sve2); + + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sse_pp = PFX(pixel_sse_pp_4x4_sve); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sse_pp = PFX(pixel_sse_pp_32x32_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sse_pp = PFX(pixel_sse_pp_4x8_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sse_pp = PFX(pixel_sse_pp_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sse_pp = PFX(pixel_sse_pp_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sse_pp = PFX(pixel_sse_pp_32x64_sve2); + + // sse_ss + p.cuBLOCK_4x4.sse_ss = PFX(pixel_sse_ss_4x4_sve2); + p.cuBLOCK_8x8.sse_ss = PFX(pixel_sse_ss_8x8_sve2); + p.cuBLOCK_16x16.sse_ss = PFX(pixel_sse_ss_16x16_sve2); + p.cuBLOCK_32x32.sse_ss = PFX(pixel_sse_ss_32x32_sve2); + p.cuBLOCK_64x64.sse_ss = PFX(pixel_sse_ss_64x64_sve2); + + // ssd_s + p.cuBLOCK_4x4.ssd_sNONALIGNED = PFX(pixel_ssd_s_4x4_sve2); + p.cuBLOCK_8x8.ssd_sNONALIGNED = PFX(pixel_ssd_s_8x8_sve2); + p.cuBLOCK_16x16.ssd_sNONALIGNED = PFX(pixel_ssd_s_16x16_sve2); + p.cuBLOCK_32x32.ssd_sNONALIGNED = PFX(pixel_ssd_s_32x32_sve2); + + p.cuBLOCK_4x4.ssd_sALIGNED = PFX(pixel_ssd_s_4x4_sve2); + p.cuBLOCK_8x8.ssd_sALIGNED = PFX(pixel_ssd_s_8x8_sve2); + p.cuBLOCK_16x16.ssd_sALIGNED = PFX(pixel_ssd_s_16x16_sve2); + p.cuBLOCK_32x32.ssd_sALIGNED = PFX(pixel_ssd_s_32x32_sve2); + + // pixel_var + p.cuBLOCK_8x8.var = PFX(pixel_var_8x8_sve2); + p.cuBLOCK_16x16.var = PFX(pixel_var_16x16_sve2); + p.cuBLOCK_32x32.var = PFX(pixel_var_32x32_sve2); + p.cuBLOCK_64x64.var = PFX(pixel_var_64x64_sve2); + + // calc_Residual + p.cuBLOCK_4x4.calcresidualNONALIGNED = PFX(getResidual4_neon); + p.cuBLOCK_8x8.calcresidualNONALIGNED = PFX(getResidual8_neon); + p.cuBLOCK_16x16.calcresidualNONALIGNED = PFX(getResidual16_sve2); + p.cuBLOCK_32x32.calcresidualNONALIGNED = PFX(getResidual32_sve2); + + p.cuBLOCK_4x4.calcresidualALIGNED = PFX(getResidual4_neon); + p.cuBLOCK_8x8.calcresidualALIGNED = PFX(getResidual8_neon); + p.cuBLOCK_16x16.calcresidualALIGNED = PFX(getResidual16_sve2); + p.cuBLOCK_32x32.calcresidualALIGNED = PFX(getResidual32_sve2); + + // pixel_sub_ps + p.cuBLOCK_4x4.sub_ps = 
PFX(pixel_sub_ps_4x4_neon); + p.cuBLOCK_8x8.sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.cuBLOCK_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.cuBLOCK_32x32.sub_ps = PFX(pixel_sub_ps_32x32_sve2); + p.cuBLOCK_64x64.sub_ps = PFX(pixel_sub_ps_64x64_sve2); + + // chroma sub_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.sub_ps = PFX(pixel_sub_ps_32x32_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.sub_ps = PFX(pixel_sub_ps_4x8_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sub_ps = PFX(pixel_sub_ps_8x16_sve); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sub_ps = PFX(pixel_sub_ps_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sub_ps = PFX(pixel_sub_ps_32x64_sve2); + + // pixel_add_ps + p.cuBLOCK_4x4.add_psNONALIGNED = PFX(pixel_add_ps_4x4_sve2); + p.cuBLOCK_8x8.add_psNONALIGNED = PFX(pixel_add_ps_8x8_sve2); + p.cuBLOCK_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_sve2); + p.cuBLOCK_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_sve2); + p.cuBLOCK_64x64.add_psNONALIGNED = PFX(pixel_add_ps_64x64_sve2); + + p.cuBLOCK_4x4.add_psALIGNED = PFX(pixel_add_ps_4x4_sve2); + p.cuBLOCK_8x8.add_psALIGNED = PFX(pixel_add_ps_8x8_sve2); + p.cuBLOCK_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_sve2); + p.cuBLOCK_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_sve2); + p.cuBLOCK_64x64.add_psALIGNED = PFX(pixel_add_ps_64x64_sve2); + + // chroma add_ps + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psNONALIGNED = PFX(pixel_add_ps_4x4_sve2); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psNONALIGNED = PFX(pixel_add_ps_8x8_sve2); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psNONALIGNED = PFX(pixel_add_ps_16x16_sve2); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psNONALIGNED = PFX(pixel_add_ps_32x32_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psNONALIGNED = PFX(pixel_add_ps_4x8_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psNONALIGNED = PFX(pixel_add_ps_8x16_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psNONALIGNED = PFX(pixel_add_ps_16x32_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psNONALIGNED = PFX(pixel_add_ps_32x64_sve2); + + p.chromaX265_CSP_I420.cuBLOCK_420_4x4.add_psALIGNED = PFX(pixel_add_ps_4x4_sve2); + p.chromaX265_CSP_I420.cuBLOCK_420_8x8.add_psALIGNED = PFX(pixel_add_ps_8x8_sve2); + p.chromaX265_CSP_I420.cuBLOCK_420_16x16.add_psALIGNED = PFX(pixel_add_ps_16x16_sve2); + p.chromaX265_CSP_I420.cuBLOCK_420_32x32.add_psALIGNED = PFX(pixel_add_ps_32x32_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_4x8.add_psALIGNED = PFX(pixel_add_ps_4x8_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.add_psALIGNED = PFX(pixel_add_ps_8x16_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.add_psALIGNED = PFX(pixel_add_ps_16x32_sve2); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.add_psALIGNED = PFX(pixel_add_ps_32x64_sve2); + + //scale2D_64to32 + p.scale2D_64to32 = PFX(scale2D_64to32_neon); + + // scale1D_128to64 + p.scale1D_128to64NONALIGNED = PFX(scale1D_128to64_sve2); + p.scale1D_128to64ALIGNED = PFX(scale1D_128to64_sve2); + + // planecopy + p.planecopy_cp = PFX(pixel_planecopy_cp_neon); + + // satd + p.puLUMA_4x4.satd = PFX(pixel_satd_4x4_sve); + p.puLUMA_8x8.satd = PFX(pixel_satd_8x8_neon); + p.puLUMA_16x16.satd = PFX(pixel_satd_16x16_neon); + p.puLUMA_32x32.satd = PFX(pixel_satd_32x32_sve); + p.puLUMA_64x64.satd = PFX(pixel_satd_64x64_neon); + p.puLUMA_8x4.satd = 
PFX(pixel_satd_8x4_sve); + p.puLUMA_4x8.satd = PFX(pixel_satd_4x8_neon); + p.puLUMA_16x8.satd = PFX(pixel_satd_16x8_neon); + p.puLUMA_8x16.satd = PFX(pixel_satd_8x16_neon); + p.puLUMA_16x32.satd = PFX(pixel_satd_16x32_neon); + p.puLUMA_32x16.satd = PFX(pixel_satd_32x16_sve); + p.puLUMA_64x32.satd = PFX(pixel_satd_64x32_neon); + p.puLUMA_32x64.satd = PFX(pixel_satd_32x64_neon); + p.puLUMA_16x12.satd = PFX(pixel_satd_16x12_neon); + p.puLUMA_12x16.satd = PFX(pixel_satd_12x16_neon); + p.puLUMA_16x4.satd = PFX(pixel_satd_16x4_neon); + p.puLUMA_4x16.satd = PFX(pixel_satd_4x16_neon); + p.puLUMA_32x24.satd = PFX(pixel_satd_32x24_neon); + p.puLUMA_24x32.satd = PFX(pixel_satd_24x32_neon); + p.puLUMA_32x8.satd = PFX(pixel_satd_32x8_neon); + p.puLUMA_8x32.satd = PFX(pixel_satd_8x32_neon); + p.puLUMA_64x48.satd = PFX(pixel_satd_64x48_sve); + p.puLUMA_48x64.satd = PFX(pixel_satd_48x64_neon); + p.puLUMA_64x16.satd = PFX(pixel_satd_64x16_neon); + p.puLUMA_16x64.satd = PFX(pixel_satd_16x64_neon); + + p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd = PFX(pixel_satd_4x4_sve); + p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd = PFX(pixel_satd_8x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = PFX(pixel_satd_16x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = PFX(pixel_satd_32x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd = PFX(pixel_satd_8x4_sve); + p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd = PFX(pixel_satd_4x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd = PFX(pixel_satd_16x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd = PFX(pixel_satd_8x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = PFX(pixel_satd_32x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = PFX(pixel_satd_16x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = PFX(pixel_satd_16x12_neon); + p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = PFX(pixel_satd_12x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd = PFX(pixel_satd_16x4_neon); + p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd = PFX(pixel_satd_4x16_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = PFX(pixel_satd_32x24_neon); + p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = PFX(pixel_satd_24x32_neon); + p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd = PFX(pixel_satd_32x8_neon); + p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd = PFX(pixel_satd_8x32_neon); + + p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd = PFX(pixel_satd_4x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd = PFX(pixel_satd_8x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = PFX(pixel_satd_16x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = PFX(pixel_satd_32x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd = PFX(pixel_satd_4x4_sve); + p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd = PFX(pixel_satd_8x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd = PFX(pixel_satd_4x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = PFX(pixel_satd_16x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd = PFX(pixel_satd_8x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = PFX(pixel_satd_32x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = PFX(pixel_satd_16x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd = PFX(pixel_satd_8x12_sve); + p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd = PFX(pixel_satd_8x4_sve); + p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = PFX(pixel_satd_16x24_neon); + p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = PFX(pixel_satd_12x32_neon); + 
p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd = PFX(pixel_satd_16x8_neon); + p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd = PFX(pixel_satd_4x32_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = PFX(pixel_satd_32x48_neon); + p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = PFX(pixel_satd_24x64_neon); + p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = PFX(pixel_satd_32x16_neon); + p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd = PFX(pixel_satd_8x64_neon); + + // sa8d + p.cuBLOCK_4x4.sa8d = PFX(pixel_satd_4x4_sve); + p.cuBLOCK_8x8.sa8d = PFX(pixel_sa8d_8x8_neon); + p.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon); + p.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon); + p.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon); + p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d = PFX(pixel_satd_4x4_sve); + p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = PFX(pixel_sa8d_16x16_neon); + p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = PFX(pixel_sa8d_32x32_neon); + p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = PFX(pixel_sa8d_64x64_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_8x16.sa8d = PFX(pixel_sa8d_8x16_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_16x32.sa8d = PFX(pixel_sa8d_16x32_neon); + p.chromaX265_CSP_I422.cuBLOCK_422_32x64.sa8d = PFX(pixel_sa8d_32x64_neon); + + // dequant_scaling + p.dequant_scaling = PFX(dequant_scaling_sve2); + p.dequant_normal = PFX(dequant_normal_sve2); + + // ssim_4x4x2_core + p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_sve2); + + // ssimDist + p.cuBLOCK_4x4.ssimDist = PFX(ssimDist4_sve2); + p.cuBLOCK_8x8.ssimDist = PFX(ssimDist8_sve2); + p.cuBLOCK_16x16.ssimDist = PFX(ssimDist16_sve2); + p.cuBLOCK_32x32.ssimDist = PFX(ssimDist32_sve2); + p.cuBLOCK_64x64.ssimDist = PFX(ssimDist64_sve2); + + // normFact + p.cuBLOCK_8x8.normFact = PFX(normFact8_sve2); + p.cuBLOCK_16x16.normFact = PFX(normFact16_sve2); + p.cuBLOCK_32x32.normFact = PFX(normFact32_sve2); + p.cuBLOCK_64x64.normFact = PFX(normFact64_sve2); + + // psy_cost_pp + p.cuBLOCK_4x4.psy_cost_pp = PFX(psyCost_4x4_neon); + + p.weight_pp = PFX(weight_pp_neon); +#if !defined(__APPLE__) + p.scanPosLast = PFX(scanPosLast_neon); +#endif + p.costCoeffNxN = PFX(costCoeffNxN_neon); +#endif + + // quant + p.quant = PFX(quant_sve); + p.nquant = PFX(nquant_neon); +} +#endif + +void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) +{ + +#ifdef HAVE_SVE2 + if (cpuMask & X265_CPU_SVE2) + { + setupSve2Primitives(p); } + else if (cpuMask & X265_CPU_SVE) + { + setupSvePrimitives(p); + } + else if (cpuMask & X265_CPU_NEON) + { + setupNeonPrimitives(p); + } + +#elif defined(HAVE_SVE) + if (cpuMask & X265_CPU_SVE) + { + setupSvePrimitives(p); + } + else if (cpuMask & X265_CPU_NEON) + { + setupNeonPrimitives(p); + } + +#else + if (cpuMask & X265_CPU_NEON) + { + setupNeonPrimitives(p); + } +#endif + } } // namespace X265_NS
x265_3.6.tar.gz/source/common/aarch64/asm-sve.S
Added
@@ -0,0 +1,39 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.arch armv8-a+sve + +.macro ABS2_SVE a b c + abs \a, \c\()/m, \a + abs \b, \c\()/m, \b +.endm + +.macro ABS8_SVE z0, z1, z2, z3, z4, z5, z6, z7, p0 + ABS2_SVE \z0, \z1, p0 + ABS2_SVE \z2, \z3, p0 + ABS2_SVE \z4, \z5, p0 + ABS2_SVE \z6, \z7, p0 +.endm +
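Note (illustration only, not part of the new file): the ABS2_SVE/ABS8_SVE macros use the merging predicated form of abs ("/m"), so lanes outside the predicate keep their previous contents. Expressed with ACLE intrinsics, assuming <arm_sve.h> and a compiler targeting SVE (e.g. -march=armv8-a+sve):

#include <arm_sve.h>

/* Merging absolute value over 16-bit lanes: active lanes (pg true) become
   |v|, inactive lanes keep their old value - the behaviour the ABS2_SVE
   macro above relies on. */
static inline svint16_t abs_merge_s16(svint16_t v, svbool_t pg)
{
    return svabs_s16_m(v, pg, v);
}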
x265_3.5.tar.gz/source/common/aarch64/asm.S -> x265_3.6.tar.gz/source/common/aarch64/asm.S
Changed
@@ -1,7 +1,8 @@ /***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc + * Copyright (C) 2020-2021 MulticoreWare, Inc * * Authors: Hongbin Liu <liuhongbin1@huawei.com> + * Sebastian Pop <spop@amazon.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -21,34 +22,74 @@ * For more information, contact us at license @ x265.com. *****************************************************************************/ +#ifndef ASM_S_ // #include guards +#define ASM_S_ + .arch armv8-a +#define PFX3(prefix, name) prefix ## _ ## name +#define PFX2(prefix, name) PFX3(prefix, name) +#define PFX(name) PFX2(X265_NS, name) + +#ifdef __APPLE__ +#define PREFIX 1 +#endif + #ifdef PREFIX #define EXTERN_ASM _ +#define HAVE_AS_FUNC 0 +#elif defined __clang__ +#define EXTERN_ASM +#define HAVE_AS_FUNC 0 +#define PREFIX 1 #else #define EXTERN_ASM +#define HAVE_AS_FUNC 1 #endif #ifdef __ELF__ #define ELF #else +#ifdef PREFIX +#define ELF # +#else #define ELF @ #endif - -#define HAVE_AS_FUNC 1 +#endif #if HAVE_AS_FUNC #define FUNC #else +#ifdef PREFIX +#define FUNC # +#else #define FUNC @ #endif +#endif + +#define GLUE(a, b) a ## b +#define JOIN(a, b) GLUE(a, b) + +#define PFX_C(name) JOIN(JOIN(JOIN(EXTERN_ASM, X265_NS), _), name) + +#ifdef __APPLE__ +.macro endfunc +ELF .size \name, . - \name +FUNC .endfunc +.endm +#endif .macro function name, export=1 +#ifdef __APPLE__ + .global \name + endfunc +#else .macro endfunc ELF .size \name, . - \name FUNC .endfunc .purgem endfunc .endm +#endif .align 2 .if \export == 1 .global EXTERN_ASM\name @@ -64,6 +105,83 @@ .endif .endm +.macro const name, align=2 + .macro endconst +ELF .size \name, . - \name + .purgem endconst + .endm +#ifdef __MACH__ + .const_data +#else + .section .rodata +#endif + .align \align +\name: +.endm + +.macro movrel rd, val, offset=0 +#if defined(__APPLE__) + .if \offset < 0 + adrp \rd, \val@PAGE + add \rd, \rd, \val@PAGEOFF + sub \rd, \rd, -(\offset) + .else + adrp \rd, \val+(\offset)@PAGE + add \rd, \rd, \val+(\offset)@PAGEOFF + .endif +#elif defined(PIC) && defined(_WIN32) + .if \offset < 0 + adrp \rd, \val + add \rd, \rd, :lo12:\val + sub \rd, \rd, -(\offset) + .else + adrp \rd, \val+(\offset) + add \rd, \rd, :lo12:\val+(\offset) + .endif +#else + adrp \rd, \val+(\offset) + add \rd, \rd, :lo12:\val+(\offset) +#endif +.endm #define FENC_STRIDE 64 #define FDEC_STRIDE 32 + +.macro SUMSUB_AB sum, diff, a, b + add \sum, \a, \b + sub \diff, \a, \b +.endm + +.macro SUMSUB_ABCD s1, d1, s2, d2, a, b, c, d + SUMSUB_AB \s1, \d1, \a, \b + SUMSUB_AB \s2, \d2, \c, \d +.endm + +.macro HADAMARD4_V r1, r2, r3, r4, t1, t2, t3, t4 + SUMSUB_ABCD \t1, \t2, \t3, \t4, \r1, \r2, \r3, \r4 + SUMSUB_ABCD \r1, \r3, \r2, \r4, \t1, \t3, \t2, \t4 +.endm + +.macro ABS2 a b + abs \a, \a + abs \b, \b +.endm + +.macro ABS8 v0, v1, v2, v3, v4, v5, v6, v7 + ABS2 \v0, \v1 + ABS2 \v2, \v3 + ABS2 \v4, \v5 + ABS2 \v6, \v7 +.endm + +.macro vtrn t1, t2, s1, s2 + trn1 \t1, \s1, \s2 + trn2 \t2, \s1, \s2 +.endm + +.macro trn4 t1, t2, t3, t4, s1, s2, s3, s4 + vtrn \t1, \t2, \s1, \s2 + vtrn \t3, \t4, \s3, \s4 +.endm + +#endif \ No newline at end of file
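Note (illustration only): the SUMSUB_AB/SUMSUB_ABCD/HADAMARD4_V macros added to asm.S are the butterfly stages used by the SATD/SA8D kernels. What one HADAMARD4_V invocation computes, written as scalar C:

/* 4-point Hadamard butterfly: two sum/difference stages, as performed by
   HADAMARD4_V r1, r2, r3, r4, t1, t2, t3, t4. */
static void hadamard4(int r[4])
{
    int t0 = r[0] + r[1], t1 = r[0] - r[1];   /* first SUMSUB_ABCD  */
    int t2 = r[2] + r[3], t3 = r[2] - r[3];
    r[0] = t0 + t2;                           /* second SUMSUB_ABCD */
    r[2] = t0 - t2;
    r[1] = t1 + t3;
    r[3] = t1 - t3;
}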
x265_3.6.tar.gz/source/common/aarch64/blockcopy8-common.S
Added
@@ -0,0 +1,54 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// This file contains the macros written using NEON instruction set +// that are also used by the SVE2 functions + +#include "asm.S" + +.arch armv8-a + +// void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) +.macro cpy1Dto2D_shr_start + add x2, x2, x2 + dup v0.8h, w3 + cmeq v1.8h, v1.8h, v1.8h + sshl v1.8h, v1.8h, v0.8h + sri v1.8h, v1.8h, #1 + neg v0.8h, v0.8h +.endm + +.macro cpy2Dto1D_shr_start + add x2, x2, x2 + dup v0.8h, w3 + cmeq v1.8h, v1.8h, v1.8h + sshl v1.8h, v1.8h, v0.8h + sri v1.8h, v1.8h, #1 + neg v0.8h, v0.8h +.endm + +const xtn_xtn2_table, align=4 +.byte 0, 2, 4, 6, 8, 10, 12, 14 +.byte 16, 18, 20, 22, 24, 26, 28, 30 +endconst +
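Note (illustration only, not part of the new file): the cpy1Dto2D_shr_start/cpy2Dto1D_shr_start macros only load the shift and build the rounding constant (1 << (shift - 1)) in vector registers; the per-block kernels then apply a rounding arithmetic right shift while copying between flat and strided buffers. A plain-C sketch of the operation these kernels are expected to perform (the function name and the square block-size parameter are assumptions for the example):

#include <stdint.h>
#include <stddef.h>

static void cpy2Dto1D_shr_ref(int16_t *dst, const int16_t *src,
                              intptr_t srcStride, int shift, int size)
{
    const int16_t round = (int16_t)(1 << (shift - 1));
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            dst[x] = (int16_t)((src[x] + round) >> shift);  /* rounding >> */
        src += srcStride;   /* 2D source advances by its stride */
        dst += size;        /* 1D destination is densely packed */
    }
}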
x265_3.6.tar.gz/source/common/aarch64/blockcopy8-sve.S
Added
@@ -0,0 +1,1416 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "blockcopy8-common.S" + +.arch armv8-a+sve + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +/* void blockcopy_sp(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb) + * + * r0 - a + * r1 - stridea + * r2 - b + * r3 - strideb */ + +function PFX(blockcopy_sp_4x4_sve) + ptrue p0.h, vl4 +.rept 2 + ld1h {z0.h}, p0/z, x2 + add x2, x2, x3, lsl #1 + st1b {z0.h}, p0, x0 + add x0, x0, x1 + ld1h {z1.h}, p0/z, x2 + add x2, x2, x3, lsl #1 + st1b {z1.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_8x8_sve) + ptrue p0.h, vl8 +.rept 4 + ld1h {z0.h}, p0/z, x2 + add x2, x2, x3, lsl #1 + st1b {z0.h}, p0, x0 + add x0, x0, x1 + ld1h {z1.h}, p0/z, x2 + add x2, x2, x3, lsl #1 + st1b {z1.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_16x16_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_sp_16_16 + lsl x3, x3, #1 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.rept 8 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + st1 {v0.16b}, x0, x1 + st1 {v1.16b}, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_sp_16_16: + ptrue p0.h, vl16 +.rept 8 + ld1h {z0.h}, p0/z, x2 + st1b {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 + ld1h {z1.h}, p0/z, x2 + st1b {z1.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_32x32_sve) + mov w12, #4 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_sp_32_32 + lsl x3, x3, #1 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.loop_csp32_sve: + sub w12, w12, #1 +.rept 4 + ld1 {v0.8h-v3.8h}, x2, x3 + ld1 {v4.8h-v7.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + tbl v2.16b, {v4.16b,v5.16b}, v31.16b + tbl v3.16b, {v6.16b,v7.16b}, v31.16b + st1 {v0.16b-v1.16b}, x0, x1 + st1 {v2.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_csp32_sve + ret +.vl_gt_16_blockcopy_sp_32_32: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_sp_32_32 + ptrue p0.h, vl16 +.vl_gt_16_loop_csp32_sve: + sub w12, w12, #1 +.rept 4 + ld1h {z0.h}, p0/z, x2 + ld1h {z1.h}, p0/z, x2, #1, mul vl + st1b {z0.h}, p0, x0 + st1b {z1.h}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1 + ld1h {z2.h}, p0/z, x2 + ld1h {z3.h}, p0/z, x2, #1, mul vl + st1b 
{z2.h}, p0, x0 + st1b {z3.h}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + cbnz w12, .vl_gt_16_loop_csp32_sve + ret +.vl_gt_48_blockcopy_sp_32_32: + ptrue p0.h, vl32 +.vl_gt_48_loop_csp32_sve: + sub w12, w12, #1 +.rept 4 + ld1h {z0.h}, p0/z, x2 + st1b {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 + ld1h {z1.h}, p0/z, x2 + st1b {z1.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + cbnz w12, .vl_gt_48_loop_csp32_sve + ret +endfunc + +function PFX(blockcopy_ps_16x16_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ps_16_16 + lsl x1, x1, #1 +.rept 8 + ld1 {v4.16b}, x2, x3 + ld1 {v5.16b}, x2, x3 + uxtl v0.8h, v4.8b + uxtl2 v1.8h, v4.16b + uxtl v2.8h, v5.8b + uxtl2 v3.8h, v5.16b + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_ps_16_16: + ptrue p0.b, vl32 +.rept 16 + ld1b {z1.h}, p0/z, x2 + st1h {z1.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +endfunc + +function PFX(blockcopy_ps_32x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ps_32_32 + lsl x1, x1, #1 + mov w12, #4 +.loop_cps32_sve: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v17.16b}, x2, x3 + ld1 {v18.16b-v19.16b}, x2, x3 + uxtl v0.8h, v16.8b + uxtl2 v1.8h, v16.16b + uxtl v2.8h, v17.8b + uxtl2 v3.8h, v17.16b + uxtl v4.8h, v18.8b + uxtl2 v5.8h, v18.16b + uxtl v6.8h, v19.8b + uxtl2 v7.8h, v19.16b + st1 {v0.8h-v3.8h}, x0, x1 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_cps32_sve + ret +.vl_gt_16_blockcopy_ps_32_32: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_ps_32_32 + ptrue p0.b, vl32 +.rept 32 + ld1b {z2.h}, p0/z, x2 + ld1b {z3.h}, p0/z, x2, #1, mul vl + st1h {z2.h}, p0, x0 + st1h {z3.h}, p0, x0, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +.vl_gt_48_blockcopy_ps_32_32: + ptrue p0.b, vl64 +.rept 32 + ld1b {z2.h}, p0/z, x2 + st1h {z2.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +endfunc + +function PFX(blockcopy_ps_64x64_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ps_64_64 + lsl x1, x1, #1 + sub x1, x1, #64 + mov w12, #16 +.loop_cps64_sve: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v19.16b}, x2, x3 + uxtl v0.8h, v16.8b + uxtl2 v1.8h, v16.16b + uxtl v2.8h, v17.8b + uxtl2 v3.8h, v17.16b + uxtl v4.8h, v18.8b + uxtl2 v5.8h, v18.16b + uxtl v6.8h, v19.8b + uxtl2 v7.8h, v19.16b + st1 {v0.8h-v3.8h}, x0, #64 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_cps64_sve + ret +.vl_gt_16_blockcopy_ps_64_64: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_ps_64_64 + ptrue p0.b, vl32 +.rept 64 + ld1b {z4.h}, p0/z, x2 + ld1b {z5.h}, p0/z, x2, #1, mul vl + ld1b {z6.h}, p0/z, x2, #2, mul vl + ld1b {z7.h}, p0/z, x2, #3, mul vl + st1h {z4.h}, p0, x0 + st1h {z5.h}, p0, x0, #1, mul vl + st1h {z6.h}, p0, x0, #2, mul vl + st1h {z7.h}, p0, x0, #3, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +.vl_gt_48_blockcopy_ps_64_64: + cmp x9, #112 + bgt .vl_gt_112_blockcopy_ps_64_64 + ptrue p0.b, vl64 +.rept 64 + ld1b {z4.h}, p0/z, x2 + ld1b {z5.h}, p0/z, x2, #1, mul vl + st1h {z4.h}, p0, x0 + st1h {z5.h}, p0, x0, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +.vl_gt_112_blockcopy_ps_64_64: + ptrue p0.b, vl128 +.rept 64 + ld1b {z4.h}, p0/z, x2 + st1h {z4.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret + +endfunc + +function PFX(blockcopy_ss_16x16_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ss_16_16 + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 8 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + st1 {v0.8h-v1.8h}, x0, 
x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_ss_16_16: + ptrue p0.h, vl16 +.rept 16 + ld1h {z0.h}, p0/z, x2 + st1h {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_32x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ss_32_32 + lsl x1, x1, #1 + lsl x3, x3, #1 + mov w12, #4 +.loop_css32_sve: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8h-v3.8h}, x2, x3 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + cbnz w12, .loop_css32_sve + ret +.vl_gt_16_blockcopy_ss_32_32: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_ss_32_32 + ptrue p0.h, vl16 +.rept 32 + ld1h {z0.h}, p0/z, x2 + ld1h {z1.h}, p0/z, x2, #1, mul vl + st1h {z0.h}, p0, x0 + st1h {z1.h}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + ret +.vl_gt_48_blockcopy_ss_32_32: + ptrue p0.h, vl32 +.rept 32 + ld1h {z0.h}, p0/z, x2 + st1h {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_64x64_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ss_64_64 + lsl x1, x1, #1 + sub x1, x1, #64 + lsl x3, x3, #1 + sub x3, x3, #64 + mov w12, #8 +.loop_css64_sve: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8h-v3.8h}, x2, #64 + ld1 {v4.8h-v7.8h}, x2, x3 + st1 {v0.8h-v3.8h}, x0, #64 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_css64_sve + ret +.vl_gt_16_blockcopy_ss_64_64: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_ss_64_64 + mov w12, #8 + ptrue p0.b, vl32 +.vl_gt_16_loop_css64_sve: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, p0/z, x2, #1, mul vl + ld1b {z2.b}, p0/z, x2, #2, mul vl + ld1b {z3.b}, p0/z, x2, #3, mul vl + st1b {z0.b}, p0, x0 + st1b {z1.b}, p0, x0, #1, mul vl + st1b {z2.b}, p0, x0, #2, mul vl + st1b {z3.b}, p0, x0, #3, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_16_loop_css64_sve + ret +.vl_gt_48_blockcopy_ss_64_64: + cmp x9, #112 + bgt .vl_gt_112_blockcopy_ss_64_64 + mov w12, #8 + ptrue p0.b, vl64 +.vl_gt_48_loop_css64_sve: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, p0/z, x2, #1, mul vl + st1b {z0.b}, p0, x0 + st1b {z1.b}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_48_loop_css64_sve + ret +.vl_gt_112_blockcopy_ss_64_64: + mov w12, #8 + ptrue p0.b, vl128 +.vl_gt_112_loop_css64_sve: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + st1b {z0.b}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_112_loop_css64_sve + ret +endfunc + +/******** Chroma blockcopy********/ +function PFX(blockcopy_ss_16x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ss_16_32 + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 16 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_ss_16_32: + ptrue p0.h, vl16 +.rept 32 + ld1h {z0.h}, p0/z, x2 + st1h {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_32x64_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ss_32_64 + lsl x1, x1, #1 + lsl x3, x3, #1 + mov w12, #8 +.loop_css32x64_sve: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8h-v3.8h}, x2, x3 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + cbnz w12, .loop_css32x64_sve + ret +.vl_gt_16_blockcopy_ss_32_64: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_ss_32_64 + mov w12, #8 + ptrue p0.b, vl32 +.vl_gt_32_loop_css32x64_sve: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, 
p0/z, x2, #1, mul vl + st1b {z0.b}, p0, x0 + st1b {z1.b}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_32_loop_css32x64_sve + ret +.vl_gt_48_blockcopy_ss_32_64: + mov w12, #8 + ptrue p0.b, vl64 +.vl_gt_48_loop_css32x64_sve: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + st1b {z0.b}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_48_loop_css32x64_sve + ret +endfunc + +// chroma blockcopy_ps +function PFX(blockcopy_ps_4x8_sve) + ptrue p0.h, vl4 +.rept 8 + ld1b {z0.h}, p0/z, x2 + st1h {z0.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +endfunc + +function PFX(blockcopy_ps_8x16_sve) + ptrue p0.h, vl8 +.rept 16 + ld1b {z0.h}, p0/z, x2 + st1h {z0.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +endfunc + +function PFX(blockcopy_ps_16x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ps_16_32 + lsl x1, x1, #1 +.rept 16 + ld1 {v4.16b}, x2, x3 + ld1 {v5.16b}, x2, x3 + uxtl v0.8h, v4.8b + uxtl2 v1.8h, v4.16b + uxtl v2.8h, v5.8b + uxtl2 v3.8h, v5.16b + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_ps_16_32: + ptrue p0.b, vl32 +.rept 32 + ld1b {z1.h}, p0/z, x2 + st1h {z1.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +endfunc + +function PFX(blockcopy_ps_32x64_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_ps_32_64 + lsl x1, x1, #1 + mov w12, #8 +.loop_cps32x64_sve: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v17.16b}, x2, x3 + ld1 {v18.16b-v19.16b}, x2, x3 + uxtl v0.8h, v16.8b + uxtl2 v1.8h, v16.16b + uxtl v2.8h, v17.8b + uxtl2 v3.8h, v17.16b + uxtl v4.8h, v18.8b + uxtl2 v5.8h, v18.16b + uxtl v6.8h, v19.8b + uxtl2 v7.8h, v19.16b + st1 {v0.8h-v3.8h}, x0, x1 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_cps32x64_sve + ret +.vl_gt_16_blockcopy_ps_32_64: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_ps_32_64 + ptrue p0.b, vl32 +.rept 64 + ld1b {z2.h}, p0/z, x2 + ld1b {z3.h}, p0/z, x2, #1, mul vl + st1h {z2.h}, p0, x0 + st1h {z3.h}, p0, x0, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +.vl_gt_48_blockcopy_ps_32_64: + ptrue p0.b, vl64 +.rept 64 + ld1b {z2.h}, p0/z, x2 + st1h {z2.h}, p0, x0 + add x0, x0, x1, lsl #1 + add x2, x2, x3 +.endr + ret +endfunc + +// chroma blockcopy_sp +function PFX(blockcopy_sp_4x8_sve) + ptrue p0.h, vl4 +.rept 8 + ld1h {z0.h}, p0/z, x2 + st1b {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_8x16_sve) + ptrue p0.h, vl8 +.rept 16 + ld1h {z0.h}, p0/z, x2 + st1b {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_16x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_sp_16_32 + ptrue p0.h, vl8 +.rept 32 + ld1h {z0.h}, p0/z, x2 + ld1h {z1.h}, p0/z, x2, #1, mul vl + st1b {z0.h}, p0, x0 + st1b {z1.h}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_sp_16_32: + ptrue p0.h, vl16 +.rept 32 + ld1h {z0.h}, p0/z, x2 + st1b {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_32x64_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_sp_32_64 + ptrue p0.h, vl8 +.rept 64 + ld1h {z0.h}, p0/z, x2 + ld1h {z1.h}, p0/z, x2, #1, mul vl + ld1h {z2.h}, p0/z, x2, #2, mul vl + ld1h {z3.h}, p0/z, x2, #3, mul vl + st1b {z0.h}, p0, x0 + st1b {z1.h}, p0, x0, #1, mul vl + st1b {z2.h}, p0, x0, #2, mul vl + st1b {z3.h}, p0, x0, #3, mul vl + add x2, 
x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_sp_32_64: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_sp_32_64 + ptrue p0.h, vl16 +.rept 64 + ld1h {z0.h}, p0/z, x2 + ld1h {z1.h}, p0/z, x2, #1, mul vl + st1b {z0.h}, p0, x0 + st1b {z1.h}, p0, x0, #1, mul vl + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +.vl_gt_48_blockcopy_sp_32_64: + ptrue p0.h, vl32 +.rept 64 + ld1h {z0.h}, p0/z, x2 + st1b {z0.h}, p0, x0 + add x2, x2, x3, lsl #1 + add x0, x0, x1 +.endr + ret +endfunc + +/* blockcopy_pp(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) */ + +function PFX(blockcopy_pp_32x8_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_pp_32_8 +.rept 8 + ld1 {v0.16b-v1.16b}, x2, x3 + st1 {v0.16b-v1.16b}, x0, x1 +.endr + ret +.vl_gt_16_blockcopy_pp_32_8: + ptrue p0.b, vl32 +.rept 8 + ld1b {z0.b}, p0/z, x2 + st1b {z0.b}, p0, x0 + add x2, x2, x3 + add x0, x0, x1 +.endr + ret +endfunc + +.macro blockcopy_pp_32xN_sve h +function PFX(blockcopy_pp_32x\h\()_sve) + mov w12, #\h / 8 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_pp_32xN_\h +.loop_sve_32x\h\(): + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v1.16b}, x2, x3 + st1 {v0.16b-v1.16b}, x0, x1 +.endr + cbnz w12, .loop_sve_32x\h + ret +.vl_gt_16_blockcopy_pp_32xN_\h: + ptrue p0.b, vl32 +.L_gt_16_blockcopy_pp_32xN_\h: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + st1b {z0.b}, p0, x0 + add x2, x2, x3 + add x0, x0, x1 +.endr + cbnz w12, .L_gt_16_blockcopy_pp_32xN_\h + ret +endfunc +.endm + +blockcopy_pp_32xN_sve 16 +blockcopy_pp_32xN_sve 24 +blockcopy_pp_32xN_sve 32 +blockcopy_pp_32xN_sve 64 +blockcopy_pp_32xN_sve 48 + +.macro blockcopy_pp_64xN_sve h +function PFX(blockcopy_pp_64x\h\()_sve) + mov w12, #\h / 4 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockcopy_pp_64xN_\h +.loop_sve_64x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v3.16b}, x2, x3 + st1 {v0.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_sve_64x\h + ret +.vl_gt_16_blockcopy_pp_64xN_\h: + cmp x9, #48 + bgt .vl_gt_48_blockcopy_pp_64xN_\h + ptrue p0.b, vl32 +.L_le_32_blockcopy_pp_64xN_\h: + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, p0/z, x2, #1, mul vl + st1b {z0.b}, p0, x0 + st1b {z1.b}, p0, x0, #1, mul vl + add x2, x2, x3 + add x0, x0, x1 +.endr + cbnz w12, .L_le_32_blockcopy_pp_64xN_\h + ret +.vl_gt_48_blockcopy_pp_64xN_\h: + ptrue p0.b, vl64 +.L_blockcopy_pp_64xN_\h: + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + st1b {z0.b}, p0, x0 + add x2, x2, x3 + add x0, x0, x1 +.endr + cbnz w12, .L_blockcopy_pp_64xN_\h + ret +endfunc +.endm + +blockcopy_pp_64xN_sve 16 +blockcopy_pp_64xN_sve 32 +blockcopy_pp_64xN_sve 48 +blockcopy_pp_64xN_sve 64 + +function PFX(blockfill_s_32x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_blockfill_s_32_32 + dup v0.8h, w2 + mov v1.16b, v0.16b + mov v2.16b, v0.16b + mov v3.16b, v0.16b + lsl x1, x1, #1 +.rept 32 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + ret +.vl_gt_16_blockfill_s_32_32: + cmp x9, #48 + bgt .vl_gt_48_blockfill_s_32_32 + dup z0.h, w2 + ptrue p0.h, vl16 +.rept 32 + st1h {z0.h}, p0, x0 + st1h {z0.h}, p0, x0, #1, mul vl + add x0, x0, x1, lsl #1 +.endr + ret +.vl_gt_48_blockfill_s_32_32: + dup z0.h, w2 + ptrue p0.h, vl32 +.rept 32 + st1h {z0.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + ret +endfunc + +// void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift) +.macro cpy2Dto1D_shl_start_sve + add x2, x2, x2 + mov z0.h, w3 +.endm + +function PFX(cpy2Dto1D_shl_16x16_sve) + dup z0.h, w3 + rdvl x9, #1 + cmp x9, #16 + bgt 
.vl_gt_16_cpy2Dto1D_shl_16x16 + cpy2Dto1D_shl_start_sve + mov w12, #4 +.loop_cpy2Dto1D_shl_16_sve: + sub w12, w12, #1 +.rept 4 + ld1 {v2.16b-v3.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b-v3.16b}, x0, #32 +.endr + cbnz w12, .loop_cpy2Dto1D_shl_16_sve + ret +.vl_gt_16_cpy2Dto1D_shl_16x16: + ptrue p0.h, vl16 +.rept 16 + ld1h {z1.h}, p0/z, x1 + lsl z1.h, p0/m, z1.h, z0.h + st1h {z1.h}, p0, x0 + add x1, x1, x2, lsl #1 + add x0, x0, #32 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shl_32x32_sve) + dup z0.h, w3 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy2Dto1D_shl_32x32 + cpy2Dto1D_shl_start_sve + mov w12, #16 +.loop_cpy2Dto1D_shl_32_sve: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, #64 +.endr + cbnz w12, .loop_cpy2Dto1D_shl_32_sve + ret +.vl_gt_16_cpy2Dto1D_shl_32x32: + cmp x9, #48 + bgt .vl_gt_48_cpy2Dto1D_shl_32x32 + ptrue p0.h, vl16 +.rept 32 + ld1h {z1.h}, p0/z, x1 + ld1h {z2.h}, p0/z, x1, #1, mul vl + lsl z1.h, p0/m, z1.h, z0.h + lsl z2.h, p0/m, z2.h, z0.h + st1h {z1.h}, p0, x0 + st1h {z2.h}, p0, x0, #1, mul vl + add x1, x1, x2, lsl #1 + add x0, x0, #64 +.endr + ret +.vl_gt_48_cpy2Dto1D_shl_32x32: + ptrue p0.h, vl32 +.rept 32 + ld1h {z1.h}, p0/z, x1 + lsl z1.h, p0/m, z1.h, z0.h + st1h {z1.h}, p0, x0 + add x1, x1, x2, lsl #1 + add x0, x0, #64 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shl_64x64_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy2Dto1D_shl_64x64 + cpy2Dto1D_shl_start_sve + mov w12, #32 + sub x2, x2, #64 +.loop_cpy2Dto1D_shl_64_sve: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + ld1 {v16.16b-v19.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + sshl v16.8h, v16.8h, v0.8h + sshl v17.8h, v17.8h, v0.8h + sshl v18.8h, v18.8h, v0.8h + sshl v19.8h, v19.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, #64 + st1 {v16.16b-v19.16b}, x0, #64 +.endr + cbnz w12, .loop_cpy2Dto1D_shl_64_sve + ret +.vl_gt_16_cpy2Dto1D_shl_64x64: + dup z0.h, w3 + mov x8, #64 + mov w12, #64 +.L_init_cpy2Dto1D_shl_64x64: + sub w12, w12, 1 + mov x9, #0 + whilelt p0.h, x9, x8 +.L_cpy2Dto1D_shl_64x64: + ld1h {z1.h}, p0/z, x1, x9, lsl #1 + lsl z1.h, p0/m, z1.h, z0.h + st1h {z1.h}, p0, x0, x9, lsl #1 + inch x9 + whilelt p0.h, x9, x8 + b.first .L_cpy2Dto1D_shl_64x64 + add x1, x1, x2, lsl #1 + addvl x0, x0, #1 + cbnz w12, .L_init_cpy2Dto1D_shl_64x64 + ret +endfunc + +// void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift) + +function PFX(cpy2Dto1D_shr_4x4_sve) + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + ptrue p0.h, vl8 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h + lsl x2, x2, #1 + index z3.d, #0, x2 + index z4.d, #0, #8 +.rept 2 + ld1d {z5.d}, p0/z, x1, z3.d + add x1, x1, x2, lsl #1 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0, z4.d + add x0, x0, #16 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shr_8x8_sve) + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + ptrue p0.h, vl8 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 8 + ld1d {z5.d}, p0/z, x1 + add x1, x1, x2, lsl #1 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0 + add x0, x0, #16 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shr_16x16_sve) + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy2Dto1D_shr_16x16 + ptrue p0.h, vl8 + mov z2.h, #1 + lsl 
z2.h, p0/m, z2.h, z1.h +.rept 16 + ld1d {z5.d}, p0/z, x1 + ld1d {z6.d}, p0/z, x1, #1, mul vl + add x1, x1, x2, lsl #1 + add z5.h, p0/m, z5.h, z2.h + add z6.h, p0/m, z6.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + asr z6.h, p0/m, z6.h, z0.h + st1d {z5.d}, p0, x0 + st1d {z6.d}, p0, x0, #1, mul vl + add x0, x0, #32 +.endr + ret +.vl_gt_16_cpy2Dto1D_shr_16x16: + ptrue p0.h, vl16 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 16 + ld1d {z5.d}, p0/z, x1 + add x1, x1, x2, lsl #1 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0 + add x0, x0, #32 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shr_32x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy2Dto1D_shr_32x32 + cpy2Dto1D_shr_start + mov w12, #16 +.loop_cpy2Dto1D_shr_32_sve: + sub w12, w12, #1 +.rept 2 + ld1 {v2.8h-v5.8h}, x1, x2 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sub v4.8h, v4.8h, v1.8h + sub v5.8h, v5.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.8h-v5.8h}, x0, #64 +.endr + cbnz w12, .loop_cpy2Dto1D_shr_32_sve + ret +.vl_gt_16_cpy2Dto1D_shr_32x32: + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + cmp x9, #48 + bgt .vl_gt_48_cpy2Dto1D_shr_32x32 + ptrue p0.h, vl16 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 32 + ld1d {z5.d}, p0/z, x1 + ld1d {z6.d}, p0/z, x1, #1, mul vl + add x1, x1, x2, lsl #1 + add z5.h, p0/m, z5.h, z2.h + add z6.h, p0/m, z6.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + asr z6.h, p0/m, z6.h, z0.h + st1d {z5.d}, p0, x0 + st1d {z6.d}, p0, x0, #1, mul vl + add x0, x0, #64 +.endr + ret +.vl_gt_48_cpy2Dto1D_shr_32x32: + ptrue p0.h, vl32 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 32 + ld1d {z5.d}, p0/z, x1 + add x1, x1, x2, lsl #1 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0 + add x0, x0, #64 +.endr + ret +endfunc + +// void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) + +function PFX(cpy1Dto2D_shl_16x16_sve) + dup z0.h, w3 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy1Dto2D_shl_16x16 + ptrue p0.h, vl8 +.rept 16 + ld1h {z1.h}, p0/z, x1 + ld1h {z2.h}, p0/z, x1, #1, mul vl + lsl z1.h, p0/m, z1.h, z0.h + lsl z2.h, p0/m, z2.h, z0.h + st1h {z1.h}, p0, x0 + st1h {z2.h}, p0, x0, #1, mul vl + add x1, x1, #32 + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_16_cpy1Dto2D_shl_16x16: + ptrue p0.h, vl16 +.rept 16 + ld1h {z1.h}, p0/z, x1 + lsl z1.h, p0/m, z1.h, z0.h + st1h {z1.h}, p0, x0 + add x1, x1, #32 + add x0, x0, x2, lsl #1 +.endr + ret +endfunc + +function PFX(cpy1Dto2D_shl_32x32_sve) + dup z0.h, w3 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy1Dto2D_shl_32x32 + ptrue p0.h, vl8 +.rept 32 + ld1h {z1.h}, p0/z, x1 + ld1h {z2.h}, p0/z, x1, #1, mul vl + ld1h {z3.h}, p0/z, x1, #2, mul vl + ld1h {z4.h}, p0/z, x1, #3, mul vl + lsl z1.h, p0/m, z1.h, z0.h + lsl z2.h, p0/m, z2.h, z0.h + lsl z3.h, p0/m, z3.h, z0.h + lsl z4.h, p0/m, z4.h, z0.h + st1h {z1.h}, p0, x0 + st1h {z2.h}, p0, x0, #1, mul vl + st1h {z3.h}, p0, x0, #2, mul vl + st1h {z4.h}, p0, x0, #3, mul vl + add x1, x1, #64 + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_16_cpy1Dto2D_shl_32x32: + cmp x9, #48 + bgt .vl_gt_48_cpy1Dto2D_shl_32x32 + ptrue p0.h, vl16 +.rept 32 + ld1h {z1.h}, p0/z, x1 + ld1h {z2.h}, p0/z, x1, #1, mul vl + lsl z1.h, p0/m, z1.h, z0.h + lsl z2.h, p0/m, z2.h, z0.h + st1h {z1.h}, p0, x0 + st1h {z2.h}, p0, x0, #1, mul vl + add x1, x1, #64 + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_48_cpy1Dto2D_shl_32x32: + ptrue p0.h, vl32 +.rept 32 + ld1h {z1.h}, p0/z, x1 + 
lsl z1.h, p0/m, z1.h, z0.h + st1h {z1.h}, p0, x0 + add x1, x1, #64 + add x0, x0, x2, lsl #1 +.endr + ret +endfunc + +function PFX(cpy1Dto2D_shl_64x64_sve) + dup z0.h, w3 + mov x8, #64 + mov w12, #64 +.L_init_cpy1Dto2D_shl_64x64: + sub w12, w12, 1 + mov x9, #0 + whilelt p0.h, x9, x8 +.L_cpy1Dto2D_shl_64x64: + ld1h {z1.h}, p0/z, x1, x9, lsl #1 + lsl z1.h, p0/m, z1.h, z0.h + st1h {z1.h}, p0, x0, x9, lsl #1 + inch x9 + whilelt p0.h, x9, x8 + b.first .L_cpy1Dto2D_shl_64x64 + addvl x1, x1, #1 + add x0, x0, x2, lsl #1 + cbnz w12, .L_init_cpy1Dto2D_shl_64x64 + ret +endfunc + +// void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) + +function PFX(cpy1Dto2D_shr_16x16_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy1Dto2D_shr_16x16 + cpy1Dto2D_shr_start + mov w12, #4 +.loop_cpy1Dto2D_shr_16: + sub w12, w12, #1 +.rept 4 + ld1 {v2.8h-v3.8h}, x1, #32 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.8h-v3.8h}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shr_16 + ret +.vl_gt_16_cpy1Dto2D_shr_16x16: + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + ptrue p0.h, vl16 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 16 + ld1d {z5.d}, p0/z, x1 + add x1, x1, #32 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0 + add x0, x0, x2, lsl #1 +.endr + ret +endfunc + +function PFX(cpy1Dto2D_shr_32x32_sve) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy1Dto2D_shr_32x32 + cpy1Dto2D_shr_start + mov w12, #16 +.loop_cpy1Dto2D_shr_32_sve: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sub v4.8h, v4.8h, v1.8h + sub v5.8h, v5.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shr_32_sve + ret +.vl_gt_16_cpy1Dto2D_shr_32x32: + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + cmp x9, #48 + bgt .vl_gt_48_cpy2Dto1D_shr_32x32 + ptrue p0.h, vl16 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 32 + ld1d {z5.d}, p0/z, x1 + ld1d {z6.d}, p0/z, x1, #1, mul vl + add x1, x1, #64 + add z5.h, p0/m, z5.h, z2.h + add z6.h, p0/m, z6.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + asr z6.h, p0/m, z6.h, z0.h + st1d {z5.d}, p0, x0 + st1d {z6.d}, p0, x0, #1, mul vl + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_48_cpy1Dto2D_shr_32x32: + ptrue p0.h, vl32 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 32 + ld1d {z5.d}, p0/z, x1 + add x1, x1, #64 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0 + add x0, x0, x2, lsl #1 +.endr + ret +endfunc + +function PFX(cpy1Dto2D_shr_64x64_sve) + dup z0.h, w3 + sub w4, w3, #1 + dup z1.h, w4 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_cpy1Dto2D_shr_64x64 + ptrue p0.h, vl8 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 128 + ld1d {z5.d}, p0/z, x1 + ld1d {z6.d}, p0/z, x1, #1, mul vl + ld1d {z7.d}, p0/z, x1, #2, mul vl + ld1d {z8.d}, p0/z, x1, #3, mul vl + ld1d {z9.d}, p0/z, x1, #4, mul vl + ld1d {z10.d}, p0/z, x1, #5, mul vl + ld1d {z11.d}, p0/z, x1, #6, mul vl + ld1d {z12.d}, p0/z, x1, #7, mul vl + add x1, x1, #128 + add z5.h, p0/m, z5.h, z2.h + add z6.h, p0/m, z6.h, z2.h + add z7.h, p0/m, z7.h, z2.h + add z8.h, p0/m, z8.h, z2.h + add z9.h, p0/m, z9.h, z2.h + add z10.h, p0/m, z10.h, z2.h + add z11.h, p0/m, z11.h, z2.h + add z12.h, p0/m, z12.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + asr z6.h, p0/m, z6.h, z0.h + asr z7.h, p0/m, z7.h, z0.h + asr z8.h, p0/m, z8.h, z0.h + 
asr z9.h, p0/m, z9.h, z0.h + asr z10.h, p0/m, z10.h, z0.h + asr z11.h, p0/m, z11.h, z0.h + asr z12.h, p0/m, z12.h, z0.h + st1d {z5.d}, p0, x0 + st1d {z6.d}, p0, x0, #1, mul vl + st1d {z7.d}, p0, x0, #2, mul vl + st1d {z8.d}, p0, x0, #3, mul vl + st1d {z9.d}, p0, x0, #4, mul vl + st1d {z10.d}, p0, x0, #5, mul vl + st1d {z11.d}, p0, x0, #6, mul vl + st1d {z12.d}, p0, x0, #7, mul vl + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_16_cpy1Dto2D_shr_64x64: + cmp x9, #48 + bgt .vl_gt_48_cpy1Dto2D_shr_64x64 + ptrue p0.h, vl16 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 128 + ld1d {z5.d}, p0/z, x1 + ld1d {z6.d}, p0/z, x1, #1, mul vl + ld1d {z7.d}, p0/z, x1, #2, mul vl + ld1d {z8.d}, p0/z, x1, #3, mul vl + add x1, x1, #128 + add z5.h, p0/m, z5.h, z2.h + add z6.h, p0/m, z6.h, z2.h + add z7.h, p0/m, z7.h, z2.h + add z8.h, p0/m, z8.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + asr z6.h, p0/m, z6.h, z0.h + asr z7.h, p0/m, z7.h, z0.h + asr z8.h, p0/m, z8.h, z0.h + st1d {z5.d}, p0, x0 + st1d {z6.d}, p0, x0, #1, mul vl + st1d {z7.d}, p0, x0, #2, mul vl + st1d {z8.d}, p0, x0, #3, mul vl + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_48_cpy1Dto2D_shr_64x64: + cmp x9, #112 + bgt .vl_gt_112_cpy1Dto2D_shr_64x64 + ptrue p0.h, vl32 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 128 + ld1d {z5.d}, p0/z, x1 + ld1d {z6.d}, p0/z, x1, #1, mul vl + add x1, x1, #128 + add z5.h, p0/m, z5.h, z2.h + add z6.h, p0/m, z6.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + asr z6.h, p0/m, z6.h, z0.h + st1d {z5.d}, p0, x0 + st1d {z6.d}, p0, x0, #1, mul vl + add x0, x0, x2, lsl #1 +.endr + ret +.vl_gt_112_cpy1Dto2D_shr_64x64: + ptrue p0.h, vl64 + mov z2.h, #1 + lsl z2.h, p0/m, z2.h, z1.h +.rept 128 + ld1d {z5.d}, p0/z, x1 + add x1, x1, #128 + add z5.h, p0/m, z5.h, z2.h + asr z5.h, p0/m, z5.h, z0.h + st1d {z5.d}, p0, x0 + add x0, x0, x2, lsl #1 +.endr + ret +endfunc
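The SVE kernels above all share one dispatch idiom: rdvl x9, #1 reads the implemented SVE vector length in bytes, and when it is 16 (a 128-bit implementation) the routine falls through to a NEON-style ld1/st1 path, otherwise it branches to predicated SVE code (ld1h/st1h governed by a ptrue p0.h, vlN predicate) sized for 256-bit and wider vectors. For readers who want the data movement itself spelled out, here is a minimal scalar sketch of what every blockcopy_ss_WxH variant computes, matching the signature comment in blockcopy8.S below; the template wrapper and the name blockcopy_ss_ref are illustrative assumptions, not part of the x265 API.

    // Minimal scalar sketch of the blockcopy_ss_WxH kernels (illustrative only).
    // Strides are in int16_t elements, which is why the assembly doubles them
    // (lsl x1/x3, #1) before using them as byte offsets.
    #include <cstdint>
    #include <cstring>

    template <int W, int H>
    void blockcopy_ss_ref(int16_t *a, intptr_t stridea,        // destination block
                          const int16_t *b, intptr_t strideb)  // source block
    {
        for (int y = 0; y < H; y++)
        {
            std::memcpy(a, b, W * sizeof(int16_t));  // copy one row of W coefficients
            a += stridea;
            b += strideb;
        }
    }

The wider-vector paths express the same row copy with fewer iterations: a single predicated ld1h/st1h pair moves 16, 32, or 64 halfwords per row depending on the vector length selected by the rdvl check.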
View file
x265_3.6.tar.gz/source/common/aarch64/blockcopy8.S
Added
@@ -0,0 +1,1299 @@ +/***************************************************************************** + * Copyright (C) 2021 MulticoreWare, Inc + * + * Authors: Sebastian Pop <spop@amazon.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" +#include "blockcopy8-common.S" + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +/* void blockcopy_sp(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb) + * + * r0 - a + * r1 - stridea + * r2 - b + * r3 - strideb */ +function PFX(blockcopy_sp_4x4_neon) + lsl x3, x3, #1 +.rept 2 + ld1 {v0.8h}, x2, x3 + ld1 {v1.8h}, x2, x3 + xtn v0.8b, v0.8h + xtn v1.8b, v1.8h + st1 {v0.s}0, x0, x1 + st1 {v1.s}0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_8x8_neon) + lsl x3, x3, #1 +.rept 4 + ld1 {v0.8h}, x2, x3 + ld1 {v1.8h}, x2, x3 + xtn v0.8b, v0.8h + xtn v1.8b, v1.8h + st1 {v0.d}0, x0, x1 + st1 {v1.d}0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_16x16_neon) + lsl x3, x3, #1 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.rept 8 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + st1 {v0.16b}, x0, x1 + st1 {v1.16b}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_32x32_neon) + mov w12, #4 + lsl x3, x3, #1 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.loop_csp32: + sub w12, w12, #1 +.rept 4 + ld1 {v0.8h-v3.8h}, x2, x3 + ld1 {v4.8h-v7.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + tbl v2.16b, {v4.16b,v5.16b}, v31.16b + tbl v3.16b, {v6.16b,v7.16b}, v31.16b + st1 {v0.16b-v1.16b}, x0, x1 + st1 {v2.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_csp32 + ret +endfunc + +function PFX(blockcopy_sp_64x64_neon) + mov w12, #16 + lsl x3, x3, #1 + sub x3, x3, #64 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.loop_csp64: + sub w12, w12, #1 +.rept 4 + ld1 {v0.8h-v3.8h}, x2, #64 + ld1 {v4.8h-v7.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + tbl v2.16b, {v4.16b,v5.16b}, v31.16b + tbl v3.16b, {v6.16b,v7.16b}, v31.16b + st1 {v0.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_csp64 + ret +endfunc + +// void blockcopy_ps(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb) +function PFX(blockcopy_ps_4x4_neon) + lsl x1, x1, #1 +.rept 2 + ld1 {v0.8b}, x2, x3 + ld1 {v1.8b}, x2, x3 + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + st1 {v0.4h}, x0, x1 + st1 {v1.4h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ps_8x8_neon) + lsl x1, x1, #1 +.rept 4 + ld1 {v0.8b}, x2, x3 + ld1 {v1.8b}, x2, x3 + 
uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + st1 {v0.8h}, x0, x1 + st1 {v1.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ps_16x16_neon) + lsl x1, x1, #1 +.rept 8 + ld1 {v4.16b}, x2, x3 + ld1 {v5.16b}, x2, x3 + uxtl v0.8h, v4.8b + uxtl2 v1.8h, v4.16b + uxtl v2.8h, v5.8b + uxtl2 v3.8h, v5.16b + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ps_32x32_neon) + lsl x1, x1, #1 + mov w12, #4 +.loop_cps32: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v17.16b}, x2, x3 + ld1 {v18.16b-v19.16b}, x2, x3 + uxtl v0.8h, v16.8b + uxtl2 v1.8h, v16.16b + uxtl v2.8h, v17.8b + uxtl2 v3.8h, v17.16b + uxtl v4.8h, v18.8b + uxtl2 v5.8h, v18.16b + uxtl v6.8h, v19.8b + uxtl2 v7.8h, v19.16b + st1 {v0.8h-v3.8h}, x0, x1 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_cps32 + ret +endfunc + +function PFX(blockcopy_ps_64x64_neon) + lsl x1, x1, #1 + sub x1, x1, #64 + mov w12, #16 +.loop_cps64: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v19.16b}, x2, x3 + uxtl v0.8h, v16.8b + uxtl2 v1.8h, v16.16b + uxtl v2.8h, v17.8b + uxtl2 v3.8h, v17.16b + uxtl v4.8h, v18.8b + uxtl2 v5.8h, v18.16b + uxtl v6.8h, v19.8b + uxtl2 v7.8h, v19.16b + st1 {v0.8h-v3.8h}, x0, #64 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_cps64 + ret +endfunc + +// void x265_blockcopy_ss(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb) +function PFX(blockcopy_ss_4x4_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 2 + ld1 {v0.8b}, x2, x3 + ld1 {v1.8b}, x2, x3 + st1 {v0.8b}, x0, x1 + st1 {v1.8b}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_8x8_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 4 + ld1 {v0.8h}, x2, x3 + ld1 {v1.8h}, x2, x3 + st1 {v0.8h}, x0, x1 + st1 {v1.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_16x16_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 8 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_32x32_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 + mov w12, #4 +.loop_css32: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8h-v3.8h}, x2, x3 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + cbnz w12, .loop_css32 + ret +endfunc + +function PFX(blockcopy_ss_64x64_neon) + lsl x1, x1, #1 + sub x1, x1, #64 + lsl x3, x3, #1 + sub x3, x3, #64 + mov w12, #8 +.loop_css64: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8h-v3.8h}, x2, #64 + ld1 {v4.8h-v7.8h}, x2, x3 + st1 {v0.8h-v3.8h}, x0, #64 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_css64 + ret +endfunc + +/******** Chroma blockcopy********/ +function PFX(blockcopy_ss_4x8_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 4 + ld1 {v0.8b}, x2, x3 + ld1 {v1.8b}, x2, x3 + st1 {v0.8b}, x0, x1 + st1 {v1.8b}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_8x16_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 8 + ld1 {v0.8h}, x2, x3 + ld1 {v1.8h}, x2, x3 + st1 {v0.8h}, x0, x1 + st1 {v1.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_16x32_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 +.rept 16 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ss_32x64_neon) + lsl x1, x1, #1 + lsl x3, x3, #1 + mov w12, #8 +.loop_css32x64: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8h-v3.8h}, x2, x3 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + cbnz w12, .loop_css32x64 + ret +endfunc + +// chroma blockcopy_ps +function PFX(blockcopy_ps_4x8_neon) + lsl x1, x1, #1 +.rept 4 + ld1 {v0.8b}, x2, x3 + ld1 {v1.8b}, x2, x3 + uxtl 
v0.8h, v0.8b + uxtl v1.8h, v1.8b + st1 {v0.4h}, x0, x1 + st1 {v1.4h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ps_8x16_neon) + lsl x1, x1, #1 +.rept 8 + ld1 {v0.8b}, x2, x3 + ld1 {v1.8b}, x2, x3 + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + st1 {v0.8h}, x0, x1 + st1 {v1.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ps_16x32_neon) + lsl x1, x1, #1 +.rept 16 + ld1 {v4.16b}, x2, x3 + ld1 {v5.16b}, x2, x3 + uxtl v0.8h, v4.8b + uxtl2 v1.8h, v4.16b + uxtl v2.8h, v5.8b + uxtl2 v3.8h, v5.16b + st1 {v0.8h-v1.8h}, x0, x1 + st1 {v2.8h-v3.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_ps_32x64_neon) + lsl x1, x1, #1 + mov w12, #8 +.loop_cps32x64: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v17.16b}, x2, x3 + ld1 {v18.16b-v19.16b}, x2, x3 + uxtl v0.8h, v16.8b + uxtl2 v1.8h, v16.16b + uxtl v2.8h, v17.8b + uxtl2 v3.8h, v17.16b + uxtl v4.8h, v18.8b + uxtl2 v5.8h, v18.16b + uxtl v6.8h, v19.8b + uxtl2 v7.8h, v19.16b + st1 {v0.8h-v3.8h}, x0, x1 + st1 {v4.8h-v7.8h}, x0, x1 +.endr + cbnz w12, .loop_cps32x64 + ret +endfunc + +// chroma blockcopy_sp +function PFX(blockcopy_sp_4x8_neon) + lsl x3, x3, #1 +.rept 4 + ld1 {v0.8h}, x2, x3 + ld1 {v1.8h}, x2, x3 + xtn v0.8b, v0.8h + xtn v1.8b, v1.8h + st1 {v0.s}0, x0, x1 + st1 {v1.s}0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_8x16_neon) + lsl x3, x3, #1 +.rept 8 + ld1 {v0.8h}, x2, x3 + ld1 {v1.8h}, x2, x3 + xtn v0.8b, v0.8h + xtn v1.8b, v1.8h + st1 {v0.d}0, x0, x1 + st1 {v1.d}0, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_16x32_neon) + lsl x3, x3, #1 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.rept 16 + ld1 {v0.8h-v1.8h}, x2, x3 + ld1 {v2.8h-v3.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + st1 {v0.16b}, x0, x1 + st1 {v1.16b}, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_sp_32x64_neon) + mov w12, #8 + lsl x3, x3, #1 + movrel x11, xtn_xtn2_table + ld1 {v31.16b}, x11 +.loop_csp32x64: + sub w12, w12, #1 +.rept 4 + ld1 {v0.8h-v3.8h}, x2, x3 + ld1 {v4.8h-v7.8h}, x2, x3 + tbl v0.16b, {v0.16b,v1.16b}, v31.16b + tbl v1.16b, {v2.16b,v3.16b}, v31.16b + tbl v2.16b, {v4.16b,v5.16b}, v31.16b + tbl v3.16b, {v6.16b,v7.16b}, v31.16b + st1 {v0.16b-v1.16b}, x0, x1 + st1 {v2.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_csp32x64 + ret +endfunc + +/* blockcopy_pp(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) */ + +function PFX(blockcopy_pp_2x4_neon) + ldrh w9, x2 + add x4, x1, x1 + add x14, x3, x3 + strh w9, x0 + ldrh w10, x2, x3 + add x5, x4, x1 + add x15, x14, x3 + strh w10, x0, x1 + ldrh w11, x2, x14 + strh w11, x0, x4 + ldrh w12, x2, x15 + strh w12, x0, x5 + ret +endfunc + +.macro blockcopy_pp_2xN_neon h +function PFX(blockcopy_pp_2x\h\()_neon) + add x4, x1, x1 + add x5, x4, x1 + add x6, x5, x1 + + add x14, x3, x3 + add x15, x14, x3 + add x16, x15, x3 + +.rept \h / 4 + ldrh w9, x2 + strh w9, x0 + ldrh w10, x2, x3 + strh w10, x0, x1 + ldrh w11, x2, x14 + strh w11, x0, x4 + ldrh w12, x2, x15 + strh w12, x0, x5 + add x2, x2, x16 + add x0, x0, x6 +.endr + ret +endfunc +.endm + +blockcopy_pp_2xN_neon 8 +blockcopy_pp_2xN_neon 16 + +function PFX(blockcopy_pp_4x2_neon) + ldr w9, x2 + str w9, x0 + ldr w10, x2, x3 + str w10, x0, x1 + ret +endfunc + +function PFX(blockcopy_pp_4x4_neon) + ldr w9, x2 + add x4, x1, x1 + add x14, x3, x3 + str w9, x0 + ldr w10, x2, x3 + add x5, x4, x1 + add x15, x14, x3 + str w10, x0, x1 + ldr w11, x2, x14 + str w11, x0, x4 + ldr w12, x2, x15 + str w12, x0, x5 + ret +endfunc + +.macro blockcopy_pp_4xN_neon h +function 
PFX(blockcopy_pp_4x\h\()_neon) + add x4, x1, x1 + add x5, x4, x1 + add x6, x5, x1 + + add x14, x3, x3 + add x15, x14, x3 + add x16, x15, x3 + +.rept \h / 4 + ldr w9, x2 + str w9, x0 + ldr w10, x2, x3 + str w10, x0, x1 + ldr w11, x2, x14 + str w11, x0, x4 + ldr w12, x2, x15 + str w12, x0, x5 + add x2, x2, x16 + add x0, x0, x6 +.endr + ret +endfunc +.endm + +blockcopy_pp_4xN_neon 8 +blockcopy_pp_4xN_neon 16 +blockcopy_pp_4xN_neon 32 + +.macro blockcopy_pp_6xN_neon h +function PFX(blockcopy_pp_6x\h\()_neon) + sub x1, x1, #4 +.rept \h + ld1 {v0.8b}, x2, x3 + st1 {v0.s}0, x0, #4 + st1 {v0.h}2, x0, x1 +.endr + ret +endfunc +.endm + +blockcopy_pp_6xN_neon 8 +blockcopy_pp_6xN_neon 16 + +.macro blockcopy_pp_8xN_neon h +function PFX(blockcopy_pp_8x\h\()_neon) +.rept \h + ld1 {v0.4h}, x2, x3 + st1 {v0.4h}, x0, x1 +.endr + ret +endfunc +.endm + +blockcopy_pp_8xN_neon 2 +blockcopy_pp_8xN_neon 4 +blockcopy_pp_8xN_neon 6 +blockcopy_pp_8xN_neon 8 +blockcopy_pp_8xN_neon 12 +blockcopy_pp_8xN_neon 16 +blockcopy_pp_8xN_neon 32 + +function PFX(blockcopy_pp_8x64_neon) + mov w12, #4 +.loop_pp_8x64: + sub w12, w12, #1 +.rept 16 + ld1 {v0.4h}, x2, x3 + st1 {v0.4h}, x0, x1 +.endr + cbnz w12, .loop_pp_8x64 + ret +endfunc + +.macro blockcopy_pp_16xN_neon h +function PFX(blockcopy_pp_16x\h\()_neon) +.rept \h + ld1 {v0.8h}, x2, x3 + st1 {v0.8h}, x0, x1 +.endr + ret +endfunc +.endm + +blockcopy_pp_16xN_neon 4 +blockcopy_pp_16xN_neon 8 +blockcopy_pp_16xN_neon 12 +blockcopy_pp_16xN_neon 16 + +.macro blockcopy_pp_16xN1_neon h +function PFX(blockcopy_pp_16x\h\()_neon) + mov w12, #\h / 8 +.loop_16x\h\(): +.rept 8 + ld1 {v0.8h}, x2, x3 + st1 {v0.8h}, x0, x1 +.endr + sub w12, w12, #1 + cbnz w12, .loop_16x\h + ret +endfunc +.endm + +blockcopy_pp_16xN1_neon 24 +blockcopy_pp_16xN1_neon 32 +blockcopy_pp_16xN1_neon 64 + +function PFX(blockcopy_pp_12x16_neon) + sub x1, x1, #8 +.rept 16 + ld1 {v0.16b}, x2, x3 + str d0, x0, #8 + st1 {v0.s}2, x0, x1 +.endr + ret +endfunc + +function PFX(blockcopy_pp_12x32_neon) + sub x1, x1, #8 + mov w12, #4 +.loop_pp_12x32: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b}, x2, x3 + str d0, x0, #8 + st1 {v0.s}2, x0, x1 +.endr + cbnz w12, .loop_pp_12x32 + ret +endfunc + +function PFX(blockcopy_pp_24x32_neon) + mov w12, #4 +.loop_24x32: + sub w12, w12, #1 +.rept 8 + ld1 {v0.8b-v2.8b}, x2, x3 + st1 {v0.8b-v2.8b}, x0, x1 +.endr + cbnz w12, .loop_24x32 + ret +endfunc + +function PFX(blockcopy_pp_24x64_neon) + mov w12, #4 +.loop_24x64: + sub w12, w12, #1 +.rept 16 + ld1 {v0.8b-v2.8b}, x2, x3 + st1 {v0.8b-v2.8b}, x0, x1 +.endr + cbnz w12, .loop_24x64 + ret +endfunc + +function PFX(blockcopy_pp_32x8_neon) +.rept 8 + ld1 {v0.16b-v1.16b}, x2, x3 + st1 {v0.16b-v1.16b}, x0, x1 +.endr + ret +endfunc + +.macro blockcopy_pp_32xN_neon h +function PFX(blockcopy_pp_32x\h\()_neon) + mov w12, #\h / 8 +.loop_32x\h\(): + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v1.16b}, x2, x3 + st1 {v0.16b-v1.16b}, x0, x1 +.endr + cbnz w12, .loop_32x\h + ret +endfunc +.endm + +blockcopy_pp_32xN_neon 16 +blockcopy_pp_32xN_neon 24 +blockcopy_pp_32xN_neon 32 +blockcopy_pp_32xN_neon 64 +blockcopy_pp_32xN_neon 48 + +function PFX(blockcopy_pp_48x64_neon) + mov w12, #8 +.loop_48x64: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v2.16b}, x2, x3 + st1 {v0.16b-v2.16b}, x0, x1 +.endr + cbnz w12, .loop_48x64 + ret +endfunc + +.macro blockcopy_pp_64xN_neon h +function PFX(blockcopy_pp_64x\h\()_neon) + mov w12, #\h / 4 +.loop_64x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v3.16b}, x2, x3 + st1 {v0.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_64x\h + ret 
+endfunc +.endm + +blockcopy_pp_64xN_neon 16 +blockcopy_pp_64xN_neon 32 +blockcopy_pp_64xN_neon 48 +blockcopy_pp_64xN_neon 64 + +// void x265_blockfill_s_neon(int16_t* dst, intptr_t dstride, int16_t val) +function PFX(blockfill_s_4x4_neon) + dup v0.4h, w2 + lsl x1, x1, #1 +.rept 4 + st1 {v0.4h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockfill_s_8x8_neon) + dup v0.8h, w2 + lsl x1, x1, #1 +.rept 8 + st1 {v0.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockfill_s_16x16_neon) + dup v0.8h, w2 + mov v1.16b, v0.16b + lsl x1, x1, #1 +.rept 16 + stp q0, q1, x0 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(blockfill_s_32x32_neon) + dup v0.8h, w2 + mov v1.16b, v0.16b + mov v2.16b, v0.16b + mov v3.16b, v0.16b + lsl x1, x1, #1 +.rept 32 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(blockfill_s_64x64_neon) + dup v0.8h, w2 + mov v1.16b, v0.16b + mov v2.16b, v0.16b + mov v3.16b, v0.16b + lsl x1, x1, #1 + sub x1, x1, #64 +.rept 64 + st1 {v0.8h-v3.8h}, x0, #64 + st1 {v0.8h-v3.8h}, x0, x1 +.endr + ret +endfunc + +// uint32_t copy_count(int16_t* coeff, const int16_t* residual, intptr_t resiStride) +function PFX(copy_cnt_4_neon) + lsl x2, x2, #1 + movi v4.8b, #0 +.rept 2 + ld1 {v0.8b}, x1, x2 + ld1 {v1.8b}, x1, x2 + stp d0, d1, x0, #16 + cmeq v0.4h, v0.4h, #0 + cmeq v1.4h, v1.4h, #0 + add v4.4h, v4.4h, v0.4h + add v4.4h, v4.4h, v1.4h +.endr + saddlv s4, v4.4h + fmov w12, s4 + add w0, w12, #16 + ret +endfunc + +function PFX(copy_cnt_8_neon) + lsl x2, x2, #1 + movi v4.8b, #0 +.rept 4 + ld1 {v0.16b}, x1, x2 + ld1 {v1.16b}, x1, x2 + stp q0, q1, x0, #32 + cmeq v0.8h, v0.8h, #0 + cmeq v1.8h, v1.8h, #0 + add v4.8h, v4.8h, v0.8h + add v4.8h, v4.8h, v1.8h +.endr + saddlv s4, v4.8h + fmov w12, s4 + add w0, w12, #64 + ret +endfunc + +function PFX(copy_cnt_16_neon) + lsl x2, x2, #1 + movi v4.8b, #0 +.rept 16 + ld1 {v0.16b-v1.16b}, x1, x2 + st1 {v0.16b-v1.16b}, x0, #32 + cmeq v0.8h, v0.8h, #0 + cmeq v1.8h, v1.8h, #0 + add v4.8h, v4.8h, v0.8h + add v4.8h, v4.8h, v1.8h +.endr + saddlv s4, v4.8h + fmov w12, s4 + add w0, w12, #256 + ret +endfunc + +function PFX(copy_cnt_32_neon) + lsl x2, x2, #1 + movi v4.8b, #0 +.rept 32 + ld1 {v0.16b-v3.16b}, x1, x2 + st1 {v0.16b-v3.16b}, x0, #64 + cmeq v0.8h, v0.8h, #0 + cmeq v1.8h, v1.8h, #0 + cmeq v2.8h, v2.8h, #0 + cmeq v3.8h, v3.8h, #0 + add v0.8h, v0.8h, v1.8h + add v2.8h, v2.8h, v3.8h + add v4.8h, v4.8h, v0.8h + add v4.8h, v4.8h, v2.8h +.endr + saddlv s4, v4.8h + fmov w12, s4 + add w0, w12, #1024 + ret +endfunc + +// int count_nonzero_c(const int16_t* quantCoeff) +function PFX(count_nonzero_4_neon) + movi v16.16b, #1 + movi v17.16b, #0 + trn1 v16.16b, v16.16b, v17.16b + ldp q0, q1, x0 + cmhi v0.8h, v0.8h, v17.8h + cmhi v1.8h, v1.8h, v17.8h + and v0.16b, v0.16b, v16.16b + and v1.16b, v1.16b, v16.16b + add v0.8h, v0.8h, v1.8h + uaddlv s0, v0.8h + fmov w0, s0 + ret +endfunc + +.macro COUNT_NONZERO_8 + ld1 {v0.16b-v3.16b}, x0, #64 + ld1 {v4.16b-v7.16b}, x0, #64 + cmhi v0.8h, v0.8h, v17.8h + cmhi v1.8h, v1.8h, v17.8h + cmhi v2.8h, v2.8h, v17.8h + cmhi v3.8h, v3.8h, v17.8h + cmhi v4.8h, v4.8h, v17.8h + cmhi v5.8h, v5.8h, v17.8h + cmhi v6.8h, v6.8h, v17.8h + cmhi v7.8h, v7.8h, v17.8h + and v0.16b, v0.16b, v16.16b + and v1.16b, v1.16b, v16.16b + and v2.16b, v2.16b, v16.16b + and v3.16b, v3.16b, v16.16b + and v4.16b, v4.16b, v16.16b + and v5.16b, v5.16b, v16.16b + and v6.16b, v6.16b, v16.16b + and v7.16b, v7.16b, v16.16b + add v0.8h, v0.8h, v1.8h + add v2.8h, v2.8h, v3.8h + add v4.8h, v4.8h, v5.8h + add v6.8h, v6.8h, v7.8h + add v0.8h, v0.8h, v2.8h + 
add v4.8h, v4.8h, v6.8h + add v0.8h, v0.8h, v4.8h +.endm + +function PFX(count_nonzero_8_neon) + movi v16.16b, #1 + movi v17.16b, #0 + trn1 v16.16b, v16.16b, v17.16b + COUNT_NONZERO_8 + uaddlv s0, v0.8h + fmov w0, s0 + ret +endfunc + +function PFX(count_nonzero_16_neon) + movi v16.16b, #1 + movi v17.16b, #0 + trn1 v16.16b, v16.16b, v17.16b + movi v18.16b, #0 +.rept 4 + COUNT_NONZERO_8 + add v18.16b, v18.16b, v0.16b +.endr + uaddlv s0, v18.8h + fmov w0, s0 + ret +endfunc + +function PFX(count_nonzero_32_neon) + movi v16.16b, #1 + movi v17.16b, #0 + trn1 v16.16b, v16.16b, v17.16b + movi v18.16b, #0 + mov w12, #16 +.loop_count_nonzero_32: + sub w12, w12, #1 + COUNT_NONZERO_8 + add v18.16b, v18.16b, v0.16b + cbnz w12, .loop_count_nonzero_32 + + uaddlv s0, v18.8h + fmov w0, s0 + ret +endfunc + +// void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift) +.macro cpy2Dto1D_shl_start + add x2, x2, x2 + dup v0.8h, w3 +.endm + +function PFX(cpy2Dto1D_shl_4x4_neon) + cpy2Dto1D_shl_start + ld1 {v2.d}0, x1, x2 + ld1 {v2.d}1, x1, x2 + ld1 {v3.d}0, x1, x2 + ld1 {v3.d}1, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b-v3.16b}, x0 + ret +endfunc + +function PFX(cpy2Dto1D_shl_8x8_neon) + cpy2Dto1D_shl_start +.rept 4 + ld1 {v2.16b}, x1, x2 + ld1 {v3.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b-v3.16b}, x0, #32 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shl_16x16_neon) + cpy2Dto1D_shl_start + mov w12, #4 +.loop_cpy2Dto1D_shl_16: + sub w12, w12, #1 +.rept 4 + ld1 {v2.16b-v3.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b-v3.16b}, x0, #32 +.endr + cbnz w12, .loop_cpy2Dto1D_shl_16 + ret +endfunc + +function PFX(cpy2Dto1D_shl_32x32_neon) + cpy2Dto1D_shl_start + mov w12, #16 +.loop_cpy2Dto1D_shl_32: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, #64 +.endr + cbnz w12, .loop_cpy2Dto1D_shl_32 + ret +endfunc + +function PFX(cpy2Dto1D_shl_64x64_neon) + cpy2Dto1D_shl_start + mov w12, #32 + sub x2, x2, #64 +.loop_cpy2Dto1D_shl_64: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + ld1 {v16.16b-v19.16b}, x1, x2 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + sshl v16.8h, v16.8h, v0.8h + sshl v17.8h, v17.8h, v0.8h + sshl v18.8h, v18.8h, v0.8h + sshl v19.8h, v19.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, #64 + st1 {v16.16b-v19.16b}, x0, #64 +.endr + cbnz w12, .loop_cpy2Dto1D_shl_64 + ret +endfunc + +// void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift) +function PFX(cpy2Dto1D_shr_4x4_neon) + cpy2Dto1D_shr_start + ld1 {v2.d}0, x1, x2 + ld1 {v2.d}1, x1, x2 + ld1 {v3.d}0, x1, x2 + ld1 {v3.d}1, x1, x2 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + stp q2, q3, x0 + ret +endfunc + +function PFX(cpy2Dto1D_shr_8x8_neon) + cpy2Dto1D_shr_start +.rept 4 + ld1 {v2.16b}, x1, x2 + ld1 {v3.16b}, x1, x2 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + stp q2, q3, x0, #32 +.endr + ret +endfunc + +function PFX(cpy2Dto1D_shr_16x16_neon) + cpy2Dto1D_shr_start + mov w12, #4 +.loop_cpy2Dto1D_shr_16: + sub w12, w12, #1 +.rept 4 + ld1 {v2.8h-v3.8h}, x1, x2 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + 
st1 {v2.8h-v3.8h}, x0, #32 +.endr + cbnz w12, .loop_cpy2Dto1D_shr_16 + ret +endfunc + +function PFX(cpy2Dto1D_shr_32x32_neon) + cpy2Dto1D_shr_start + mov w12, #16 +.loop_cpy2Dto1D_shr_32: + sub w12, w12, #1 +.rept 2 + ld1 {v2.8h-v5.8h}, x1, x2 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sub v4.8h, v4.8h, v1.8h + sub v5.8h, v5.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.8h-v5.8h}, x0, #64 +.endr + cbnz w12, .loop_cpy2Dto1D_shr_32 + ret +endfunc + +// void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) +.macro cpy1Dto2D_shl_start + add x2, x2, x2 + dup v0.8h, w3 +.endm + +function PFX(cpy1Dto2D_shl_4x4_neon) + cpy1Dto2D_shl_start + ld1 {v2.16b-v3.16b}, x1 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.d}0, x0, x2 + st1 {v2.d}1, x0, x2 + st1 {v3.d}0, x0, x2 + st1 {v3.d}1, x0, x2 + ret +endfunc + +function PFX(cpy1Dto2D_shl_8x8_neon) + cpy1Dto2D_shl_start +.rept 4 + ld1 {v2.16b-v3.16b}, x1, #32 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b}, x0, x2 + st1 {v3.16b}, x0, x2 +.endr + ret +endfunc + +function PFX(cpy1Dto2D_shl_16x16_neon) + cpy1Dto2D_shl_start + mov w12, #4 +.loop_cpy1Dto2D_shl_16: + sub w12, w12, #1 +.rept 4 + ld1 {v2.16b-v3.16b}, x1, #32 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b-v3.16b}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shl_16 + ret +endfunc + +function PFX(cpy1Dto2D_shl_32x32_neon) + cpy1Dto2D_shl_start + mov w12, #16 +.loop_cpy1Dto2D_shl_32: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shl_32 + ret +endfunc + +function PFX(cpy1Dto2D_shl_64x64_neon) + cpy1Dto2D_shl_start + mov w12, #32 + sub x2, x2, #64 +.loop_cpy1Dto2D_shl_64: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + ld1 {v16.16b-v19.16b}, x1, #64 + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + sshl v16.8h, v16.8h, v0.8h + sshl v17.8h, v17.8h, v0.8h + sshl v18.8h, v18.8h, v0.8h + sshl v19.8h, v19.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, #64 + st1 {v16.16b-v19.16b}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shl_64 + ret +endfunc + +function PFX(cpy1Dto2D_shr_4x4_neon) + cpy1Dto2D_shr_start + ld1 {v2.16b-v3.16b}, x1 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.d}0, x0, x2 + st1 {v2.d}1, x0, x2 + st1 {v3.d}0, x0, x2 + st1 {v3.d}1, x0, x2 + ret +endfunc + +function PFX(cpy1Dto2D_shr_8x8_neon) + cpy1Dto2D_shr_start +.rept 4 + ld1 {v2.16b-v3.16b}, x1, #32 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.16b}, x0, x2 + st1 {v3.16b}, x0, x2 +.endr + ret +endfunc + +function PFX(cpy1Dto2D_shr_16x16_neon) + cpy1Dto2D_shr_start + mov w12, #4 +.loop_cpy1Dto2D_shr_16: + sub w12, w12, #1 +.rept 4 + ld1 {v2.8h-v3.8h}, x1, #32 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + st1 {v2.8h-v3.8h}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shr_16 + ret +endfunc + +function PFX(cpy1Dto2D_shr_32x32_neon) + cpy1Dto2D_shr_start + mov w12, #16 +.loop_cpy1Dto2D_shr_32: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sub v4.8h, v4.8h, v1.8h + 
sub v5.8h, v5.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shr_32 + ret +endfunc + +function PFX(cpy1Dto2D_shr_64x64_neon) + cpy1Dto2D_shr_start + mov w12, #32 + sub x2, x2, #64 +.loop_cpy1Dto2D_shr_64: + sub w12, w12, #1 +.rept 2 + ld1 {v2.16b-v5.16b}, x1, #64 + ld1 {v16.16b-v19.16b}, x1, #64 + sub v2.8h, v2.8h, v1.8h + sub v3.8h, v3.8h, v1.8h + sub v4.8h, v4.8h, v1.8h + sub v5.8h, v5.8h, v1.8h + sub v16.8h, v16.8h, v1.8h + sub v17.8h, v17.8h, v1.8h + sub v18.8h, v18.8h, v1.8h + sub v19.8h, v19.8h, v1.8h + sshl v2.8h, v2.8h, v0.8h + sshl v3.8h, v3.8h, v0.8h + sshl v4.8h, v4.8h, v0.8h + sshl v5.8h, v5.8h, v0.8h + sshl v16.8h, v16.8h, v0.8h + sshl v17.8h, v17.8h, v0.8h + sshl v18.8h, v18.8h, v0.8h + sshl v19.8h, v19.8h, v0.8h + st1 {v2.16b-v5.16b}, x0, #64 + st1 {v16.16b-v19.16b}, x0, x2 +.endr + cbnz w12, .loop_cpy1Dto2D_shr_64 + ret +endfunc
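Besides the plain copies, the NEON file above implements the shift-copy helpers (cpy2Dto1D_shl/shr and cpy1Dto2D_shl/shr) that move coefficients between a strided 2D block and a packed 1D buffer while applying either a left shift or a rounded right shift. As a reading aid, a plausible scalar equivalent of cpy2Dto1D_shr is sketched below; it follows the SVE variant earlier in this diff, which adds 1 << (shift - 1) before the arithmetic shift. The name cpy2Dto1D_shr_ref and the size template parameter are assumptions for illustration.

    // Plausible scalar equivalent of cpy2Dto1D_shr_NxN (illustrative sketch).
    // Reads an N x N block with srcStride given in elements, applies a rounded
    // arithmetic right shift, and writes a densely packed N*N output buffer.
    #include <cstdint>

    template <int N>
    void cpy2Dto1D_shr_ref(int16_t *dst, const int16_t *src, intptr_t srcStride, int shift)
    {
        const int16_t round = int16_t(1 << (shift - 1));
        for (int y = 0; y < N; y++)
        {
            for (int x = 0; x < N; x++)
                dst[x] = int16_t((src[x] + round) >> shift);
            src += srcStride;
            dst += N;   // destination is packed, hence "2D to 1D"
        }
    }

cpy2Dto1D_shl performs the same walk with dst[x] = src[x] << shift and no rounding term, and the cpy1Dto2D variants swap which side carries the stride.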
View file
x265_3.6.tar.gz/source/common/aarch64/dct-prim.cpp
Added
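The file below ports several DCT-related primitives to NEON intrinsics: scanPosLast, count_nonzero, copy_count, the forward partial butterflies (8/16/32), the inverse butterflies (4/16/32), and the psy/non-psy RDO quantization cost helpers. One detail worth calling out in the counting routines: vtstq_s16(in, in) sets every non-zero lane to 0xFFFF (that is, -1 as int16_t), so accumulating those lanes and negating the horizontal sum yields the non-zero count, which is why the functions return count - vaddvq_s16(vcount) and numSig - vaddvq_s16(vcount). A scalar sketch of copy_count, matching the scalar tail loop kept in the intrinsics version, is shown here; the name copy_count_ref is a hypothetical reference, not part of the file.

    // Hypothetical scalar reference for copy_count: copy a trSize x trSize residual
    // block into the coefficient buffer and return the number of non-zero values.
    #include <cstdint>

    template <int trSize>
    uint32_t copy_count_ref(int16_t *coeff, const int16_t *residual, intptr_t resiStride)
    {
        uint32_t numSig = 0;
        for (int y = 0; y < trSize; y++)
        {
            for (int x = 0; x < trSize; x++)
            {
                coeff[x] = residual[x];         // straight copy of the residual row
                numSig += (residual[x] != 0);   // count significant coefficients
            }
            residual += resiStride;
            coeff += trSize;
        }
        return numSig;
    }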
@@ -0,0 +1,948 @@ +#include "dct-prim.h" + + +#if HAVE_NEON + +#include <arm_neon.h> + + +namespace +{ +using namespace X265_NS; + + +static int16x8_t rev16(const int16x8_t a) +{ + static const int8x16_t tbl = {14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1}; + return vqtbx1q_u8(a, a, tbl); +} + +static int32x4_t rev32(const int32x4_t a) +{ + static const int8x16_t tbl = {12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3}; + return vqtbx1q_u8(a, a, tbl); +} + +static void transpose_4x4x16(int16x4_t &x0, int16x4_t &x1, int16x4_t &x2, int16x4_t &x3) +{ + int16x4_t s0, s1, s2, s3; + s0 = vtrn1_s32(x0, x2); + s1 = vtrn1_s32(x1, x3); + s2 = vtrn2_s32(x0, x2); + s3 = vtrn2_s32(x1, x3); + + x0 = vtrn1_s16(s0, s1); + x1 = vtrn2_s16(s0, s1); + x2 = vtrn1_s16(s2, s3); + x3 = vtrn2_s16(s2, s3); +} + + + +static int scanPosLast_opt(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, + uint8_t *coeffNum, int numSig, const uint16_t * /*scanCG4x4*/, const int /*trSize*/) +{ + + // This is an optimized function for scanPosLast, which removes the rmw dependency, once integrated into mainline x265, should replace reference implementation + // For clarity, left the original reference code in comments + int scanPosLast = 0; + + uint16_t cSign = 0; + uint16_t cFlag = 0; + uint8_t cNum = 0; + + uint32_t prevcgIdx = 0; + do + { + const uint32_t cgIdx = (uint32_t)scanPosLast >> MLS_CG_SIZE; + + const uint32_t posLast = scanscanPosLast; + + const int curCoeff = coeffposLast; + const uint32_t isNZCoeff = (curCoeff != 0); + /* + NOTE: the new algorithm is complicated, so I keep reference code here + uint32_t posy = posLast >> log2TrSize; + uint32_t posx = posLast - (posy << log2TrSize); + uint32_t blkIdx0 = ((posy >> MLS_CG_LOG2_SIZE) << codingParameters.log2TrSizeCG) + (posx >> MLS_CG_LOG2_SIZE); + const uint32_t blkIdx = ((posLast >> (2 * MLS_CG_LOG2_SIZE)) & ~maskPosXY) + ((posLast >> MLS_CG_LOG2_SIZE) & maskPosXY); + sigCoeffGroupFlag64 |= ((uint64_t)isNZCoeff << blkIdx); + */ + + // get L1 sig map + numSig -= isNZCoeff; + + if (scanPosLast % (1 << MLS_CG_SIZE) == 0) + { + coeffSignprevcgIdx = cSign; + coeffFlagprevcgIdx = cFlag; + coeffNumprevcgIdx = cNum; + cSign = 0; + cFlag = 0; + cNum = 0; + } + // TODO: optimize by instruction BTS + cSign += (uint16_t)(((curCoeff < 0) ? 
1 : 0) << cNum); + cFlag = (cFlag << 1) + (uint16_t)isNZCoeff; + cNum += (uint8_t)isNZCoeff; + prevcgIdx = cgIdx; + scanPosLast++; + } + while (numSig > 0); + + coeffSignprevcgIdx = cSign; + coeffFlagprevcgIdx = cFlag; + coeffNumprevcgIdx = cNum; + return scanPosLast - 1; +} + + +#if (MLS_CG_SIZE == 4) +template<int log2TrSize> +static void nonPsyRdoQuant_neon(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, + int64_t *totalRdCost, uint32_t blkPos) +{ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - + log2TrSize; /* Represents scaling through forward transform */ + const int scaleBits = SCALE_BITS - 2 * transformShift; + const uint32_t trSize = 1 << log2TrSize; + + int64x2_t vcost_sum_0 = vdupq_n_s64(0); + int64x2_t vcost_sum_1 = vdupq_n_s64(0); + for (int y = 0; y < MLS_CG_SIZE; y++) + { + int16x4_t in = *(int16x4_t *)&m_resiDctCoeffblkPos; + int32x4_t mul = vmull_s16(in, in); + int64x2_t cost0, cost1; + cost0 = vshll_n_s32(vget_low_s32(mul), scaleBits); + cost1 = vshll_high_n_s32(mul, scaleBits); + *(int64x2_t *)&costUncodedblkPos + 0 = cost0; + *(int64x2_t *)&costUncodedblkPos + 2 = cost1; + vcost_sum_0 = vaddq_s64(vcost_sum_0, cost0); + vcost_sum_1 = vaddq_s64(vcost_sum_1, cost1); + blkPos += trSize; + } + int64_t sum = vaddvq_s64(vaddq_s64(vcost_sum_0, vcost_sum_1)); + *totalUncodedCost += sum; + *totalRdCost += sum; +} + +template<int log2TrSize> +static void psyRdoQuant_neon(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, + int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +{ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - + log2TrSize; /* Represents scaling through forward transform */ + const int scaleBits = SCALE_BITS - 2 * transformShift; + const uint32_t trSize = 1 << log2TrSize; + //using preprocessor to bypass clang bug + const int max = X265_MAX(0, (2 * transformShift + 1)); + + int64x2_t vcost_sum_0 = vdupq_n_s64(0); + int64x2_t vcost_sum_1 = vdupq_n_s64(0); + int32x4_t vpsy = vdupq_n_s32(*psyScale); + for (int y = 0; y < MLS_CG_SIZE; y++) + { + int32x4_t signCoef = vmovl_s16(*(int16x4_t *)&m_resiDctCoeffblkPos); + int32x4_t predictedCoef = vsubq_s32(vmovl_s16(*(int16x4_t *)&m_fencDctCoeffblkPos), signCoef); + int64x2_t cost0, cost1; + cost0 = vmull_s32(vget_low_s32(signCoef), vget_low_s32(signCoef)); + cost1 = vmull_high_s32(signCoef, signCoef); + cost0 = vshlq_n_s64(cost0, scaleBits); + cost1 = vshlq_n_s64(cost1, scaleBits); + int64x2_t neg0 = vmull_s32(vget_low_s32(predictedCoef), vget_low_s32(vpsy)); + int64x2_t neg1 = vmull_high_s32(predictedCoef, vpsy); + if (max > 0) + { + int64x2_t shift = vdupq_n_s64(-max); + neg0 = vshlq_s64(neg0, shift); + neg1 = vshlq_s64(neg1, shift); + } + cost0 = vsubq_s64(cost0, neg0); + cost1 = vsubq_s64(cost1, neg1); + *(int64x2_t *)&costUncodedblkPos + 0 = cost0; + *(int64x2_t *)&costUncodedblkPos + 2 = cost1; + vcost_sum_0 = vaddq_s64(vcost_sum_0, cost0); + vcost_sum_1 = vaddq_s64(vcost_sum_1, cost1); + + blkPos += trSize; + } + int64_t sum = vaddvq_s64(vaddq_s64(vcost_sum_0, vcost_sum_1)); + *totalUncodedCost += sum; + *totalRdCost += sum; +} + +#else +#error "MLS_CG_SIZE must be 4 for neon version" +#endif + + + +template<int trSize> +int count_nonzero_neon(const int16_t *quantCoeff) +{ + X265_CHECK(((intptr_t)quantCoeff & 15) == 0, "quant buffer not aligned\n"); + int count = 0; + int16x8_t vcount = vdupq_n_s16(0); + const int numCoeff = trSize * trSize; + int i = 0; + for (; (i + 8) <= numCoeff; i += 8) + { + int16x8_t 
in = *(int16x8_t *)&quantCoeffi; + vcount = vaddq_s16(vcount, vtstq_s16(in, in)); + } + for (; i < numCoeff; i++) + { + count += quantCoeffi != 0; + } + + return count - vaddvq_s16(vcount); +} + +template<int trSize> +uint32_t copy_count_neon(int16_t *coeff, const int16_t *residual, intptr_t resiStride) +{ + uint32_t numSig = 0; + int16x8_t vcount = vdupq_n_s16(0); + for (int k = 0; k < trSize; k++) + { + int j = 0; + for (; (j + 8) <= trSize; j += 8) + { + int16x8_t in = *(int16x8_t *)&residualj; + *(int16x8_t *)&coeffj = in; + vcount = vaddq_s16(vcount, vtstq_s16(in, in)); + } + for (; j < trSize; j++) + { + coeffj = residualj; + numSig += (residualj != 0); + } + residual += resiStride; + coeff += trSize; + } + + return numSig - vaddvq_s16(vcount); +} + + +static void partialButterfly16(const int16_t *src, int16_t *dst, int shift, int line) +{ + int j, k; + int32x4_t E2, O2; + int32x4_t EE, EO; + int32x2_t EEE, EEO; + const int add = 1 << (shift - 1); + const int32x4_t _vadd = {add, 0}; + + for (j = 0; j < line; j++) + { + int16x8_t in0 = *(int16x8_t *)src; + int16x8_t in1 = rev16(*(int16x8_t *)&src8); + + E0 = vaddl_s16(vget_low_s16(in0), vget_low_s16(in1)); + O0 = vsubl_s16(vget_low_s16(in0), vget_low_s16(in1)); + E1 = vaddl_high_s16(in0, in1); + O1 = vsubl_high_s16(in0, in1); + + for (k = 1; k < 16; k += 2) + { + int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t16k0); + int32x4_t c1 = vmovl_s16(*(int16x4_t *)&g_t16k4); + + int32x4_t res = _vadd; + res = vmlaq_s32(res, c0, O0); + res = vmlaq_s32(res, c1, O1); + dstk * line = (int16_t)(vaddvq_s32(res) >> shift); + } + + /* EE and EO */ + EE = vaddq_s32(E0, rev32(E1)); + EO = vsubq_s32(E0, rev32(E1)); + + for (k = 2; k < 16; k += 4) + { + int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t16k0); + int32x4_t res = _vadd; + res = vmlaq_s32(res, c0, EO); + dstk * line = (int16_t)(vaddvq_s32(res) >> shift); + } + + /* EEE and EEO */ + EEE0 = EE0 + EE3; + EEO0 = EE0 - EE3; + EEE1 = EE1 + EE2; + EEO1 = EE1 - EE2; + + dst0 = (int16_t)((g_t1600 * EEE0 + g_t1601 * EEE1 + add) >> shift); + dst8 * line = (int16_t)((g_t1680 * EEE0 + g_t1681 * EEE1 + add) >> shift); + dst4 * line = (int16_t)((g_t1640 * EEO0 + g_t1641 * EEO1 + add) >> shift); + dst12 * line = (int16_t)((g_t16120 * EEO0 + g_t16121 * EEO1 + add) >> shift); + + + src += 16; + dst++; + } +} + + +static void partialButterfly32(const int16_t *src, int16_t *dst, int shift, int line) +{ + int j, k; + const int add = 1 << (shift - 1); + + + for (j = 0; j < line; j++) + { + int32x4_t VE4, VO0, VO1, VO2, VO3; + int32x4_t VEE2, VEO2; + int32x4_t VEEE, VEEO; + int EEEE2, EEEO2; + + int16x8x4_t inputs; + inputs = *(int16x8x4_t *)&src0; + int16x8x4_t in_rev; + + in_rev.val1 = rev16(inputs.val2); + in_rev.val0 = rev16(inputs.val3); + + VE0 = vaddl_s16(vget_low_s16(inputs.val0), vget_low_s16(in_rev.val0)); + VE1 = vaddl_high_s16(inputs.val0, in_rev.val0); + VO0 = vsubl_s16(vget_low_s16(inputs.val0), vget_low_s16(in_rev.val0)); + VO1 = vsubl_high_s16(inputs.val0, in_rev.val0); + VE2 = vaddl_s16(vget_low_s16(inputs.val1), vget_low_s16(in_rev.val1)); + VE3 = vaddl_high_s16(inputs.val1, in_rev.val1); + VO2 = vsubl_s16(vget_low_s16(inputs.val1), vget_low_s16(in_rev.val1)); + VO3 = vsubl_high_s16(inputs.val1, in_rev.val1); + + for (k = 1; k < 32; k += 2) + { + int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t32k0); + int32x4_t c1 = vmovl_s16(*(int16x4_t *)&g_t32k4); + int32x4_t c2 = vmovl_s16(*(int16x4_t *)&g_t32k8); + int32x4_t c3 = vmovl_s16(*(int16x4_t *)&g_t32k12); + int32x4_t s = vmulq_s32(c0, VO0); + s = vmlaq_s32(s, c1, 
VO1); + s = vmlaq_s32(s, c2, VO2); + s = vmlaq_s32(s, c3, VO3); + + dstk * line = (int16_t)((vaddvq_s32(s) + add) >> shift); + + } + + int32x4_t rev_VE2; + + + rev_VE0 = rev32(VE3); + rev_VE1 = rev32(VE2); + + /* EE and EO */ + for (k = 0; k < 2; k++) + { + VEEk = vaddq_s32(VEk, rev_VEk); + VEOk = vsubq_s32(VEk, rev_VEk); + } + for (k = 2; k < 32; k += 4) + { + int32x4_t c0 = vmovl_s16(*(int16x4_t *)&g_t32k0); + int32x4_t c1 = vmovl_s16(*(int16x4_t *)&g_t32k4); + int32x4_t s = vmulq_s32(c0, VEO0); + s = vmlaq_s32(s, c1, VEO1); + + dstk * line = (int16_t)((vaddvq_s32(s) + add) >> shift); + + } + + int32x4_t tmp = rev32(VEE1); + VEEE = vaddq_s32(VEE0, tmp); + VEEO = vsubq_s32(VEE0, tmp); + for (k = 4; k < 32; k += 8) + { + int32x4_t c = vmovl_s16(*(int16x4_t *)&g_t32k0); + int32x4_t s = vmulq_s32(c, VEEO); + + dstk * line = (int16_t)((vaddvq_s32(s) + add) >> shift); + } + + /* EEEE and EEEO */ + EEEE0 = VEEE0 + VEEE3; + EEEO0 = VEEE0 - VEEE3; + EEEE1 = VEEE1 + VEEE2; + EEEO1 = VEEE1 - VEEE2; + + dst0 = (int16_t)((g_t3200 * EEEE0 + g_t3201 * EEEE1 + add) >> shift); + dst16 * line = (int16_t)((g_t32160 * EEEE0 + g_t32161 * EEEE1 + add) >> shift); + dst8 * line = (int16_t)((g_t3280 * EEEO0 + g_t3281 * EEEO1 + add) >> shift); + dst24 * line = (int16_t)((g_t32240 * EEEO0 + g_t32241 * EEEO1 + add) >> shift); + + + + src += 32; + dst++; + } +} + +static void partialButterfly8(const int16_t *src, int16_t *dst, int shift, int line) +{ + int j, k; + int E4, O4; + int EE2, EO2; + int add = 1 << (shift - 1); + + for (j = 0; j < line; j++) + { + /* E and O*/ + for (k = 0; k < 4; k++) + { + Ek = srck + src7 - k; + Ok = srck - src7 - k; + } + + /* EE and EO */ + EE0 = E0 + E3; + EO0 = E0 - E3; + EE1 = E1 + E2; + EO1 = E1 - E2; + + dst0 = (int16_t)((g_t800 * EE0 + g_t801 * EE1 + add) >> shift); + dst4 * line = (int16_t)((g_t840 * EE0 + g_t841 * EE1 + add) >> shift); + dst2 * line = (int16_t)((g_t820 * EO0 + g_t821 * EO1 + add) >> shift); + dst6 * line = (int16_t)((g_t860 * EO0 + g_t861 * EO1 + add) >> shift); + + dstline = (int16_t)((g_t810 * O0 + g_t811 * O1 + g_t812 * O2 + g_t813 * O3 + add) >> shift); + dst3 * line = (int16_t)((g_t830 * O0 + g_t831 * O1 + g_t832 * O2 + g_t833 * O3 + add) >> + shift); + dst5 * line = (int16_t)((g_t850 * O0 + g_t851 * O1 + g_t852 * O2 + g_t853 * O3 + add) >> + shift); + dst7 * line = (int16_t)((g_t870 * O0 + g_t871 * O1 + g_t872 * O2 + g_t873 * O3 + add) >> + shift); + + src += 8; + dst++; + } +} + +static void partialButterflyInverse4(const int16_t *src, int16_t *dst, int shift, int line) +{ + int j; + int E2, O2; + int add = 1 << (shift - 1); + + for (j = 0; j < line; j++) + { + /* Utilizing symmetry properties to the maximum to minimize the number of multiplications */ + O0 = g_t410 * srcline + g_t430 * src3 * line; + O1 = g_t411 * srcline + g_t431 * src3 * line; + E0 = g_t400 * src0 + g_t420 * src2 * line; + E1 = g_t401 * src0 + g_t421 * src2 * line; + + /* Combining even and odd terms at each hierarchy levels to calculate the final spatial domain vector */ + dst0 = (int16_t)(x265_clip3(-32768, 32767, (E0 + O0 + add) >> shift)); + dst1 = (int16_t)(x265_clip3(-32768, 32767, (E1 + O1 + add) >> shift)); + dst2 = (int16_t)(x265_clip3(-32768, 32767, (E1 - O1 + add) >> shift)); + dst3 = (int16_t)(x265_clip3(-32768, 32767, (E0 - O0 + add) >> shift)); + + src++; + dst += 4; + } +} + + + +static void partialButterflyInverse16_neon(const int16_t *src, int16_t *orig_dst, int shift, int line) +{ +#define FMAK(x,l) sl = vmlal_lane_s16(sl,*(int16x4_t*)&src(x)*line,*(int16x4_t 
*)&g_t16xk,l) +#define MULK(x,l) vmull_lane_s16(*(int16x4_t*)&srcx*line,*(int16x4_t *)&g_t16xk,l); +#define ODD3_15(k) FMAK(3,k);FMAK(5,k);FMAK(7,k);FMAK(9,k);FMAK(11,k);FMAK(13,k);FMAK(15,k); +#define EVEN6_14_STEP4(k) FMAK(6,k);FMAK(10,k);FMAK(14,k); + + + int j, k; + int32x4_t E8, O8; + int32x4_t EE4, EO4; + int32x4_t EEE2, EEO2; + const int add = 1 << (shift - 1); + + +#pragma unroll(4) + for (j = 0; j < line; j += 4) + { + /* Utilizing symmetry properties to the maximum to minimize the number of multiplications */ + +#pragma unroll(2) + for (k = 0; k < 2; k++) + { + int32x4_t s; + s = vmull_s16(vdup_n_s16(g_t164k), *(int16x4_t *)&src4 * line);; + EEOk = vmlal_s16(s, vdup_n_s16(g_t1612k), *(int16x4_t *)&src(12) * line); + s = vmull_s16(vdup_n_s16(g_t160k), *(int16x4_t *)&src0 * line);; + EEEk = vmlal_s16(s, vdup_n_s16(g_t168k), *(int16x4_t *)&src(8) * line); + } + + /* Combining even and odd terms at each hierarchy levels to calculate the final spatial domain vector */ + EE0 = vaddq_s32(EEE0 , EEO0); + EE2 = vsubq_s32(EEE1 , EEO1); + EE1 = vaddq_s32(EEE1 , EEO1); + EE3 = vsubq_s32(EEE0 , EEO0); + + +#pragma unroll(1) + for (k = 0; k < 4; k += 4) + { + int32x4_t s4; + s0 = MULK(2, 0); + s1 = MULK(2, 1); + s2 = MULK(2, 2); + s3 = MULK(2, 3); + + EVEN6_14_STEP4(0); + EVEN6_14_STEP4(1); + EVEN6_14_STEP4(2); + EVEN6_14_STEP4(3); + + EOk = s0; + EOk + 1 = s1; + EOk + 2 = s2; + EOk + 3 = s3; + } + + + + static const int32x4_t min = vdupq_n_s32(-32768); + static const int32x4_t max = vdupq_n_s32(32767); + const int32x4_t minus_shift = vdupq_n_s32(-shift); + +#pragma unroll(4) + for (k = 0; k < 4; k++) + { + Ek = vaddq_s32(EEk , EOk); + Ek + 4 = vsubq_s32(EE3 - k , EO3 - k); + } + +#pragma unroll(2) + for (k = 0; k < 8; k += 4) + { + int32x4_t s4; + s0 = MULK(1, 0); + s1 = MULK(1, 1); + s2 = MULK(1, 2); + s3 = MULK(1, 3); + ODD3_15(0); + ODD3_15(1); + ODD3_15(2); + ODD3_15(3); + Ok = s0; + Ok + 1 = s1; + Ok + 2 = s2; + Ok + 3 = s3; + int32x4_t t; + int16x4_t x0, x1, x2, x3; + + Ek = vaddq_s32(vdupq_n_s32(add), Ek); + t = vaddq_s32(Ek, Ok); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x0 = vmovn_s32(t); + + Ek + 1 = vaddq_s32(vdupq_n_s32(add), Ek + 1); + t = vaddq_s32(Ek + 1, Ok + 1); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x1 = vmovn_s32(t); + + Ek + 2 = vaddq_s32(vdupq_n_s32(add), Ek + 2); + t = vaddq_s32(Ek + 2, Ok + 2); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x2 = vmovn_s32(t); + + Ek + 3 = vaddq_s32(vdupq_n_s32(add), Ek + 3); + t = vaddq_s32(Ek + 3, Ok + 3); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x3 = vmovn_s32(t); + + transpose_4x4x16(x0, x1, x2, x3); + *(int16x4_t *)&orig_dst0 * 16 + k = x0; + *(int16x4_t *)&orig_dst1 * 16 + k = x1; + *(int16x4_t *)&orig_dst2 * 16 + k = x2; + *(int16x4_t *)&orig_dst3 * 16 + k = x3; + } + + +#pragma unroll(2) + for (k = 0; k < 8; k += 4) + { + int32x4_t t; + int16x4_t x0, x1, x2, x3; + + t = vsubq_s32(E7 - k, O7 - k); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x0 = vmovn_s32(t); + + t = vsubq_s32(E6 - k, O6 - k); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x1 = vmovn_s32(t); + + t = vsubq_s32(E5 - k, O5 - k); + + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, min); + t = vminq_s32(t, max); + x2 = vmovn_s32(t); + + t = vsubq_s32(E4 - k, O4 - k); + t = vshlq_s32(t, minus_shift); + t = vmaxq_s32(t, 
min); + t = vminq_s32(t, max); + x3 = vmovn_s32(t); + + transpose_4x4x16(x0, x1, x2, x3); + *(int16x4_t *)&orig_dst0 * 16 + k + 8 = x0; + *(int16x4_t *)&orig_dst1 * 16 + k + 8 = x1; + *(int16x4_t *)&orig_dst2 * 16 + k + 8 = x2; + *(int16x4_t *)&orig_dst3 * 16 + k + 8 = x3; + } + orig_dst += 4 * 16; + src += 4; + } + +#undef MUL +#undef FMA +#undef FMAK +#undef MULK +#undef ODD3_15 +#undef EVEN6_14_STEP4 + + +} + + + +static void partialButterflyInverse32_neon(const int16_t *src, int16_t *orig_dst, int shift, int line) +{ +#define MUL(x) vmull_s16(vdup_n_s16(g_t32xk),*(int16x4_t*)&srcx*line); +#define FMA(x) s = vmlal_s16(s,vdup_n_s16(g_t32xk),*(int16x4_t*)&src(x)*line) +#define FMAK(x,l) sl = vmlal_lane_s16(sl,*(int16x4_t*)&src(x)*line,*(int16x4_t *)&g_t32xk,l) +#define MULK(x,l) vmull_lane_s16(*(int16x4_t*)&srcx*line,*(int16x4_t *)&g_t32xk,l); +#define ODD31(k) FMAK(3,k);FMAK(5,k);FMAK(7,k);FMAK(9,k);FMAK(11,k);FMAK(13,k);FMAK(15,k);FMAK(17,k);FMAK(19,k);FMAK(21,k);FMAK(23,k);FMAK(25,k);FMAK(27,k);FMAK(29,k);FMAK(31,k); + +#define ODD15(k) FMAK(6,k);FMAK(10,k);FMAK(14,k);FMAK(18,k);FMAK(22,k);FMAK(26,k);FMAK(30,k); +#define ODD7(k) FMAK(12,k);FMAK(20,k);FMAK(28,k); + + + int j, k; + int32x4_t E16, O16; + int32x4_t EE8, EO8; + int32x4_t EEE4, EEO4; + int32x4_t EEEE2, EEEO2; + int16x4_t dst32; + int add = 1 << (shift - 1); + +#pragma unroll (8) + for (j = 0; j < line; j += 4) + { +#pragma unroll (4) + for (k = 0; k < 16; k += 4) + { + int32x4_t s4; + s0 = MULK(1, 0); + s1 = MULK(1, 1); + s2 = MULK(1, 2); + s3 = MULK(1, 3); + ODD31(0); + ODD31(1); + ODD31(2); + ODD31(3); + Ok = s0; + Ok + 1 = s1; + Ok + 2 = s2; + Ok + 3 = s3; + + + } + + +#pragma unroll (2) + for (k = 0; k < 8; k += 4) + { + int32x4_t s4; + s0 = MULK(2, 0); + s1 = MULK(2, 1); + s2 = MULK(2, 2); + s3 = MULK(2, 3); + + ODD15(0); + ODD15(1); + ODD15(2); + ODD15(3); + + EOk = s0; + EOk + 1 = s1; + EOk + 2 = s2; + EOk + 3 = s3; + } + + + for (k = 0; k < 4; k += 4) + { + int32x4_t s4; + s0 = MULK(4, 0); + s1 = MULK(4, 1); + s2 = MULK(4, 2); + s3 = MULK(4, 3); + + ODD7(0); + ODD7(1); + ODD7(2); + ODD7(3); + + EEOk = s0; + EEOk + 1 = s1; + EEOk + 2 = s2; + EEOk + 3 = s3; + } + +#pragma unroll (2) + for (k = 0; k < 2; k++) + { + int32x4_t s; + s = MUL(8); + EEEOk = FMA(24); + s = MUL(0); + EEEEk = FMA(16); + } + /* Combining even and odd terms at each hierarchy levels to calculate the final spatial domain vector */ + EEE0 = vaddq_s32(EEEE0, EEEO0); + EEE3 = vsubq_s32(EEEE0, EEEO0); + EEE1 = vaddq_s32(EEEE1, EEEO1); + EEE2 = vsubq_s32(EEEE1, EEEO1); + +#pragma unroll (4) + for (k = 0; k < 4; k++) + { + EEk = vaddq_s32(EEEk, EEOk); + EEk + 4 = vsubq_s32((EEE3 - k), (EEO3 - k)); + } + +#pragma unroll (8) + for (k = 0; k < 8; k++) + { + Ek = vaddq_s32(EEk, EOk); + Ek + 8 = vsubq_s32((EE7 - k), (EO7 - k)); + } + + static const int32x4_t min = vdupq_n_s32(-32768); + static const int32x4_t max = vdupq_n_s32(32767); + + + +#pragma unroll (16) + for (k = 0; k < 16; k++) + { + int32x4_t adde = vaddq_s32(vdupq_n_s32(add), Ek); + int32x4_t s = vaddq_s32(adde, Ok); + s = vshlq_s32(s, vdupq_n_s32(-shift)); + s = vmaxq_s32(s, min); + s = vminq_s32(s, max); + + + + dstk = vmovn_s32(s); + adde = vaddq_s32(vdupq_n_s32(add), (E15 - k)); + s = vsubq_s32(adde, (O15 - k)); + s = vshlq_s32(s, vdupq_n_s32(-shift)); + s = vmaxq_s32(s, min); + s = vminq_s32(s, max); + + dstk + 16 = vmovn_s32(s); + } + + +#pragma unroll (8) + for (k = 0; k < 32; k += 4) + { + int16x4_t x0 = dstk + 0; + int16x4_t x1 = dstk + 1; + int16x4_t x2 = dstk + 2; + int16x4_t x3 = dstk + 
3; + transpose_4x4x16(x0, x1, x2, x3); + *(int16x4_t *)&orig_dst0 * 32 + k = x0; + *(int16x4_t *)&orig_dst1 * 32 + k = x1; + *(int16x4_t *)&orig_dst2 * 32 + k = x2; + *(int16x4_t *)&orig_dst3 * 32 + k = x3; + } + orig_dst += 4 * 32; + src += 4; + } +#undef MUL +#undef FMA +#undef FMAK +#undef MULK +#undef ODD31 +#undef ODD15 +#undef ODD7 + +} + + +static void dct8_neon(const int16_t *src, int16_t *dst, intptr_t srcStride) +{ + const int shift_1st = 2 + X265_DEPTH - 8; + const int shift_2nd = 9; + + ALIGN_VAR_32(int16_t, coef8 * 8); + ALIGN_VAR_32(int16_t, block8 * 8); + + for (int i = 0; i < 8; i++) + { + memcpy(&blocki * 8, &srci * srcStride, 8 * sizeof(int16_t)); + } + + partialButterfly8(block, coef, shift_1st, 8); + partialButterfly8(coef, dst, shift_2nd, 8); +} + +static void dct16_neon(const int16_t *src, int16_t *dst, intptr_t srcStride) +{ + const int shift_1st = 3 + X265_DEPTH - 8; + const int shift_2nd = 10; + + ALIGN_VAR_32(int16_t, coef16 * 16); + ALIGN_VAR_32(int16_t, block16 * 16); + + for (int i = 0; i < 16; i++) + { + memcpy(&blocki * 16, &srci * srcStride, 16 * sizeof(int16_t)); + } + + partialButterfly16(block, coef, shift_1st, 16); + partialButterfly16(coef, dst, shift_2nd, 16); +} + +static void dct32_neon(const int16_t *src, int16_t *dst, intptr_t srcStride) +{ + const int shift_1st = 4 + X265_DEPTH - 8; + const int shift_2nd = 11; + + ALIGN_VAR_32(int16_t, coef32 * 32); + ALIGN_VAR_32(int16_t, block32 * 32); + + for (int i = 0; i < 32; i++) + { + memcpy(&blocki * 32, &srci * srcStride, 32 * sizeof(int16_t)); + } + + partialButterfly32(block, coef, shift_1st, 32); + partialButterfly32(coef, dst, shift_2nd, 32); +} + +static void idct4_neon(const int16_t *src, int16_t *dst, intptr_t dstStride) +{ + const int shift_1st = 7; + const int shift_2nd = 12 - (X265_DEPTH - 8); + + ALIGN_VAR_32(int16_t, coef4 * 4); + ALIGN_VAR_32(int16_t, block4 * 4); + + partialButterflyInverse4(src, coef, shift_1st, 4); // Forward DST BY FAST ALGORITHM, block input, coef output + partialButterflyInverse4(coef, block, shift_2nd, 4); // Forward DST BY FAST ALGORITHM, coef input, coeff output + + for (int i = 0; i < 4; i++) + { + memcpy(&dsti * dstStride, &blocki * 4, 4 * sizeof(int16_t)); + } +} + +static void idct16_neon(const int16_t *src, int16_t *dst, intptr_t dstStride) +{ + const int shift_1st = 7; + const int shift_2nd = 12 - (X265_DEPTH - 8); + + ALIGN_VAR_32(int16_t, coef16 * 16); + ALIGN_VAR_32(int16_t, block16 * 16); + + partialButterflyInverse16_neon(src, coef, shift_1st, 16); + partialButterflyInverse16_neon(coef, block, shift_2nd, 16); + + for (int i = 0; i < 16; i++) + { + memcpy(&dsti * dstStride, &blocki * 16, 16 * sizeof(int16_t)); + } +} + +static void idct32_neon(const int16_t *src, int16_t *dst, intptr_t dstStride) +{ + const int shift_1st = 7; + const int shift_2nd = 12 - (X265_DEPTH - 8); + + ALIGN_VAR_32(int16_t, coef32 * 32); + ALIGN_VAR_32(int16_t, block32 * 32); + + partialButterflyInverse32_neon(src, coef, shift_1st, 32); + partialButterflyInverse32_neon(coef, block, shift_2nd, 32); + + for (int i = 0; i < 32; i++) + { + memcpy(&dsti * dstStride, &blocki * 32, 32 * sizeof(int16_t)); + } +} + + + +} + +namespace X265_NS +{ +// x265 private namespace +void setupDCTPrimitives_neon(EncoderPrimitives &p) +{ + p.cuBLOCK_4x4.nonPsyRdoQuant = nonPsyRdoQuant_neon<2>; + p.cuBLOCK_8x8.nonPsyRdoQuant = nonPsyRdoQuant_neon<3>; + p.cuBLOCK_16x16.nonPsyRdoQuant = nonPsyRdoQuant_neon<4>; + p.cuBLOCK_32x32.nonPsyRdoQuant = nonPsyRdoQuant_neon<5>; + p.cuBLOCK_4x4.psyRdoQuant = 
psyRdoQuant_neon<2>;
+    p.cu[BLOCK_8x8].psyRdoQuant = psyRdoQuant_neon<3>;
+    p.cu[BLOCK_16x16].psyRdoQuant = psyRdoQuant_neon<4>;
+    p.cu[BLOCK_32x32].psyRdoQuant = psyRdoQuant_neon<5>;
+    p.cu[BLOCK_8x8].dct = dct8_neon;
+    p.cu[BLOCK_16x16].dct = dct16_neon;
+    p.cu[BLOCK_32x32].dct = dct32_neon;
+    p.cu[BLOCK_4x4].idct = idct4_neon;
+    p.cu[BLOCK_16x16].idct = idct16_neon;
+    p.cu[BLOCK_32x32].idct = idct32_neon;
+    p.cu[BLOCK_4x4].count_nonzero = count_nonzero_neon<4>;
+    p.cu[BLOCK_8x8].count_nonzero = count_nonzero_neon<8>;
+    p.cu[BLOCK_16x16].count_nonzero = count_nonzero_neon<16>;
+    p.cu[BLOCK_32x32].count_nonzero = count_nonzero_neon<32>;
+
+    p.cu[BLOCK_4x4].copy_cnt = copy_count_neon<4>;
+    p.cu[BLOCK_8x8].copy_cnt = copy_count_neon<8>;
+    p.cu[BLOCK_16x16].copy_cnt = copy_count_neon<16>;
+    p.cu[BLOCK_32x32].copy_cnt = copy_count_neon<32>;
+    p.cu[BLOCK_4x4].psyRdoQuant_1p = nonPsyRdoQuant_neon<2>;
+    p.cu[BLOCK_4x4].psyRdoQuant_2p = psyRdoQuant_neon<2>;
+    p.cu[BLOCK_8x8].psyRdoQuant_1p = nonPsyRdoQuant_neon<3>;
+    p.cu[BLOCK_8x8].psyRdoQuant_2p = psyRdoQuant_neon<3>;
+    p.cu[BLOCK_16x16].psyRdoQuant_1p = nonPsyRdoQuant_neon<4>;
+    p.cu[BLOCK_16x16].psyRdoQuant_2p = psyRdoQuant_neon<4>;
+    p.cu[BLOCK_32x32].psyRdoQuant_1p = nonPsyRdoQuant_neon<5>;
+    p.cu[BLOCK_32x32].psyRdoQuant_2p = psyRdoQuant_neon<5>;
+
+    p.scanPosLast = scanPosLast_opt;
+
+}
+
+};
+
+
+
+#endif
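For readers tracing the count_nonzero/copy_count kernels registered above: they depend on vtstq_s16 returning an all-ones (-1) lane for every nonzero coefficient, so accumulating the test masks and negating the horizontal sum counts the nonzero entries without per-lane branching. A minimal standalone sketch of that trick follows; it is illustrative only, not the x265 routine, and the function name and parameters are assumptions (AArch64 NEON, coefficient count a multiple of 8 and no larger than a 32x32 transform block, so the 16-bit accumulators cannot overflow).

#include <arm_neon.h>
#include <cstdint>

// Count nonzero int16 coefficients via NEON mask accumulation (sketch, hypothetical helper).
static int count_nonzero_sketch(const int16_t *coeffs, int n)
{
    int16x8_t vcount = vdupq_n_s16(0);
    for (int i = 0; i < n; i += 8)
    {
        int16x8_t in = vld1q_s16(coeffs + i);
        // vtstq_s16 sets a lane to all-ones (-1) when the coefficient is nonzero.
        int16x8_t mask = vreinterpretq_s16_u16(vtstq_s16(in, in));
        vcount = vaddq_s16(vcount, mask);   // each nonzero lane contributes -1
    }
    return -vaddvq_s16(vcount);             // negate the horizontal sum to get the count
}

copy_count_neon above applies the same mask accumulation while it copies the residual block into the coefficient buffer.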
View file
x265_3.6.tar.gz/source/common/aarch64/dct-prim.h
Added
@@ -0,0 +1,19 @@
+#ifndef __DCT_PRIM_NEON_H__
+#define __DCT_PRIM_NEON_H__
+
+
+#include "common.h"
+#include "primitives.h"
+#include "contexts.h"   // costCoeffNxN_c
+#include "threading.h"  // CLZ
+
+namespace X265_NS
+{
+// x265 private namespace
+void setupDCTPrimitives_neon(EncoderPrimitives &p);
+};
+
+
+
+#endif
+
View file
x265_3.6.tar.gz/source/common/aarch64/filter-prim.cpp
Added
@@ -0,0 +1,995 @@ +#if HAVE_NEON + +#include "filter-prim.h" +#include <arm_neon.h> + +namespace +{ + +using namespace X265_NS; + + +template<int width, int height> +void filterPixelToShort_neon(const pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +{ + const int shift = IF_INTERNAL_PREC - X265_DEPTH; + int row, col; + const int16x8_t off = vdupq_n_s16(IF_INTERNAL_OFFS); + for (row = 0; row < height; row++) + { + + for (col = 0; col < width; col += 8) + { + int16x8_t in; + +#if HIGH_BIT_DEPTH + in = *(int16x8_t *)&srccol; +#else + in = vmovl_u8(*(uint8x8_t *)&srccol); +#endif + + int16x8_t tmp = vshlq_n_s16(in, shift); + tmp = vsubq_s16(tmp, off); + *(int16x8_t *)&dstcol = tmp; + + } + + src += srcStride; + dst += dstStride; + } +} + + +template<int N, int width, int height> +void interp_horiz_pp_neon(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +{ + const int16_t *coeff = (N == 4) ? g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + int headRoom = IF_FILTER_PREC; + int offset = (1 << (headRoom - 1)); + uint16_t maxVal = (1 << X265_DEPTH) - 1; + int cStride = 1; + + src -= (N / 2 - 1) * cStride; + int16x8_t vc; + vc = *(int16x8_t *)coeff; + int16x4_t low_vc = vget_low_s16(vc); + int16x4_t high_vc = vget_high_s16(vc); + + const int32x4_t voffset = vdupq_n_s32(offset); + const int32x4_t vhr = vdupq_n_s32(-headRoom); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 8) + { + int32x4_t vsum1, vsum2; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { +#if HIGH_BIT_DEPTH + inputi = *(int16x8_t *)&srccol + i; +#else + inputi = vmovl_u8(*(uint8x8_t *)&srccol + i); +#endif + } + vsum1 = voffset; + vsum2 = voffset; + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input0), low_vc, 0); + vsum2 = vmlal_high_lane_s16(vsum2, input0, low_vc, 0); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input1), low_vc, 1); + vsum2 = vmlal_high_lane_s16(vsum2, input1, low_vc, 1); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input2), low_vc, 2); + vsum2 = vmlal_high_lane_s16(vsum2, input2, low_vc, 2); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input3), low_vc, 3); + vsum2 = vmlal_high_lane_s16(vsum2, input3, low_vc, 3); + + if (N == 8) + { + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input4), high_vc, 0); + vsum2 = vmlal_high_lane_s16(vsum2, input4, high_vc, 0); + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input5), high_vc, 1); + vsum2 = vmlal_high_lane_s16(vsum2, input5, high_vc, 1); + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input6), high_vc, 2); + vsum2 = vmlal_high_lane_s16(vsum2, input6, high_vc, 2); + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input7), high_vc, 3); + vsum2 = vmlal_high_lane_s16(vsum2, input7, high_vc, 3); + + } + + vsum1 = vshlq_s32(vsum1, vhr); + vsum2 = vshlq_s32(vsum2, vhr); + + int16x8_t vsum = vuzp1q_s16(vsum1, vsum2); + vsum = vminq_s16(vsum, vdupq_n_s16(maxVal)); + vsum = vmaxq_s16(vsum, vdupq_n_s16(0)); +#if HIGH_BIT_DEPTH + *(int16x8_t *)&dstcol = vsum; +#else + uint8x16_t usum = vuzp1q_u8(vsum, vsum); + *(uint8x8_t *)&dstcol = vget_low_u8(usum); +#endif + + } + + src += srcStride; + dst += dstStride; + } +} + +#if HIGH_BIT_DEPTH + +template<int N, int width, int height> +void interp_horiz_ps_neon(const uint16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, + int isRowExt) +{ + const int16_t *coeff = (N == 4) ? 
g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + const int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + const int shift = IF_FILTER_PREC - headRoom; + const int offset = (unsigned) - IF_INTERNAL_OFFS << shift; + + int blkheight = height; + src -= N / 2 - 1; + + if (isRowExt) + { + src -= (N / 2 - 1) * srcStride; + blkheight += N - 1; + } + int16x8_t vc3 = vld1q_s16(coeff); + const int32x4_t voffset = vdupq_n_s32(offset); + const int32x4_t vhr = vdupq_n_s32(-shift); + + int row, col; + for (row = 0; row < blkheight; row++) + { + for (col = 0; col < width; col += 8) + { + int32x4_t vsum, vsum2; + + int16x8_t inputN; + for (int i = 0; i < N; i++) + { + inputi = vld1q_s16((int16_t *)&srccol + i); + } + + vsum = voffset; + vsum2 = voffset; + + vsum = vmlal_lane_s16(vsum, vget_low_u16(input0), vget_low_s16(vc3), 0); + vsum2 = vmlal_high_lane_s16(vsum2, input0, vget_low_s16(vc3), 0); + + vsum = vmlal_lane_s16(vsum, vget_low_u16(input1), vget_low_s16(vc3), 1); + vsum2 = vmlal_high_lane_s16(vsum2, input1, vget_low_s16(vc3), 1); + + vsum = vmlal_lane_s16(vsum, vget_low_u16(input2), vget_low_s16(vc3), 2); + vsum2 = vmlal_high_lane_s16(vsum2, input2, vget_low_s16(vc3), 2); + + vsum = vmlal_lane_s16(vsum, vget_low_u16(input3), vget_low_s16(vc3), 3); + vsum2 = vmlal_high_lane_s16(vsum2, input3, vget_low_s16(vc3), 3); + + if (N == 8) + { + vsum = vmlal_lane_s16(vsum, vget_low_s16(input4), vget_high_s16(vc3), 0); + vsum2 = vmlal_high_lane_s16(vsum2, input4, vget_high_s16(vc3), 0); + + vsum = vmlal_lane_s16(vsum, vget_low_s16(input5), vget_high_s16(vc3), 1); + vsum2 = vmlal_high_lane_s16(vsum2, input5, vget_high_s16(vc3), 1); + + vsum = vmlal_lane_s16(vsum, vget_low_s16(input6), vget_high_s16(vc3), 2); + vsum2 = vmlal_high_lane_s16(vsum2, input6, vget_high_s16(vc3), 2); + + vsum = vmlal_lane_s16(vsum, vget_low_s16(input7), vget_high_s16(vc3), 3); + vsum2 = vmlal_high_lane_s16(vsum2, input7, vget_high_s16(vc3), 3); + } + + vsum = vshlq_s32(vsum, vhr); + vsum2 = vshlq_s32(vsum2, vhr); + *(int16x4_t *)&dstcol = vmovn_u32(vsum); + *(int16x4_t *)&dstcol+4 = vmovn_u32(vsum2); + } + + src += srcStride; + dst += dstStride; + } +} + + +#else + +template<int N, int width, int height> +void interp_horiz_ps_neon(const uint8_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, + int isRowExt) +{ + const int16_t *coeff = (N == 4) ? 
g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + const int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + const int shift = IF_FILTER_PREC - headRoom; + const int offset = (unsigned) - IF_INTERNAL_OFFS << shift; + + int blkheight = height; + src -= N / 2 - 1; + + if (isRowExt) + { + src -= (N / 2 - 1) * srcStride; + blkheight += N - 1; + } + int16x8_t vc; + vc = *(int16x8_t *)coeff; + + const int16x8_t voffset = vdupq_n_s16(offset); + const int16x8_t vhr = vdupq_n_s16(-shift); + + int row, col; + for (row = 0; row < blkheight; row++) + { + for (col = 0; col < width; col += 8) + { + int16x8_t vsum; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = vmovl_u8(*(uint8x8_t *)&srccol + i); + } + vsum = voffset; + vsum = vmlaq_laneq_s16(vsum, (input0), vc, 0); + vsum = vmlaq_laneq_s16(vsum, (input1), vc, 1); + vsum = vmlaq_laneq_s16(vsum, (input2), vc, 2); + vsum = vmlaq_laneq_s16(vsum, (input3), vc, 3); + + + if (N == 8) + { + vsum = vmlaq_laneq_s16(vsum, (input4), vc, 4); + vsum = vmlaq_laneq_s16(vsum, (input5), vc, 5); + vsum = vmlaq_laneq_s16(vsum, (input6), vc, 6); + vsum = vmlaq_laneq_s16(vsum, (input7), vc, 7); + + } + + vsum = vshlq_s16(vsum, vhr); + *(int16x8_t *)&dstcol = vsum; + } + + src += srcStride; + dst += dstStride; + } +} + +#endif + + +template<int N, int width, int height> +void interp_vert_ss_neon(const int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +{ + const int16_t *c = (N == 8 ? g_lumaFiltercoeffIdx : g_chromaFiltercoeffIdx); + int shift = IF_FILTER_PREC; + src -= (N / 2 - 1) * srcStride; + int16x8_t vc; + vc = *(int16x8_t *)c; + int16x4_t low_vc = vget_low_s16(vc); + int16x4_t high_vc = vget_high_s16(vc); + + const int32x4_t vhr = vdupq_n_s32(-shift); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 8) + { + int32x4_t vsum1, vsum2; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = *(int16x8_t *)&srccol + i * srcStride; + } + + vsum1 = vmull_lane_s16(vget_low_s16(input0), low_vc, 0); + vsum2 = vmull_high_lane_s16(input0, low_vc, 0); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input1), low_vc, 1); + vsum2 = vmlal_high_lane_s16(vsum2, input1, low_vc, 1); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input2), low_vc, 2); + vsum2 = vmlal_high_lane_s16(vsum2, input2, low_vc, 2); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input3), low_vc, 3); + vsum2 = vmlal_high_lane_s16(vsum2, input3, low_vc, 3); + + if (N == 8) + { + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input4), high_vc, 0); + vsum2 = vmlal_high_lane_s16(vsum2, input4, high_vc, 0); + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input5), high_vc, 1); + vsum2 = vmlal_high_lane_s16(vsum2, input5, high_vc, 1); + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input6), high_vc, 2); + vsum2 = vmlal_high_lane_s16(vsum2, input6, high_vc, 2); + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input7), high_vc, 3); + vsum2 = vmlal_high_lane_s16(vsum2, input7, high_vc, 3); + + } + + vsum1 = vshlq_s32(vsum1, vhr); + vsum2 = vshlq_s32(vsum2, vhr); + + int16x8_t vsum = vuzp1q_s16(vsum1, vsum2); + *(int16x8_t *)&dstcol = vsum; + } + + src += srcStride; + dst += dstStride; + } + +} + + +#if HIGH_BIT_DEPTH + +template<int N, int width, int height> +void interp_vert_pp_neon(const uint16_t *src, intptr_t srcStride, uint16_t *dst, intptr_t dstStride, int coeffIdx) +{ + + const int16_t *c = (N == 4) ? 
g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + int shift = IF_FILTER_PREC; + int offset = 1 << (shift - 1); + const uint16_t maxVal = (1 << X265_DEPTH) - 1; + + src -= (N / 2 - 1) * srcStride; + int16x8_t vc; + vc = *(int16x8_t *)c; + int32x4_t low_vc = vmovl_s16(vget_low_s16(vc)); + int32x4_t high_vc = vmovl_s16(vget_high_s16(vc)); + + const int32x4_t voffset = vdupq_n_s32(offset); + const int32x4_t vhr = vdupq_n_s32(-shift); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 4) + { + int32x4_t vsum; + + int32x4_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = vmovl_u16(*(uint16x4_t *)&srccol + i * srcStride); + } + vsum = voffset; + + vsum = vmlaq_laneq_s32(vsum, (input0), low_vc, 0); + vsum = vmlaq_laneq_s32(vsum, (input1), low_vc, 1); + vsum = vmlaq_laneq_s32(vsum, (input2), low_vc, 2); + vsum = vmlaq_laneq_s32(vsum, (input3), low_vc, 3); + + if (N == 8) + { + vsum = vmlaq_laneq_s32(vsum, (input4), high_vc, 0); + vsum = vmlaq_laneq_s32(vsum, (input5), high_vc, 1); + vsum = vmlaq_laneq_s32(vsum, (input6), high_vc, 2); + vsum = vmlaq_laneq_s32(vsum, (input7), high_vc, 3); + } + + vsum = vshlq_s32(vsum, vhr); + vsum = vminq_s32(vsum, vdupq_n_s32(maxVal)); + vsum = vmaxq_s32(vsum, vdupq_n_s32(0)); + *(uint16x4_t *)&dstcol = vmovn_u32(vsum); + } + src += srcStride; + dst += dstStride; + } +} + + + + +#else + +template<int N, int width, int height> +void interp_vert_pp_neon(const uint8_t *src, intptr_t srcStride, uint8_t *dst, intptr_t dstStride, int coeffIdx) +{ + + const int16_t *c = (N == 4) ? g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + int shift = IF_FILTER_PREC; + int offset = 1 << (shift - 1); + const uint16_t maxVal = (1 << X265_DEPTH) - 1; + + src -= (N / 2 - 1) * srcStride; + int16x8_t vc; + vc = *(int16x8_t *)c; + + const int16x8_t voffset = vdupq_n_s16(offset); + const int16x8_t vhr = vdupq_n_s16(-shift); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 8) + { + int16x8_t vsum; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = vmovl_u8(*(uint8x8_t *)&srccol + i * srcStride); + } + vsum = voffset; + + vsum = vmlaq_laneq_s16(vsum, (input0), vc, 0); + vsum = vmlaq_laneq_s16(vsum, (input1), vc, 1); + vsum = vmlaq_laneq_s16(vsum, (input2), vc, 2); + vsum = vmlaq_laneq_s16(vsum, (input3), vc, 3); + + if (N == 8) + { + vsum = vmlaq_laneq_s16(vsum, (input4), vc, 4); + vsum = vmlaq_laneq_s16(vsum, (input5), vc, 5); + vsum = vmlaq_laneq_s16(vsum, (input6), vc, 6); + vsum = vmlaq_laneq_s16(vsum, (input7), vc, 7); + + } + + vsum = vshlq_s16(vsum, vhr); + + vsum = vminq_s16(vsum, vdupq_n_s16(maxVal)); + vsum = vmaxq_s16(vsum, vdupq_n_s16(0)); + uint8x16_t usum = vuzp1q_u8(vsum, vsum); + *(uint8x8_t *)&dstcol = vget_low_u8(usum); + + } + + src += srcStride; + dst += dstStride; + } +} + + +#endif + + +#if HIGH_BIT_DEPTH + +template<int N, int width, int height> +void interp_vert_ps_neon(const uint16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +{ + const int16_t *c = (N == 4) ? 
g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + int shift = IF_FILTER_PREC - headRoom; + int offset = (unsigned) - IF_INTERNAL_OFFS << shift; + src -= (N / 2 - 1) * srcStride; + + int16x8_t vc; + vc = *(int16x8_t *)c; + int32x4_t low_vc = vmovl_s16(vget_low_s16(vc)); + int32x4_t high_vc = vmovl_s16(vget_high_s16(vc)); + + const int32x4_t voffset = vdupq_n_s32(offset); + const int32x4_t vhr = vdupq_n_s32(-shift); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 4) + { + int16x8_t vsum; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = vmovl_u16(*(uint16x4_t *)&srccol + i * srcStride); + } + vsum = voffset; + + vsum = vmlaq_laneq_s32(vsum, (input0), low_vc, 0); + vsum = vmlaq_laneq_s32(vsum, (input1), low_vc, 1); + vsum = vmlaq_laneq_s32(vsum, (input2), low_vc, 2); + vsum = vmlaq_laneq_s32(vsum, (input3), low_vc, 3); + + if (N == 8) + { + int16x8_t vsum1 = vmulq_laneq_s32((input4), high_vc, 0); + vsum1 = vmlaq_laneq_s32(vsum1, (input5), high_vc, 1); + vsum1 = vmlaq_laneq_s32(vsum1, (input6), high_vc, 2); + vsum1 = vmlaq_laneq_s32(vsum1, (input7), high_vc, 3); + vsum = vaddq_s32(vsum, vsum1); + } + + vsum = vshlq_s32(vsum, vhr); + + *(uint16x4_t *)&dstcol = vmovn_s32(vsum); + } + + src += srcStride; + dst += dstStride; + } +} + +#else + +template<int N, int width, int height> +void interp_vert_ps_neon(const uint8_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +{ + const int16_t *c = (N == 4) ? g_chromaFiltercoeffIdx : g_lumaFiltercoeffIdx; + int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + int shift = IF_FILTER_PREC - headRoom; + int offset = (unsigned) - IF_INTERNAL_OFFS << shift; + src -= (N / 2 - 1) * srcStride; + + int16x8_t vc; + vc = *(int16x8_t *)c; + + const int16x8_t voffset = vdupq_n_s16(offset); + const int16x8_t vhr = vdupq_n_s16(-shift); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 8) + { + int16x8_t vsum; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = vmovl_u8(*(uint8x8_t *)&srccol + i * srcStride); + } + vsum = voffset; + + vsum = vmlaq_laneq_s16(vsum, (input0), vc, 0); + vsum = vmlaq_laneq_s16(vsum, (input1), vc, 1); + vsum = vmlaq_laneq_s16(vsum, (input2), vc, 2); + vsum = vmlaq_laneq_s16(vsum, (input3), vc, 3); + + if (N == 8) + { + int16x8_t vsum1 = vmulq_laneq_s16((input4), vc, 4); + vsum1 = vmlaq_laneq_s16(vsum1, (input5), vc, 5); + vsum1 = vmlaq_laneq_s16(vsum1, (input6), vc, 6); + vsum1 = vmlaq_laneq_s16(vsum1, (input7), vc, 7); + vsum = vaddq_s16(vsum, vsum1); + } + + vsum = vshlq_s32(vsum, vhr); + *(int16x8_t *)&dstcol = vsum; + } + + src += srcStride; + dst += dstStride; + } +} + +#endif + + + +template<int N, int width, int height> +void interp_vert_sp_neon(const int16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +{ + int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + int shift = IF_FILTER_PREC + headRoom; + int offset = (1 << (shift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC); + uint16_t maxVal = (1 << X265_DEPTH) - 1; + const int16_t *coeff = (N == 8 ? 
g_lumaFiltercoeffIdx : g_chromaFiltercoeffIdx); + + src -= (N / 2 - 1) * srcStride; + + int16x8_t vc; + vc = *(int16x8_t *)coeff; + int16x4_t low_vc = vget_low_s16(vc); + int16x4_t high_vc = vget_high_s16(vc); + + const int32x4_t voffset = vdupq_n_s32(offset); + const int32x4_t vhr = vdupq_n_s32(-shift); + + int row, col; + for (row = 0; row < height; row++) + { + for (col = 0; col < width; col += 8) + { + int32x4_t vsum1, vsum2; + + int16x8_t inputN; + + for (int i = 0; i < N; i++) + { + inputi = *(int16x8_t *)&srccol + i * srcStride; + } + vsum1 = voffset; + vsum2 = voffset; + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input0), low_vc, 0); + vsum2 = vmlal_high_lane_s16(vsum2, input0, low_vc, 0); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input1), low_vc, 1); + vsum2 = vmlal_high_lane_s16(vsum2, input1, low_vc, 1); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input2), low_vc, 2); + vsum2 = vmlal_high_lane_s16(vsum2, input2, low_vc, 2); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input3), low_vc, 3); + vsum2 = vmlal_high_lane_s16(vsum2, input3, low_vc, 3); + + if (N == 8) + { + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input4), high_vc, 0); + vsum2 = vmlal_high_lane_s16(vsum2, input4, high_vc, 0); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input5), high_vc, 1); + vsum2 = vmlal_high_lane_s16(vsum2, input5, high_vc, 1); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input6), high_vc, 2); + vsum2 = vmlal_high_lane_s16(vsum2, input6, high_vc, 2); + + vsum1 = vmlal_lane_s16(vsum1, vget_low_s16(input7), high_vc, 3); + vsum2 = vmlal_high_lane_s16(vsum2, input7, high_vc, 3); + } + + vsum1 = vshlq_s32(vsum1, vhr); + vsum2 = vshlq_s32(vsum2, vhr); + + int16x8_t vsum = vuzp1q_s16(vsum1, vsum2); + vsum = vminq_s16(vsum, vdupq_n_s16(maxVal)); + vsum = vmaxq_s16(vsum, vdupq_n_s16(0)); +#if HIGH_BIT_DEPTH + *(int16x8_t *)&dstcol = vsum; +#else + uint8x16_t usum = vuzp1q_u8(vsum, vsum); + *(uint8x8_t *)&dstcol = vget_low_u8(usum); +#endif + + } + + src += srcStride; + dst += dstStride; + } +} + + + + + + +template<int N, int width, int height> +void interp_hv_pp_neon(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int idxX, int idxY) +{ + ALIGN_VAR_32(int16_t, immedwidth * (height + N - 1)); + + interp_horiz_ps_neon<N, width, height>(src, srcStride, immed, width, idxX, 1); + interp_vert_sp_neon<N, width, height>(immed + (N / 2 - 1) * width, width, dst, dstStride, idxY); +} + + + +} + + + + +namespace X265_NS +{ +#if defined(__APPLE__) +#define CHROMA_420(W, H) \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>; + +#define CHROMA_FILTER_420(W, H) \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; + +#else // defined(__APPLE__) +#define CHROMA_420(W, H) \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## 
x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>; + +#define CHROMA_FILTER_420(W, H) \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>; +#endif // defined(__APPLE__) + +#if defined(__APPLE__) +#define CHROMA_422(W, H) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>; + +#define CHROMA_FILTER_422(W, H) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; + +#else // defined(__APPLE__) +#define CHROMA_422(W, H) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>; + +#define CHROMA_FILTER_422(W, H) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>; +#endif // defined(__APPLE__) + +#if defined(__APPLE__) +#define CHROMA_444(W, H) \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sALIGNED = filterPixelToShort_neon<W, H>; + +#define CHROMA_FILTER_444(W, H) \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; + +#else // defined(__APPLE__) +#define CHROMA_444(W, H) \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.p2sALIGNED = 
filterPixelToShort_neon<W, H>; + +#define CHROMA_FILTER_444(W, H) \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hpp = interp_horiz_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_hps = interp_horiz_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vpp = interp_vert_pp_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vps = interp_vert_ps_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vsp = interp_vert_sp_neon<4, W, H>; \ + p.chromaX265_CSP_I444.puLUMA_ ## W ## x ## H.filter_vss = interp_vert_ss_neon<4, W, H>; +#endif // defined(__APPLE__) + +#if defined(__APPLE__) +#define LUMA(W, H) \ + p.puLUMA_ ## W ## x ## H.luma_hpp = interp_horiz_pp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vpp = interp_vert_pp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vps = interp_vert_ps_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vsp = interp_vert_sp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vss = interp_vert_ss_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_hvpp = interp_hv_pp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.convert_p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.puLUMA_ ## W ## x ## H.convert_p2sALIGNED = filterPixelToShort_neon<W, H>; + +#else // defined(__APPLE__) +#define LUMA(W, H) \ + p.puLUMA_ ## W ## x ## H.luma_vss = interp_vert_ss_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.convert_p2sNONALIGNED = filterPixelToShort_neon<W, H>;\ + p.puLUMA_ ## W ## x ## H.convert_p2sALIGNED = filterPixelToShort_neon<W, H>; + +#define LUMA_FILTER(W, H) \ + p.puLUMA_ ## W ## x ## H.luma_hpp = interp_horiz_pp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vpp = interp_vert_pp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vps = interp_vert_ps_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_vsp = interp_vert_sp_neon<8, W, H>; \ + p.puLUMA_ ## W ## x ## H.luma_hvpp = interp_hv_pp_neon<8, W, H>; +#endif // defined(__APPLE__) + +void setupFilterPrimitives_neon(EncoderPrimitives &p) +{ + + // All neon functions assume width of multiple of 8, (2,4,12 variants are not optimized) + + LUMA(8, 8); + LUMA(8, 4); + LUMA(16, 16); + CHROMA_420(8, 8); + LUMA(16, 8); + CHROMA_420(8, 4); + LUMA(8, 16); + LUMA(16, 12); + CHROMA_420(8, 6); + LUMA(16, 4); + CHROMA_420(8, 2); + LUMA(32, 32); + CHROMA_420(16, 16); + LUMA(32, 16); + CHROMA_420(16, 8); + LUMA(16, 32); + CHROMA_420(8, 16); + LUMA(32, 24); + CHROMA_420(16, 12); + LUMA(24, 32); + LUMA(32, 8); + CHROMA_420(16, 4); + LUMA(8, 32); + LUMA(64, 64); + CHROMA_420(32, 32); + LUMA(64, 32); + CHROMA_420(32, 16); + LUMA(32, 64); + CHROMA_420(16, 32); + LUMA(64, 48); + CHROMA_420(32, 24); + LUMA(48, 64); + CHROMA_420(24, 32); + LUMA(64, 16); + CHROMA_420(32, 8); + LUMA(16, 64); + CHROMA_420(8, 32); + CHROMA_422(8, 16); + CHROMA_422(8, 8); + CHROMA_422(8, 12); + CHROMA_422(8, 4); + CHROMA_422(16, 32); + CHROMA_422(16, 16); + CHROMA_422(8, 32); + CHROMA_422(16, 24); + CHROMA_422(16, 8); + CHROMA_422(32, 64); + CHROMA_422(32, 32); + CHROMA_422(16, 64); + CHROMA_422(32, 48); + CHROMA_422(24, 64); + CHROMA_422(32, 16); + CHROMA_422(8, 64); + CHROMA_444(8, 8); + CHROMA_444(8, 4); + CHROMA_444(16, 16); + CHROMA_444(16, 8); + CHROMA_444(8, 16); + CHROMA_444(16, 12); + CHROMA_444(16, 4); + CHROMA_444(32, 32); + CHROMA_444(32, 16); + CHROMA_444(16, 32); + CHROMA_444(32, 24); + CHROMA_444(24, 32); + CHROMA_444(32, 8); + CHROMA_444(8, 32); + CHROMA_444(64, 64); + CHROMA_444(64, 32); + CHROMA_444(32, 64); + CHROMA_444(64, 48); + CHROMA_444(48, 64); + 
CHROMA_444(64, 16); + CHROMA_444(16, 64); + +#if defined(__APPLE__) || HIGH_BIT_DEPTH + p.puLUMA_8x4.luma_hps = interp_horiz_ps_neon<8, 8, 4>; + p.puLUMA_8x8.luma_hps = interp_horiz_ps_neon<8, 8, 8>; + p.puLUMA_8x16.luma_hps = interp_horiz_ps_neon<8, 8, 16>; + p.puLUMA_8x32.luma_hps = interp_horiz_ps_neon<8, 8, 32>; +#endif // HIGH_BIT_DEPTH + +#if !defined(__APPLE__) && HIGH_BIT_DEPTH + p.puLUMA_24x32.luma_hps = interp_horiz_ps_neon<8, 24, 32>; +#endif // !defined(__APPLE__) + +#if !defined(__APPLE__) + p.puLUMA_32x8.luma_hpp = interp_horiz_pp_neon<8, 32, 8>; + p.puLUMA_32x16.luma_hpp = interp_horiz_pp_neon<8, 32, 16>; + p.puLUMA_32x24.luma_hpp = interp_horiz_pp_neon<8, 32, 24>; + p.puLUMA_32x32.luma_hpp = interp_horiz_pp_neon<8, 32, 32>; + p.puLUMA_32x64.luma_hpp = interp_horiz_pp_neon<8, 32, 64>; + p.puLUMA_48x64.luma_hpp = interp_horiz_pp_neon<8, 48, 64>; + p.puLUMA_64x16.luma_hpp = interp_horiz_pp_neon<8, 64, 16>; + p.puLUMA_64x32.luma_hpp = interp_horiz_pp_neon<8, 64, 32>; + p.puLUMA_64x48.luma_hpp = interp_horiz_pp_neon<8, 64, 48>; + p.puLUMA_64x64.luma_hpp = interp_horiz_pp_neon<8, 64, 64>; + + LUMA_FILTER(8, 4); + LUMA_FILTER(8, 8); + LUMA_FILTER(8, 16); + LUMA_FILTER(8, 32); + LUMA_FILTER(24, 32); + + LUMA_FILTER(16, 32); + LUMA_FILTER(32, 16); + LUMA_FILTER(32, 24); + LUMA_FILTER(32, 32); + LUMA_FILTER(32, 64); + LUMA_FILTER(48, 64); + LUMA_FILTER(64, 32); + LUMA_FILTER(64, 48); + LUMA_FILTER(64, 64); + + CHROMA_FILTER_420(24, 32); + + p.chromaX265_CSP_I420.puCHROMA_420_32x8.filter_hpp = interp_horiz_pp_neon<4, 32, 8>; + p.chromaX265_CSP_I420.puCHROMA_420_32x16.filter_hpp = interp_horiz_pp_neon<4, 32, 16>; + p.chromaX265_CSP_I420.puCHROMA_420_32x24.filter_hpp = interp_horiz_pp_neon<4, 32, 24>; + p.chromaX265_CSP_I420.puCHROMA_420_32x32.filter_hpp = interp_horiz_pp_neon<4, 32, 32>; + + CHROMA_FILTER_422(24, 64); + + p.chromaX265_CSP_I422.puCHROMA_422_32x16.filter_hpp = interp_horiz_pp_neon<4, 32, 16>; + p.chromaX265_CSP_I422.puCHROMA_422_32x32.filter_hpp = interp_horiz_pp_neon<4, 32, 32>; + p.chromaX265_CSP_I422.puCHROMA_422_32x48.filter_hpp = interp_horiz_pp_neon<4, 32, 48>; + p.chromaX265_CSP_I422.puCHROMA_422_32x64.filter_hpp = interp_horiz_pp_neon<4, 32, 64>; + + CHROMA_FILTER_444(24, 32); + + p.chromaX265_CSP_I444.puLUMA_32x8.filter_hpp = interp_horiz_pp_neon<4, 32, 8>; + p.chromaX265_CSP_I444.puLUMA_32x16.filter_hpp = interp_horiz_pp_neon<4, 32, 16>; + p.chromaX265_CSP_I444.puLUMA_32x24.filter_hpp = interp_horiz_pp_neon<4, 32, 24>; + p.chromaX265_CSP_I444.puLUMA_32x32.filter_hpp = interp_horiz_pp_neon<4, 32, 32>; + p.chromaX265_CSP_I444.puLUMA_32x64.filter_hpp = interp_horiz_pp_neon<4, 32, 64>; + p.chromaX265_CSP_I444.puLUMA_48x64.filter_hpp = interp_horiz_pp_neon<4, 48, 64>; + p.chromaX265_CSP_I444.puLUMA_64x16.filter_hpp = interp_horiz_pp_neon<4, 64, 16>; + p.chromaX265_CSP_I444.puLUMA_64x32.filter_hpp = interp_horiz_pp_neon<4, 64, 32>; + p.chromaX265_CSP_I444.puLUMA_64x48.filter_hpp = interp_horiz_pp_neon<4, 64, 48>; + p.chromaX265_CSP_I444.puLUMA_64x64.filter_hpp = interp_horiz_pp_neon<4, 64, 64>; + + p.chromaX265_CSP_I444.puLUMA_16x4.filter_vss = interp_vert_ss_neon<4, 16, 4>; + p.chromaX265_CSP_I444.puLUMA_16x8.filter_vss = interp_vert_ss_neon<4, 16, 8>; + p.chromaX265_CSP_I444.puLUMA_16x12.filter_vss = interp_vert_ss_neon<4, 16, 12>; + p.chromaX265_CSP_I444.puLUMA_16x16.filter_vss = interp_vert_ss_neon<4, 16, 16>; + p.chromaX265_CSP_I444.puLUMA_16x32.filter_vss = interp_vert_ss_neon<4, 16, 32>; + p.chromaX265_CSP_I444.puLUMA_16x64.filter_vss = interp_vert_ss_neon<4, 
16, 64>; + p.chromaX265_CSP_I444.puLUMA_32x8.filter_vss = interp_vert_ss_neon<4, 32, 8>; + p.chromaX265_CSP_I444.puLUMA_32x16.filter_vss = interp_vert_ss_neon<4, 32, 16>; + p.chromaX265_CSP_I444.puLUMA_32x24.filter_vss = interp_vert_ss_neon<4, 32, 24>; + p.chromaX265_CSP_I444.puLUMA_32x32.filter_vss = interp_vert_ss_neon<4, 32, 32>; + p.chromaX265_CSP_I444.puLUMA_32x64.filter_vss = interp_vert_ss_neon<4, 32, 64>; +#endif // !defined(__APPLE__) + + CHROMA_FILTER_420(8, 2); + CHROMA_FILTER_420(8, 4); + CHROMA_FILTER_420(8, 6); + CHROMA_FILTER_420(8, 8); + CHROMA_FILTER_420(8, 16); + CHROMA_FILTER_420(8, 32); + + CHROMA_FILTER_422(8, 4); + CHROMA_FILTER_422(8, 8); + CHROMA_FILTER_422(8, 12); + CHROMA_FILTER_422(8, 16); + CHROMA_FILTER_422(8, 32); + CHROMA_FILTER_422(8, 64); + + CHROMA_FILTER_444(8, 4); + CHROMA_FILTER_444(8, 8); + CHROMA_FILTER_444(8, 16); + CHROMA_FILTER_444(8, 32); + +#if defined(__APPLE__) + CHROMA_FILTER_420(16, 4); + CHROMA_FILTER_420(16, 8); + CHROMA_FILTER_420(16, 12); + CHROMA_FILTER_420(16, 16); + CHROMA_FILTER_420(16, 32); + + CHROMA_FILTER_422(16, 8); + CHROMA_FILTER_422(16, 16); + CHROMA_FILTER_422(16, 24); + CHROMA_FILTER_422(16, 32); + CHROMA_FILTER_422(16, 64); + + CHROMA_FILTER_444(16, 4); + CHROMA_FILTER_444(16, 8); + CHROMA_FILTER_444(16, 12); + CHROMA_FILTER_444(16, 16); + CHROMA_FILTER_444(16, 32); + CHROMA_FILTER_444(16, 64); +#endif // defined(__APPLE__) +} + +}; + + +#endif + +
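All of the interpolation kernels above evaluate the same N-tap FIR: add a rounding offset, accumulate the products of neighbouring samples with the selected coefficient row, shift down by IF_FILTER_PREC and clamp to the pixel range. A scalar reference of that arithmetic for the 8-bit pp path follows; it is a sketch only, under the assumption that IF_FILTER_PREC is 6, and the function name and parameters are illustrative.

#include <cstdint>

// Scalar reference for the N-tap pixel-to-pixel interpolation (N = 4 chroma, N = 8 luma).
static void interp_horiz_pp_ref(const uint8_t *src, intptr_t srcStride,
                                uint8_t *dst, intptr_t dstStride,
                                const int16_t *coeff, int N, int width, int height)
{
    const int shift  = 6;                 // IF_FILTER_PREC
    const int offset = 1 << (shift - 1);  // rounding term
    src -= N / 2 - 1;                     // centre the taps on the current sample
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = offset;
            for (int i = 0; i < N; i++)
                sum += coeff[i] * src[x + i];
            sum >>= shift;
            dst[x] = (uint8_t)(sum < 0 ? 0 : (sum > 255 ? 255 : sum));  // clamp to pixel range
        }
        src += srcStride;
        dst += dstStride;
    }
}

The NEON versions compute eight such outputs per iteration, which is why setupFilterPrimitives_neon only registers widths that are multiples of 8, as its own comment notes.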
View file
x265_3.6.tar.gz/source/common/aarch64/filter-prim.h
Added
@@ -0,0 +1,21 @@
+#ifndef _FILTER_PRIM_ARM64_H__
+#define _FILTER_PRIM_ARM64_H__
+
+
+#include "common.h"
+#include "slicetype.h"  // LOWRES_COST_MASK
+#include "primitives.h"
+#include "x265.h"
+
+
+namespace X265_NS
+{
+
+
+void setupFilterPrimitives_neon(EncoderPrimitives &p);
+
+};
+
+
+#endif
+
View file
x265_3.6.tar.gz/source/common/aarch64/fun-decls.h
Added
@@ -0,0 +1,256 @@ +/***************************************************************************** + * Copyright (C) 2021 MulticoreWare, Inc + * + * Authors: Sebastian Pop <spop@amazon.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#define FUNCDEF_TU(ret, name, cpu, ...) \ + ret PFX(name ## _4x4_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _8x8_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _16x16_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _32x32_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _64x64_ ## cpu(__VA_ARGS__)) + +#define FUNCDEF_TU_S(ret, name, cpu, ...) \ + ret PFX(name ## _4_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _8_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _16_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _32_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _64_ ## cpu(__VA_ARGS__)) + +#define FUNCDEF_TU_S2(ret, name, cpu, ...) \ + ret PFX(name ## 4_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 8_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 16_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 32_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 64_ ## cpu(__VA_ARGS__)) + +#define FUNCDEF_PU(ret, name, cpu, ...) \ + ret PFX(name ## _4x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x24_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _24x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x48_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _48x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x64_ ## cpu)(__VA_ARGS__) + +#define FUNCDEF_CHROMA_PU(ret, name, cpu, ...) 
\ + FUNCDEF_PU(ret, name, cpu, __VA_ARGS__); \ + ret PFX(name ## _4x2_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _2x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x2_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _2x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x6_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _6x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _6x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x6_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _2x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x2_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x48_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _48x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x24_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _24x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x24_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _24x64_ ## cpu)(__VA_ARGS__); + +#define DECLS(cpu) \ + FUNCDEF_TU(void, cpy2Dto1D_shl, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \ + FUNCDEF_TU(void, cpy2Dto1D_shr, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \ + FUNCDEF_TU(void, cpy1Dto2D_shl, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \ + FUNCDEF_TU(void, cpy1Dto2D_shl_aligned, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \ + FUNCDEF_TU(void, cpy1Dto2D_shr, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); \ + FUNCDEF_TU_S(uint32_t, copy_cnt, cpu, int16_t* dst, const int16_t* src, intptr_t srcStride); \ + FUNCDEF_TU_S(int, count_nonzero, cpu, const int16_t* quantCoeff); \ + FUNCDEF_TU(void, blockfill_s, cpu, int16_t* dst, intptr_t dstride, int16_t val); \ + FUNCDEF_TU(void, blockfill_s_aligned, cpu, int16_t* dst, intptr_t dstride, int16_t val); \ + FUNCDEF_CHROMA_PU(void, blockcopy_ss, cpu, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); \ + FUNCDEF_CHROMA_PU(void, blockcopy_pp, cpu, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); \ + FUNCDEF_PU(void, blockcopy_sp, cpu, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); \ + FUNCDEF_PU(void, blockcopy_ps, cpu, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); \ + FUNCDEF_PU(void, interp_8tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \ + FUNCDEF_PU(void, interp_8tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_vert_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int 
coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_hv_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); \ + FUNCDEF_CHROMA_PU(void, filterPixelToShort, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \ + FUNCDEF_CHROMA_PU(void, filterPixelToShort_aligned, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \ + FUNCDEF_CHROMA_PU(void, interp_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, addAvg, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \ + FUNCDEF_CHROMA_PU(void, addAvg_aligned, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \ + FUNCDEF_PU(void, pixel_avg_pp, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \ + FUNCDEF_PU(void, pixel_avg_pp_aligned, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \ + FUNCDEF_PU(void, sad_x3, cpu, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \ + FUNCDEF_PU(void, sad_x4, cpu, const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \ + FUNCDEF_CHROMA_PU(int, pixel_sad, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \ + FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \ + FUNCDEF_TU_S(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \ + FUNCDEF_TU_S(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \ + FUNCDEF_PU(sse_t, pixel_sse_pp, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_CHROMA_PU(sse_t, pixel_sse_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \ + FUNCDEF_PU(void, pixel_sub_ps, cpu, int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); \ + FUNCDEF_PU(void, pixel_add_ps, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \ + FUNCDEF_PU(void, pixel_add_ps_aligned, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \ + FUNCDEF_CHROMA_PU(int, pixel_satd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_TU_S2(void, ssimDist, cpu, const pixel *fenc, uint32_t fStride, const pixel *recon, intptr_t 
rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k); \ + FUNCDEF_TU_S2(void, normFact, cpu, const pixel *src, uint32_t blockSize, int shift, uint64_t *z_k) + +DECLS(neon); +DECLS(sve); +DECLS(sve2); + + +void x265_pixel_planecopy_cp_neon(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); + +uint64_t x265_pixel_var_8x8_neon(const pixel* pix, intptr_t stride); +uint64_t x265_pixel_var_16x16_neon(const pixel* pix, intptr_t stride); +uint64_t x265_pixel_var_32x32_neon(const pixel* pix, intptr_t stride); +uint64_t x265_pixel_var_64x64_neon(const pixel* pix, intptr_t stride); + +void x265_getResidual4_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); +void x265_getResidual8_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); +void x265_getResidual16_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); +void x265_getResidual32_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); + +void x265_scale1D_128to64_neon(pixel *dst, const pixel *src); +void x265_scale2D_64to32_neon(pixel* dst, const pixel* src, intptr_t stride); + +int x265_pixel_satd_4x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_4x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_4x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_4x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x12_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_12x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_12x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x12_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x24_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_16x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_24x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_24x64_neon(const pixel* pix1, intptr_t 
stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x24_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x48_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_48x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_64x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_64x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_64x48_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_64x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); + +int x265_pixel_sa8d_8x8_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); +int x265_pixel_sa8d_8x16_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); +int x265_pixel_sa8d_16x16_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); +int x265_pixel_sa8d_16x32_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); +int x265_pixel_sa8d_32x32_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); +int x265_pixel_sa8d_32x64_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); +int x265_pixel_sa8d_64x64_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2); + +uint32_t PFX(quant_neon)(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff); +uint32_t PFX(nquant_neon)(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff); + +void x265_dequant_scaling_neon(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift); +void x265_dequant_normal_neon(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift); + +void x265_ssim_4x4x2_core_neon(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24); + +int PFX(psyCost_4x4_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +int PFX(psyCost_8x8_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +void PFX(weight_pp_neon)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset); +void PFX(weight_sp_neon)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset); +int PFX(scanPosLast_neon)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize); +uint32_t PFX(costCoeffNxN_neon)(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, 
uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase); + +uint64_t x265_pixel_var_8x8_sve2(const pixel* pix, intptr_t stride); +uint64_t x265_pixel_var_16x16_sve2(const pixel* pix, intptr_t stride); +uint64_t x265_pixel_var_32x32_sve2(const pixel* pix, intptr_t stride); +uint64_t x265_pixel_var_64x64_sve2(const pixel* pix, intptr_t stride); + +void x265_getResidual16_sve2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); +void x265_getResidual32_sve2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); + +void x265_scale1D_128to64_sve2(pixel *dst, const pixel *src); +void x265_scale2D_64to32_sve2(pixel* dst, const pixel* src, intptr_t stride); + +int x265_pixel_satd_4x4_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x4_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_8x12_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x16_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_32x32_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +int x265_pixel_satd_64x48_sve(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); + +uint32_t PFX(quant_sve)(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff); + +void x265_dequant_scaling_sve2(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift); +void x265_dequant_normal_sve2(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift); + +void x265_ssim_4x4x2_core_sve2(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24); + +int PFX(psyCost_8x8_sve2)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); +void PFX(weight_sp_sve2)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset); +int PFX(scanPosLast_sve2)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
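The declarations above only expose the per-ISA kernels (NEON/SVE/SVE2); at run time x265 fills a table of function pointers with whichever implementation the detected CPU supports. The stand-alone C++ sketch below illustrates that dispatch pattern only — the type names, the has_neon() probe and the stub kernels are invented for illustration and are not the patch's actual setup code.

    #include <cstdint>
    #include <cstddef>

    // Hypothetical kernel signature modelled on the satd prototypes above.
    using satd_t = int (*)(const std::uint8_t *pix1, std::ptrdiff_t stride1,
                           const std::uint8_t *pix2, std::ptrdiff_t stride2);

    // Stub implementations standing in for the C fallback and the NEON kernel.
    static int satd_8x8_c(const std::uint8_t *, std::ptrdiff_t,
                          const std::uint8_t *, std::ptrdiff_t)    { return 0; }
    static int satd_8x8_neon(const std::uint8_t *, std::ptrdiff_t,
                             const std::uint8_t *, std::ptrdiff_t) { return 0; }

    struct Primitives { satd_t satd_8x8; };

    static bool has_neon() { return true; }   // stand-in for real CPU feature detection

    static Primitives setup_primitives()
    {
        Primitives p;
        p.satd_8x8 = satd_8x8_c;              // portable fallback first
        if (has_neon())
            p.satd_8x8 = satd_8x8_neon;       // override with the assembly kernel
        return p;
    }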
View file
x265_3.6.tar.gz/source/common/aarch64/intrapred-prim.cpp
Added
@@ -0,0 +1,265 @@
+#include "common.h"
+#include "primitives.h"
+
+
+#if 1
+#include "arm64-utils.h"
+#include <arm_neon.h>
+
+using namespace X265_NS;
+
+namespace
+{
+
+
+
+template<int width>
+void intra_pred_ang_neon(pixel *dst, intptr_t dstStride, const pixel *srcPix0, int dirMode, int bFilter)
+{
+    int width2 = width << 1;
+    // Flip the neighbours in the horizontal case.
+    int horMode = dirMode < 18;
+    pixel neighbourBuf[129];
+    const pixel *srcPix = srcPix0;
+
+    if (horMode)
+    {
+        neighbourBuf[0] = srcPix[0];
+        //for (int i = 0; i < width << 1; i++)
+        //{
+        //    neighbourBuf[1 + i] = srcPix[width2 + 1 + i];
+        //    neighbourBuf[width2 + 1 + i] = srcPix[1 + i];
+        //}
+        memcpy(&neighbourBuf[1], &srcPix[width2 + 1], sizeof(pixel) * (width << 1));
+        memcpy(&neighbourBuf[width2 + 1], &srcPix[1], sizeof(pixel) * (width << 1));
+        srcPix = neighbourBuf;
+    }
+
+    // Intra prediction angle and inverse angle tables.
+    const int8_t angleTable[17] = { -32, -26, -21, -17, -13, -9, -5, -2, 0, 2, 5, 9, 13, 17, 21, 26, 32 };
+    const int16_t invAngleTable[8] = { 4096, 1638, 910, 630, 482, 390, 315, 256 };
+
+    // Get the prediction angle.
+    int angleOffset = horMode ? 10 - dirMode : dirMode - 26;
+    int angle = angleTable[8 + angleOffset];
+
+    // Vertical Prediction.
+    if (!angle)
+    {
+        for (int y = 0; y < width; y++)
+        {
+            memcpy(&dst[y * dstStride], srcPix + 1, sizeof(pixel)*width);
+        }
+        if (bFilter)
+        {
+            int topLeft = srcPix[0], top = srcPix[1];
+            for (int y = 0; y < width; y++)
+            {
+                dst[y * dstStride] = x265_clip((int16_t)(top + ((srcPix[width2 + 1 + y] - topLeft) >> 1)));
+            }
+        }
+    }
+    else // Angular prediction.
+    {
+        // Get the reference pixels. The reference base is the first pixel to the top (neighbourBuf[1]).
+        pixel refBuf[64];
+        const pixel *ref;
+
+        // Use the projected left neighbours and the top neighbours.
+        if (angle < 0)
+        {
+            // Number of neighbours projected.
+            int nbProjected = -((width * angle) >> 5) - 1;
+            pixel *ref_pix = refBuf + nbProjected + 1;
+
+            // Project the neighbours.
+            int invAngle = invAngleTable[-angleOffset - 1];
+            int invAngleSum = 128;
+            for (int i = 0; i < nbProjected; i++)
+            {
+                invAngleSum += invAngle;
+                ref_pix[-2 - i] = srcPix[width2 + (invAngleSum >> 8)];
+            }
+
+            // Copy the top-left and top pixels.
+            //for (int i = 0; i < width + 1; i++)
+            //    ref_pix[-1 + i] = srcPix[i];
+
+            memcpy(&ref_pix[-1], srcPix, (width + 1)*sizeof(pixel));
+            ref = ref_pix;
+        }
+        else // Use the top and top-right neighbours.
+        {
+            ref = srcPix + 1;
+        }
+
+        // Pass every row.
+        int angleSum = 0;
+        for (int y = 0; y < width; y++)
+        {
+            angleSum += angle;
+            int offset = angleSum >> 5;
+            int fraction = angleSum & 31;
+
+            if (fraction) // Interpolate
+            {
+                if (width >= 8 && sizeof(pixel) == 1)
+                {
+                    const int16x8_t f0 = vdupq_n_s16(32 - fraction);
+                    const int16x8_t f1 = vdupq_n_s16(fraction);
+                    for (int x = 0; x < width; x += 8)
+                    {
+                        uint8x8_t in0 = *(uint8x8_t *)&ref[offset + x];
+                        uint8x8_t in1 = *(uint8x8_t *)&ref[offset + x + 1];
+                        int16x8_t lo = vmlaq_s16(vdupq_n_s16(16), vmovl_u8(in0), f0);
+                        lo = vmlaq_s16(lo, vmovl_u8(in1), f1);
+                        lo = vshrq_n_s16(lo, 5);
+                        *(uint8x8_t *)&dst[y * dstStride + x] = vmovn_u16(lo);
+                    }
+                }
+                else if (width >= 4 && sizeof(pixel) == 2)
+                {
+                    const int32x4_t f0 = vdupq_n_s32(32 - fraction);
+                    const int32x4_t f1 = vdupq_n_s32(fraction);
+                    for (int x = 0; x < width; x += 4)
+                    {
+                        uint16x4_t in0 = *(uint16x4_t *)&ref[offset + x];
+                        uint16x4_t in1 = *(uint16x4_t *)&ref[offset + x + 1];
+                        int32x4_t lo = vmlaq_s32(vdupq_n_s32(16), vmovl_u16(in0), f0);
+                        lo = vmlaq_s32(lo, vmovl_u16(in1), f1);
+                        lo = vshrq_n_s32(lo, 5);
+                        *(uint16x4_t *)&dst[y * dstStride + x] = vmovn_u32(lo);
+                    }
+                }
+                else
+                {
+                    for (int x = 0; x < width; x++)
+                    {
+                        dst[y * dstStride + x] = (pixel)(((32 - fraction) * ref[offset + x] + fraction * ref[offset + x + 1] + 16) >> 5);
+                    }
+                }
+            }
+            else // Copy.
+            {
+                memcpy(&dst[y * dstStride], &ref[offset], sizeof(pixel)*width);
+            }
+        }
+    }
+
+    // Flip for horizontal.
+    if (horMode)
+    {
+        if (width == 8)
+        {
+            transpose8x8(dst, dst, dstStride, dstStride);
+        }
+        else if (width == 16)
+        {
+            transpose16x16(dst, dst, dstStride, dstStride);
+        }
+        else if (width == 32)
+        {
+            transpose32x32(dst, dst, dstStride, dstStride);
+        }
+        else
+        {
+            for (int y = 0; y < width - 1; y++)
+            {
+                for (int x = y + 1; x < width; x++)
+                {
+                    pixel tmp = dst[y * dstStride + x];
+                    dst[y * dstStride + x] = dst[x * dstStride + y];
+                    dst[x * dstStride + y] = tmp;
+                }
+            }
+        }
+    }
+}
+
+template<int log2Size>
+void all_angs_pred_neon(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+{
+    const int size = 1 << log2Size;
+    for (int mode = 2; mode <= 34; mode++)
+    {
+        pixel *srcPix = (g_intraFilterFlags[mode] & size ? filtPix : refPix);
+        pixel *out = dest + ((mode - 2) << (log2Size * 2));
+
+        intra_pred_ang_neon<size>(out, size, srcPix, mode, bLuma);
+
+        // Optimize code don't flip buffer
+        bool modeHor = (mode < 18);
+
+        // transpose the block if this is a horizontal mode
+        if (modeHor)
+        {
+            if (size == 8)
+            {
+                transpose8x8(out, out, size, size);
+            }
+            else if (size == 16)
+            {
+                transpose16x16(out, out, size, size);
+            }
+            else if (size == 32)
+            {
+                transpose32x32(out, out, size, size);
+            }
+            else
+            {
+                for (int k = 0; k < size - 1; k++)
+                {
+                    for (int l = k + 1; l < size; l++)
+                    {
+                        pixel tmp = out[k * size + l];
+                        out[k * size + l] = out[l * size + k];
+                        out[l * size + k] = tmp;
+                    }
+                }
+            }
+        }
+    }
+}
+}
+
+namespace X265_NS
+{
+// x265 private namespace
+
+void setupIntraPrimitives_neon(EncoderPrimitives &p)
+{
+    for (int i = 2; i < NUM_INTRA_MODE; i++)
+    {
+        p.cu[BLOCK_8x8].intra_pred[i] = intra_pred_ang_neon<8>;
+        p.cu[BLOCK_16x16].intra_pred[i] = intra_pred_ang_neon<16>;
+        p.cu[BLOCK_32x32].intra_pred[i] = intra_pred_ang_neon<32>;
+    }
+    p.cu[BLOCK_4x4].intra_pred[2] = intra_pred_ang_neon<4>;
+    p.cu[BLOCK_4x4].intra_pred[10] = intra_pred_ang_neon<4>;
+    p.cu[BLOCK_4x4].intra_pred[18] = intra_pred_ang_neon<4>;
+    p.cu[BLOCK_4x4].intra_pred[26] = intra_pred_ang_neon<4>;
+    p.cu[BLOCK_4x4].intra_pred[34] = intra_pred_ang_neon<4>;
+
+    p.cu[BLOCK_4x4].intra_pred_allangs = all_angs_pred_neon<2>;
+    p.cu[BLOCK_8x8].intra_pred_allangs = all_angs_pred_neon<3>;
+    p.cu[BLOCK_16x16].intra_pred_allangs = all_angs_pred_neon<4>;
+    p.cu[BLOCK_32x32].intra_pred_allangs = all_angs_pred_neon<5>;
+}
+
+}
+
+
+
+#else
+
+namespace X265_NS
+{
+// x265 private namespace
+void setupIntraPrimitives_neon(EncoderPrimitives &p)
+{}
+}
+
+#endif
+
+
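For reference, the per-sample formula that the NEON loops above vectorise is the usual 1/32-pel blend of two neighbouring reference pixels, weighted by how far the projected position falls between them. The stand-alone C++ sketch below (simplified, hypothetical predict_row signature and buffer layout, not the primitive's real interface) shows the scalar form; with fraction 13 and reference pixels 100 and 132 it yields (19*100 + 13*132 + 16) >> 5 = 113.

    #include <cstdint>

    // One row of angular prediction: two-tap blend of adjacent reference pixels.
    static void predict_row(std::uint8_t *dst, const std::uint8_t *ref,
                            int width, int offset, int fraction)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (std::uint8_t)(((32 - fraction) * ref[offset + x] +
                                     fraction * ref[offset + x + 1] + 16) >> 5);
    }

    int main()
    {
        std::uint8_t ref[3] = { 0, 100, 132 }, dst[1];
        predict_row(dst, ref, 1, 1, 13);   // (19*100 + 13*132 + 16) >> 5 = 113
        return dst[0] == 113 ? 0 : 1;
    }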
View file
x265_3.6.tar.gz/source/common/aarch64/intrapred-prim.h
Added
@@ -0,0 +1,15 @@
+#ifndef INTRAPRED_PRIM_H__
+
+#if defined(__aarch64__)
+
+namespace X265_NS
+{
+// x265 private namespace
+
+void setupIntraPrimitives_neon(EncoderPrimitives &p);
+}
+
+#endif
+
+#endif
+
View file
x265_3.6.tar.gz/source/common/aarch64/ipfilter-common.S
Added
@@ -0,0 +1,1436 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// This file contains the macros written using NEON instruction set +// that are also used by the SVE2 functions + +// Macros below follow these conventions: +// - input data in registers: v0, v1, v2, v3, v4, v5, v6, v7 +// - constants in registers: v24, v25, v26, v27, v31 +// - temporary registers: v16, v17, v18, v19, v20, v21, v22, v23, v28, v29, v30. +// - _32b macros output a result in v17.4s +// - _64b and _32b_1 macros output results in v17.4s, v18.4s + +#include "asm.S" + +.arch armv8-a + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.macro vextin8 v + ldp d6, d7, x11, #16 +.if \v == 0 + // qpel_filter_0 only uses values in v3 + ext v3.8b, v6.8b, v7.8b, #4 +.else +.if \v != 3 + ext v0.8b, v6.8b, v7.8b, #1 +.endif + ext v1.8b, v6.8b, v7.8b, #2 + ext v2.8b, v6.8b, v7.8b, #3 + ext v3.8b, v6.8b, v7.8b, #4 + ext v4.8b, v6.8b, v7.8b, #5 + ext v5.8b, v6.8b, v7.8b, #6 + ext v6.8b, v6.8b, v7.8b, #7 +.endif +.endm + +.macro vextin8_64 v + ldp q6, q7, x11, #32 +.if \v == 0 + // qpel_filter_0 only uses values in v3 + ext v3.16b, v6.16b, v7.16b, #4 +.else +.if \v != 3 + // qpel_filter_3 does not use values in v0 + ext v0.16b, v6.16b, v7.16b, #1 +.endif + ext v1.16b, v6.16b, v7.16b, #2 + ext v2.16b, v6.16b, v7.16b, #3 + ext v3.16b, v6.16b, v7.16b, #4 + ext v4.16b, v6.16b, v7.16b, #5 + ext v5.16b, v6.16b, v7.16b, #6 +.if \v == 1 + ext v6.16b, v6.16b, v7.16b, #7 + // qpel_filter_1 does not use v7 +.else + ext v16.16b, v6.16b, v7.16b, #7 + ext v7.16b, v6.16b, v7.16b, #8 + mov v6.16b, v16.16b +.endif +.endif +.endm + +.macro vextin8_chroma v + ldp d6, d7, x11, #16 +.if \v == 0 + // qpel_filter_chroma_0 only uses values in v1 + ext v1.8b, v6.8b, v7.8b, #2 +.else + ext v0.8b, v6.8b, v7.8b, #1 + ext v1.8b, v6.8b, v7.8b, #2 + ext v2.8b, v6.8b, v7.8b, #3 + ext v3.8b, v6.8b, v7.8b, #4 +.endif +.endm + +.macro vextin8_chroma_64 v + ldp q16, q17, x11, #32 +.if \v == 0 + // qpel_filter_chroma_0 only uses values in v1 + ext v1.16b, v16.16b, v17.16b, #2 +.else + ext v0.16b, v16.16b, v17.16b, #1 + ext v1.16b, v16.16b, v17.16b, #2 + ext v2.16b, v16.16b, v17.16b, #3 + ext v3.16b, v16.16b, v17.16b, #4 +.endif +.endm + +.macro qpel_load_32b v +.if \v == 0 + add x6, x6, x11 // do not load 3 values that are not used in qpel_filter_0 + ld1 {v3.8b}, x6, x1 +.elseif \v == 1 || \v == 2 || \v == 3 +.if \v != 3 // not used in qpel_filter_3 + ld1 
{v0.8b}, x6, x1 +.else + add x6, x6, x1 +.endif + ld1 {v1.8b}, x6, x1 + ld1 {v2.8b}, x6, x1 + ld1 {v3.8b}, x6, x1 + ld1 {v4.8b}, x6, x1 + ld1 {v5.8b}, x6, x1 +.if \v != 1 // not used in qpel_filter_1 + ld1 {v6.8b}, x6, x1 + ld1 {v7.8b}, x6 +.else + ld1 {v6.8b}, x6 +.endif +.endif +.endm + +.macro qpel_load_64b v +.if \v == 0 + add x6, x6, x11 // do not load 3 values that are not used in qpel_filter_0 + ld1 {v3.16b}, x6, x1 +.elseif \v == 1 || \v == 2 || \v == 3 +.if \v != 3 // not used in qpel_filter_3 + ld1 {v0.16b}, x6, x1 +.else + add x6, x6, x1 +.endif + ld1 {v1.16b}, x6, x1 + ld1 {v2.16b}, x6, x1 + ld1 {v3.16b}, x6, x1 + ld1 {v4.16b}, x6, x1 + ld1 {v5.16b}, x6, x1 +.if \v != 1 // not used in qpel_filter_1 + ld1 {v6.16b}, x6, x1 + ld1 {v7.16b}, x6 +.else + ld1 {v6.16b}, x6 +.endif +.endif +.endm + +.macro qpel_chroma_load_32b v +.if \v == 0 + // qpel_filter_chroma_0 only uses values in v1 + add x6, x6, x1 + ldr d1, x6 +.else + ld1 {v0.8b}, x6, x1 + ld1 {v1.8b}, x6, x1 + ld1 {v2.8b}, x6, x1 + ld1 {v3.8b}, x6 +.endif +.endm + +.macro qpel_chroma_load_64b v +.if \v == 0 + // qpel_filter_chroma_0 only uses values in v1 + add x6, x6, x1 + ldr q1, x6 +.else + ld1 {v0.16b}, x6, x1 + ld1 {v1.16b}, x6, x1 + ld1 {v2.16b}, x6, x1 + ld1 {v3.16b}, x6 +.endif +.endm + +// a, b, c, d, e, f, g, h +// .hword 0, 0, 0, 64, 0, 0, 0, 0 +.macro qpel_start_0 + movi v24.16b, #64 +.endm + +.macro qpel_filter_0_32b + umull v17.8h, v3.8b, v24.8b // 64*d +.endm + +.macro qpel_filter_0_64b + qpel_filter_0_32b + umull2 v18.8h, v3.16b, v24.16b // 64*d +.endm + +.macro qpel_start_0_1 + movi v24.8h, #64 +.endm + +.macro qpel_filter_0_32b_1 + smull v17.4s, v3.4h, v24.4h // 64*d0 + smull2 v18.4s, v3.8h, v24.8h // 64*d1 +.endm + +// a, b, c, d, e, f, g, h +// .hword -1, 4, -10, 58, 17, -5, 1, 0 +.macro qpel_start_1 + movi v24.16b, #58 + movi v25.16b, #10 + movi v26.16b, #17 + movi v27.16b, #5 +.endm + +.macro qpel_filter_1_32b + umull v19.8h, v2.8b, v25.8b // c*10 + umull v17.8h, v3.8b, v24.8b // d*58 + umull v21.8h, v4.8b, v26.8b // e*17 + umull v23.8h, v5.8b, v27.8b // f*5 + sub v17.8h, v17.8h, v19.8h // d*58 - c*10 + ushll v18.8h, v1.8b, #2 // b*4 + add v17.8h, v17.8h, v21.8h // d*58 - c*10 + e*17 + usubl v21.8h, v6.8b, v0.8b // g - a + add v17.8h, v17.8h, v18.8h // d*58 - c*10 + e*17 + b*4 + sub v21.8h, v21.8h, v23.8h // g - a - f*5 + add v17.8h, v17.8h, v21.8h // d*58 - c*10 + e*17 + b*4 + g - a - f*5 +.endm + +.macro qpel_filter_1_64b + qpel_filter_1_32b + umull2 v20.8h, v2.16b, v25.16b // c*10 + umull2 v18.8h, v3.16b, v24.16b // d*58 + umull2 v21.8h, v4.16b, v26.16b // e*17 + umull2 v23.8h, v5.16b, v27.16b // f*5 + sub v18.8h, v18.8h, v20.8h // d*58 - c*10 + ushll2 v28.8h, v1.16b, #2 // b*4 + add v18.8h, v18.8h, v21.8h // d*58 - c*10 + e*17 + usubl2 v21.8h, v6.16b, v0.16b // g - a + add v18.8h, v18.8h, v28.8h // d*58 - c*10 + e*17 + b*4 + sub v21.8h, v21.8h, v23.8h // g - a - f*5 + add v18.8h, v18.8h, v21.8h // d*58 - c*10 + e*17 + b*4 + g - a - f*5 +.endm + +.macro qpel_start_1_1 + movi v24.8h, #58 + movi v25.8h, #10 + movi v26.8h, #17 + movi v27.8h, #5 +.endm + +.macro qpel_filter_1_32b_1 + smull v17.4s, v3.4h, v24.4h // 58 * d0 + smull2 v18.4s, v3.8h, v24.8h // 58 * d1 + smull v19.4s, v2.4h, v25.4h // 10 * c0 + smull2 v20.4s, v2.8h, v25.8h // 10 * c1 + smull v21.4s, v4.4h, v26.4h // 17 * e0 + smull2 v22.4s, v4.8h, v26.8h // 17 * e1 + smull v23.4s, v5.4h, v27.4h // 5 * f0 + smull2 v16.4s, v5.8h, v27.8h // 5 * f1 + sub v17.4s, v17.4s, v19.4s // 58 * d0 - 10 * c0 + sub v18.4s, v18.4s, v20.4s // 58 * d1 - 10 * c1 + 
sshll v19.4s, v1.4h, #2 // 4 * b0 + sshll2 v20.4s, v1.8h, #2 // 4 * b1 + add v17.4s, v17.4s, v21.4s // 58 * d0 - 10 * c0 + 17 * e0 + add v18.4s, v18.4s, v22.4s // 58 * d1 - 10 * c1 + 17 * e1 + ssubl v21.4s, v6.4h, v0.4h // g0 - a0 + ssubl2 v22.4s, v6.8h, v0.8h // g1 - a1 + add v17.4s, v17.4s, v19.4s // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + add v18.4s, v18.4s, v20.4s // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + sub v21.4s, v21.4s, v23.4s // g0 - a0 - 5 * f0 + sub v22.4s, v22.4s, v16.4s // g1 - a1 - 5 * f1 + add v17.4s, v17.4s, v21.4s // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + g0 - a0 - 5 * f0 + add v18.4s, v18.4s, v22.4s // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + g1 - a1 - 5 * f1 +.endm + +// a, b, c, d, e, f, g, h +// .hword -1, 4, -11, 40, 40, -11, 4, -1 +.macro qpel_start_2 + movi v24.8h, #11 + movi v25.8h, #40 +.endm + +.macro qpel_filter_2_32b + uaddl v17.8h, v3.8b, v4.8b // d + e + uaddl v19.8h, v2.8b, v5.8b // c + f + uaddl v23.8h, v1.8b, v6.8b // b + g + uaddl v21.8h, v0.8b, v7.8b // a + h + mul v17.8h, v17.8h, v25.8h // 40 * (d + e) + mul v19.8h, v19.8h, v24.8h // 11 * (c + f) + shl v23.8h, v23.8h, #2 // (b + g) * 4 + add v19.8h, v19.8h, v21.8h // 11 * (c + f) + a + h + add v17.8h, v17.8h, v23.8h // 40 * (d + e) + (b + g) * 4 + sub v17.8h, v17.8h, v19.8h // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h +.endm + +.macro qpel_filter_2_64b + qpel_filter_2_32b + uaddl2 v27.8h, v3.16b, v4.16b // d + e + uaddl2 v16.8h, v2.16b, v5.16b // c + f + uaddl2 v23.8h, v1.16b, v6.16b // b + g + uaddl2 v21.8h, v0.16b, v7.16b // a + h + mul v27.8h, v27.8h, v25.8h // 40 * (d + e) + mul v16.8h, v16.8h, v24.8h // 11 * (c + f) + shl v23.8h, v23.8h, #2 // (b + g) * 4 + add v16.8h, v16.8h, v21.8h // 11 * (c + f) + a + h + add v27.8h, v27.8h, v23.8h // 40 * (d + e) + (b + g) * 4 + sub v18.8h, v27.8h, v16.8h // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h +.endm + +.macro qpel_start_2_1 + movi v24.4s, #11 + movi v25.4s, #40 +.endm + +.macro qpel_filter_2_32b_1 + saddl v17.4s, v3.4h, v4.4h // d0 + e0 + saddl2 v18.4s, v3.8h, v4.8h // d1 + e1 + saddl v19.4s, v2.4h, v5.4h // c0 + f0 + saddl2 v20.4s, v2.8h, v5.8h // c1 + f1 + mul v19.4s, v19.4s, v24.4s // 11 * (c0 + f0) + mul v20.4s, v20.4s, v24.4s // 11 * (c1 + f1) + saddl v23.4s, v1.4h, v6.4h // b0 + g0 + mul v17.4s, v17.4s, v25.4s // 40 * (d0 + e0) + mul v18.4s, v18.4s, v25.4s // 40 * (d1 + e1) + saddl2 v16.4s, v1.8h, v6.8h // b1 + g1 + saddl v21.4s, v0.4h, v7.4h // a0 + h0 + saddl2 v22.4s, v0.8h, v7.8h // a1 + h1 + shl v23.4s, v23.4s, #2 // 4*(b0+g0) + shl v16.4s, v16.4s, #2 // 4*(b1+g1) + add v19.4s, v19.4s, v21.4s // 11 * (c0 + f0) + a0 + h0 + add v20.4s, v20.4s, v22.4s // 11 * (c1 + f1) + a1 + h1 + add v17.4s, v17.4s, v23.4s // 40 * (d0 + e0) + 4*(b0+g0) + add v18.4s, v18.4s, v16.4s // 40 * (d1 + e1) + 4*(b1+g1) + sub v17.4s, v17.4s, v19.4s // 40 * (d0 + e0) + 4*(b0+g0) - (11 * (c0 + f0) + a0 + h0) + sub v18.4s, v18.4s, v20.4s // 40 * (d1 + e1) + 4*(b1+g1) - (11 * (c1 + f1) + a1 + h1) +.endm + +// a, b, c, d, e, f, g, h +// .hword 0, 1, -5, 17, 58, -10, 4, -1 +.macro qpel_start_3 + movi v24.16b, #17 + movi v25.16b, #5 + movi v26.16b, #58 + movi v27.16b, #10 +.endm + +.macro qpel_filter_3_32b + umull v19.8h, v2.8b, v25.8b // c * 5 + umull v17.8h, v3.8b, v24.8b // d * 17 + umull v21.8h, v4.8b, v26.8b // e * 58 + umull v23.8h, v5.8b, v27.8b // f * 10 + sub v17.8h, v17.8h, v19.8h // d * 17 - c * 5 + ushll v19.8h, v6.8b, #2 // g * 4 + add v17.8h, v17.8h, v21.8h // d * 17 - c * 5 + e * 58 + usubl v21.8h, v1.8b, v7.8b // b - h + add v17.8h, v17.8h, v19.8h 
// d * 17 - c * 5 + e * 58 + g * 4 + sub v21.8h, v21.8h, v23.8h // b - h - f * 10 + add v17.8h, v17.8h, v21.8h // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10 +.endm + +.macro qpel_filter_3_64b + qpel_filter_3_32b + umull2 v16.8h, v2.16b, v25.16b // c * 5 + umull2 v18.8h, v3.16b, v24.16b // d * 17 + umull2 v21.8h, v4.16b, v26.16b // e * 58 + umull2 v23.8h, v5.16b, v27.16b // f * 10 + sub v18.8h, v18.8h, v16.8h // d * 17 - c * 5 + ushll2 v16.8h, v6.16b, #2 // g * 4 + add v18.8h, v18.8h, v21.8h // d * 17 - c * 5 + e * 58 + usubl2 v21.8h, v1.16b, v7.16b // b - h + add v18.8h, v18.8h, v16.8h // d * 17 - c * 5 + e * 58 + g * 4 + sub v21.8h, v21.8h, v23.8h // b - h - f * 10 + add v18.8h, v18.8h, v21.8h // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10 +.endm + +.macro qpel_start_3_1 + movi v24.8h, #17 + movi v25.8h, #5 + movi v26.8h, #58 + movi v27.8h, #10 +.endm + +.macro qpel_filter_3_32b_1 + smull v17.4s, v3.4h, v24.4h // 17 * d0 + smull2 v18.4s, v3.8h, v24.8h // 17 * d1 + smull v19.4s, v2.4h, v25.4h // 5 * c0 + smull2 v20.4s, v2.8h, v25.8h // 5 * c1 + smull v21.4s, v4.4h, v26.4h // 58 * e0 + smull2 v22.4s, v4.8h, v26.8h // 58 * e1 + smull v23.4s, v5.4h, v27.4h // 10 * f0 + smull2 v16.4s, v5.8h, v27.8h // 10 * f1 + sub v17.4s, v17.4s, v19.4s // 17 * d0 - 5 * c0 + sub v18.4s, v18.4s, v20.4s // 17 * d1 - 5 * c1 + sshll v19.4s, v6.4h, #2 // 4 * g0 + sshll2 v20.4s, v6.8h, #2 // 4 * g1 + add v17.4s, v17.4s, v21.4s // 17 * d0 - 5 * c0 + 58 * e0 + add v18.4s, v18.4s, v22.4s // 17 * d1 - 5 * c1 + 58 * e1 + ssubl v21.4s, v1.4h, v7.4h // b0 - h0 + ssubl2 v22.4s, v1.8h, v7.8h // b1 - h1 + add v17.4s, v17.4s, v19.4s // 17 * d0 - 5 * c0 + 58 * e0 + 4 * g0 + add v18.4s, v18.4s, v20.4s // 17 * d1 - 5 * c1 + 58 * e1 + 4 * g1 + sub v21.4s, v21.4s, v23.4s // b0 - h0 - 10 * f0 + sub v22.4s, v22.4s, v16.4s // b1 - h1 - 10 * f1 + add v17.4s, v17.4s, v21.4s // 17 * d0 - 5 * c0 + 58 * e0 + 4 * g0 + b0 - h0 - 10 * f0 + add v18.4s, v18.4s, v22.4s // 17 * d1 - 5 * c1 + 58 * e1 + 4 * g1 + b1 - h1 - 10 * f1 +.endm + +.macro qpel_start_chroma_0 + movi v24.16b, #64 +.endm + +.macro qpel_filter_chroma_0_32b + umull v17.8h, v1.8b, v24.8b // 64*b +.endm + +.macro qpel_filter_chroma_0_64b + umull v17.8h, v1.8b, v24.8b // 64*b + umull2 v18.8h, v1.16b, v24.16b // 64*b +.endm + +.macro qpel_start_chroma_0_1 + movi v24.8h, #64 +.endm + +.macro qpel_filter_chroma_0_32b_1 + smull v17.4s, v1.4h, v24.4h // 64*b0 + smull2 v18.4s, v1.8h, v24.8h // 64*b1 +.endm + +.macro qpel_start_chroma_1 + movi v24.16b, #58 + movi v25.16b, #10 +.endm + +.macro qpel_filter_chroma_1_32b + umull v17.8h, v1.8b, v24.8b // 58 * b + umull v19.8h, v2.8b, v25.8b // 10 * c + uaddl v22.8h, v0.8b, v3.8b // a + d + shl v22.8h, v22.8h, #1 // 2 * (a+d) + sub v17.8h, v17.8h, v22.8h // 58*b - 2*(a+d) + add v17.8h, v17.8h, v19.8h // 58*b-2*(a+d) + 10*c +.endm + +.macro qpel_filter_chroma_1_64b + umull v17.8h, v1.8b, v24.8b // 58 * b + umull2 v18.8h, v1.16b, v24.16b // 58 * b + umull v19.8h, v2.8b, v25.8b // 10 * c + umull2 v20.8h, v2.16b, v25.16b // 10 * c + uaddl v22.8h, v0.8b, v3.8b // a + d + uaddl2 v23.8h, v0.16b, v3.16b // a + d + shl v22.8h, v22.8h, #1 // 2 * (a+d) + shl v23.8h, v23.8h, #1 // 2 * (a+d) + sub v17.8h, v17.8h, v22.8h // 58*b - 2*(a+d) + sub v18.8h, v18.8h, v23.8h // 58*b - 2*(a+d) + add v17.8h, v17.8h, v19.8h // 58*b-2*(a+d) + 10*c + add v18.8h, v18.8h, v20.8h // 58*b-2*(a+d) + 10*c +.endm + +.macro qpel_start_chroma_1_1 + movi v24.8h, #58 + movi v25.8h, #10 +.endm + +.macro qpel_filter_chroma_1_32b_1 + smull v17.4s, v1.4h, v24.4h // 58 * b0 
+ smull2 v18.4s, v1.8h, v24.8h // 58 * b1 + smull v19.4s, v2.4h, v25.4h // 10 * c0 + smull2 v20.4s, v2.8h, v25.8h // 10 * c1 + add v22.8h, v0.8h, v3.8h // a + d + sshll v21.4s, v22.4h, #1 // 2 * (a0+d0) + sshll2 v22.4s, v22.8h, #1 // 2 * (a1+d1) + sub v17.4s, v17.4s, v21.4s // 58*b0 - 2*(a0+d0) + sub v18.4s, v18.4s, v22.4s // 58*b1 - 2*(a1+d1) + add v17.4s, v17.4s, v19.4s // 58*b0-2*(a0+d0) + 10*c0 + add v18.4s, v18.4s, v20.4s // 58*b1-2*(a1+d1) + 10*c1 +.endm + +.macro qpel_start_chroma_2 + movi v25.16b, #54 +.endm + +.macro qpel_filter_chroma_2_32b + umull v17.8h, v1.8b, v25.8b // 54 * b + ushll v19.8h, v0.8b, #2 // 4 * a + ushll v21.8h, v2.8b, #4 // 16 * c + ushll v23.8h, v3.8b, #1 // 2 * d + add v17.8h, v17.8h, v21.8h // 54*b + 16*c + add v19.8h, v19.8h, v23.8h // 4*a + 2*d + sub v17.8h, v17.8h, v19.8h // 54*b+16*c - (4*a+2*d) +.endm + +.macro qpel_filter_chroma_2_64b + umull v17.8h, v1.8b, v25.8b // 54 * b + umull2 v18.8h, v1.16b, v25.16b // 54 * b + ushll v19.8h, v0.8b, #2 // 4 * a + ushll2 v20.8h, v0.16b, #2 // 4 * a + ushll v21.8h, v2.8b, #4 // 16 * c + ushll2 v22.8h, v2.16b, #4 // 16 * c + ushll v23.8h, v3.8b, #1 // 2 * d + ushll2 v24.8h, v3.16b, #1 // 2 * d + add v17.8h, v17.8h, v21.8h // 54*b + 16*c + add v18.8h, v18.8h, v22.8h // 54*b + 16*c + add v19.8h, v19.8h, v23.8h // 4*a + 2*d + add v20.8h, v20.8h, v24.8h // 4*a + 2*d + sub v17.8h, v17.8h, v19.8h // 54*b+16*c - (4*a+2*d) + sub v18.8h, v18.8h, v20.8h // 54*b+16*c - (4*a+2*d) +.endm + +.macro qpel_start_chroma_2_1 + movi v25.8h, #54 +.endm + +.macro qpel_filter_chroma_2_32b_1 + smull v17.4s, v1.4h, v25.4h // 54 * b0 + smull2 v18.4s, v1.8h, v25.8h // 54 * b1 + sshll v19.4s, v0.4h, #2 // 4 * a0 + sshll2 v20.4s, v0.8h, #2 // 4 * a1 + sshll v21.4s, v2.4h, #4 // 16 * c0 + sshll2 v22.4s, v2.8h, #4 // 16 * c1 + sshll v23.4s, v3.4h, #1 // 2 * d0 + sshll2 v24.4s, v3.8h, #1 // 2 * d1 + add v17.4s, v17.4s, v21.4s // 54*b0 + 16*c0 + add v18.4s, v18.4s, v22.4s // 54*b1 + 16*c1 + add v19.4s, v19.4s, v23.4s // 4*a0 + 2*d0 + add v20.4s, v20.4s, v24.4s // 4*a1 + 2*d1 + sub v17.4s, v17.4s, v19.4s // 54*b0+16*c0 - (4*a0+2*d0) + sub v18.4s, v18.4s, v20.4s // 54*b1+16*c1 - (4*a1+2*d1) +.endm + +.macro qpel_start_chroma_3 + movi v25.16b, #46 + movi v26.16b, #28 + movi v27.16b, #6 +.endm + +.macro qpel_filter_chroma_3_32b + umull v17.8h, v1.8b, v25.8b // 46 * b + umull v19.8h, v2.8b, v26.8b // 28 * c + ushll v21.8h, v3.8b, #2 // 4 * d + umull v23.8h, v0.8b, v27.8b // 6 * a + add v17.8h, v17.8h, v19.8h // 46*b + 28*c + add v21.8h, v21.8h, v23.8h // 4*d + 6*a + sub v17.8h, v17.8h, v21.8h // 46*b+28*c - (4*d+6*a) +.endm + +.macro qpel_filter_chroma_3_64b + umull v17.8h, v1.8b, v25.8b // 46 * b + umull2 v18.8h, v1.16b, v25.16b // 46 * b + umull v19.8h, v2.8b, v26.8b // 28 * c + umull2 v20.8h, v2.16b, v26.16b // 28 * c + ushll v21.8h, v3.8b, #2 // 4 * d + ushll2 v22.8h, v3.16b, #2 // 4 * d + umull v23.8h, v0.8b, v27.8b // 6 * a + umull2 v24.8h, v0.16b, v27.16b // 6 * a + add v17.8h, v17.8h, v19.8h // 46*b + 28*c + add v18.8h, v18.8h, v20.8h // 46*b + 28*c + add v21.8h, v21.8h, v23.8h // 4*d + 6*a + add v22.8h, v22.8h, v24.8h // 4*d + 6*a + sub v17.8h, v17.8h, v21.8h // 46*b+28*c - (4*d+6*a) + sub v18.8h, v18.8h, v22.8h // 46*b+28*c - (4*d+6*a) +.endm + +.macro qpel_start_chroma_3_1 + movi v25.8h, #46 + movi v26.8h, #28 + movi v27.8h, #6 +.endm + +.macro qpel_filter_chroma_3_32b_1 + smull v17.4s, v1.4h, v25.4h // 46 * b0 + smull2 v18.4s, v1.8h, v25.8h // 46 * b1 + smull v19.4s, v2.4h, v26.4h // 28 * c0 + smull2 v20.4s, v2.8h, v26.8h // 28 * c1 + sshll 
v21.4s, v3.4h, #2 // 4 * d0 + sshll2 v22.4s, v3.8h, #2 // 4 * d1 + smull v23.4s, v0.4h, v27.4h // 6 * a0 + smull2 v24.4s, v0.8h, v27.8h // 6 * a1 + add v17.4s, v17.4s, v19.4s // 46*b0 + 28*c0 + add v18.4s, v18.4s, v20.4s // 46*b1 + 28*c1 + add v21.4s, v21.4s, v23.4s // 4*d0 + 6*a0 + add v22.4s, v22.4s, v24.4s // 4*d1 + 6*a1 + sub v17.4s, v17.4s, v21.4s // 46*b0+28*c0 - (4*d0+6*a0) + sub v18.4s, v18.4s, v22.4s // 46*b1+28*c1 - (4*d1+6*a1) +.endm + +.macro qpel_start_chroma_4 + movi v24.8h, #36 +.endm + +.macro qpel_filter_chroma_4_32b + uaddl v20.8h, v0.8b, v3.8b // a + d + uaddl v17.8h, v1.8b, v2.8b // b + c + shl v20.8h, v20.8h, #2 // 4 * (a+d) + mul v17.8h, v17.8h, v24.8h // 36 * (b+c) + sub v17.8h, v17.8h, v20.8h // 36*(b+c) - 4*(a+d) +.endm + +.macro qpel_filter_chroma_4_64b + uaddl v20.8h, v0.8b, v3.8b // a + d + uaddl2 v21.8h, v0.16b, v3.16b // a + d + uaddl v17.8h, v1.8b, v2.8b // b + c + uaddl2 v18.8h, v1.16b, v2.16b // b + c + shl v20.8h, v20.8h, #2 // 4 * (a+d) + shl v21.8h, v21.8h, #2 // 4 * (a+d) + mul v17.8h, v17.8h, v24.8h // 36 * (b+c) + mul v18.8h, v18.8h, v24.8h // 36 * (b+c) + sub v17.8h, v17.8h, v20.8h // 36*(b+c) - 4*(a+d) + sub v18.8h, v18.8h, v21.8h // 36*(b+c) - 4*(a+d) +.endm + +.macro qpel_start_chroma_4_1 + movi v24.8h, #36 +.endm + +.macro qpel_filter_chroma_4_32b_1 + add v20.8h, v0.8h, v3.8h // a + d + add v21.8h, v1.8h, v2.8h // b + c + smull v17.4s, v21.4h, v24.4h // 36 * (b0+c0) + smull2 v18.4s, v21.8h, v24.8h // 36 * (b1+c1) + sshll v21.4s, v20.4h, #2 // 4 * (a0+d0) + sshll2 v22.4s, v20.8h, #2 // 4 * (a1+d1) + sub v17.4s, v17.4s, v21.4s // 36*(b0+c0) - 4*(a0+d0) + sub v18.4s, v18.4s, v22.4s // 36*(b1+c1) - 4*(a1+d1) +.endm + +.macro qpel_start_chroma_5 + movi v25.16b, #28 + movi v26.16b, #46 + movi v27.16b, #6 +.endm + +.macro qpel_filter_chroma_5_32b + umull v17.8h, v1.8b, v25.8b // 28 * b + umull v19.8h, v2.8b, v26.8b // 46 * c + ushll v21.8h, v0.8b, #2 // 4 * a + umull v23.8h, v3.8b, v27.8b // 6 * d + add v17.8h, v17.8h, v19.8h // 28*b + 46*c + add v21.8h, v21.8h, v23.8h // 4*a + 6*d + sub v17.8h, v17.8h, v21.8h // 28*b+46*c - (4*a+6*d) +.endm + +.macro qpel_filter_chroma_5_64b + umull v17.8h, v1.8b, v25.8b // 28 * b + umull2 v18.8h, v1.16b, v25.16b // 28 * b + umull v19.8h, v2.8b, v26.8b // 46 * c + umull2 v20.8h, v2.16b, v26.16b // 46 * c + ushll v21.8h, v0.8b, #2 // 4 * a + ushll2 v22.8h, v0.16b, #2 // 4 * a + umull v23.8h, v3.8b, v27.8b // 6 * d + umull2 v24.8h, v3.16b, v27.16b // 6 * d + add v17.8h, v17.8h, v19.8h // 28*b + 46*c + add v18.8h, v18.8h, v20.8h // 28*b + 46*c + add v21.8h, v21.8h, v23.8h // 4*a + 6*d + add v22.8h, v22.8h, v24.8h // 4*a + 6*d + sub v17.8h, v17.8h, v21.8h // 28*b+46*c - (4*a+6*d) + sub v18.8h, v18.8h, v22.8h // 28*b+46*c - (4*a+6*d) +.endm + +.macro qpel_start_chroma_5_1 + movi v25.8h, #28 + movi v26.8h, #46 + movi v27.8h, #6 +.endm + +.macro qpel_filter_chroma_5_32b_1 + smull v17.4s, v1.4h, v25.4h // 28 * b0 + smull2 v18.4s, v1.8h, v25.8h // 28 * b1 + smull v19.4s, v2.4h, v26.4h // 46 * c0 + smull2 v20.4s, v2.8h, v26.8h // 46 * c1 + sshll v21.4s, v0.4h, #2 // 4 * a0 + sshll2 v22.4s, v0.8h, #2 // 4 * a1 + smull v23.4s, v3.4h, v27.4h // 6 * d0 + smull2 v24.4s, v3.8h, v27.8h // 6 * d1 + add v17.4s, v17.4s, v19.4s // 28*b0 + 46*c0 + add v18.4s, v18.4s, v20.4s // 28*b1 + 46*c1 + add v21.4s, v21.4s, v23.4s // 4*a0 + 6*d0 + add v22.4s, v22.4s, v24.4s // 4*a1 + 6*d1 + sub v17.4s, v17.4s, v21.4s // 28*b0+46*c0 - (4*a0+6*d0) + sub v18.4s, v18.4s, v22.4s // 28*b1+46*c1 - (4*a1+6*d1) +.endm + +.macro qpel_start_chroma_6 + movi 
v25.16b, #54 +.endm + +.macro qpel_filter_chroma_6_32b + umull v17.8h, v2.8b, v25.8b // 54 * c + ushll v19.8h, v0.8b, #1 // 2 * a + ushll v21.8h, v1.8b, #4 // 16 * b + ushll v23.8h, v3.8b, #2 // 4 * d + add v17.8h, v17.8h, v21.8h // 54*c + 16*b + add v19.8h, v19.8h, v23.8h // 2*a + 4*d + sub v17.8h, v17.8h, v19.8h // 54*c+16*b - (2*a+4*d) +.endm + +.macro qpel_filter_chroma_6_64b + umull v17.8h, v2.8b, v25.8b // 54 * c + umull2 v18.8h, v2.16b, v25.16b // 54 * c + ushll v19.8h, v0.8b, #1 // 2 * a + ushll2 v20.8h, v0.16b, #1 // 2 * a + ushll v21.8h, v1.8b, #4 // 16 * b + ushll2 v22.8h, v1.16b, #4 // 16 * b + ushll v23.8h, v3.8b, #2 // 4 * d + ushll2 v24.8h, v3.16b, #2 // 4 * d + add v17.8h, v17.8h, v21.8h // 54*c + 16*b + add v18.8h, v18.8h, v22.8h // 54*c + 16*b + add v19.8h, v19.8h, v23.8h // 2*a + 4*d + add v20.8h, v20.8h, v24.8h // 2*a + 4*d + sub v17.8h, v17.8h, v19.8h // 54*c+16*b - (2*a+4*d) + sub v18.8h, v18.8h, v20.8h // 54*c+16*b - (2*a+4*d) +.endm + +.macro qpel_start_chroma_6_1 + movi v25.8h, #54 +.endm + +.macro qpel_filter_chroma_6_32b_1 + smull v17.4s, v2.4h, v25.4h // 54 * c0 + smull2 v18.4s, v2.8h, v25.8h // 54 * c1 + sshll v19.4s, v0.4h, #1 // 2 * a0 + sshll2 v20.4s, v0.8h, #1 // 2 * a1 + sshll v21.4s, v1.4h, #4 // 16 * b0 + sshll2 v22.4s, v1.8h, #4 // 16 * b1 + sshll v23.4s, v3.4h, #2 // 4 * d0 + sshll2 v24.4s, v3.8h, #2 // 4 * d1 + add v17.4s, v17.4s, v21.4s // 54*c0 + 16*b0 + add v18.4s, v18.4s, v22.4s // 54*c1 + 16*b1 + add v19.4s, v19.4s, v23.4s // 2*a0 + 4*d0 + add v20.4s, v20.4s, v24.4s // 2*a1 + 4*d1 + sub v17.4s, v17.4s, v19.4s // 54*c0+16*b0 - (2*a0+4*d0) + sub v18.4s, v18.4s, v20.4s // 54*c1+16*b1 - (2*a1+4*d1) +.endm + +.macro qpel_start_chroma_7 + movi v24.16b, #58 + movi v25.16b, #10 +.endm + +.macro qpel_filter_chroma_7_32b + uaddl v20.8h, v0.8b, v3.8b // a + d + umull v17.8h, v2.8b, v24.8b // 58 * c + shl v20.8h, v20.8h, #1 // 2 * (a+d) + umull v19.8h, v1.8b, v25.8b // 10 * b + sub v17.8h, v17.8h, v20.8h // 58*c - 2*(a+d) + add v17.8h, v17.8h, v19.8h // 58*c-2*(a+d) + 10*b +.endm + +.macro qpel_filter_chroma_7_64b + uaddl v20.8h, v0.8b, v3.8b // a + d + uaddl2 v21.8h, v0.16b, v3.16b // a + d + umull v17.8h, v2.8b, v24.8b // 58 * c + umull2 v18.8h, v2.16b, v24.16b // 58 * c + shl v20.8h, v20.8h, #1 // 2 * (a+d) + shl v21.8h, v21.8h, #1 // 2 * (a+d) + umull v22.8h, v1.8b, v25.8b // 10 * b + umull2 v23.8h, v1.16b, v25.16b // 10 * b + sub v17.8h, v17.8h, v20.8h // 58*c - 2*(a+d) + sub v18.8h, v18.8h, v21.8h // 58*c - 2*(a+d) + add v17.8h, v17.8h, v22.8h // 58*c-2*(a+d) + 10*b + add v18.8h, v18.8h, v23.8h // 58*c-2*(a+d) + 10*b +.endm + +.macro qpel_start_chroma_7_1 + movi v24.8h, #58 + movi v25.8h, #10 +.endm + +.macro qpel_filter_chroma_7_32b_1 + add v20.8h, v0.8h, v3.8h // a + d + smull v17.4s, v2.4h, v24.4h // 58 * c0 + smull2 v18.4s, v2.8h, v24.8h // 58 * c1 + sshll v21.4s, v20.4h, #1 // 2 * (a0+d0) + sshll2 v22.4s, v20.8h, #1 // 2 * (a1+d1) + smull v19.4s, v1.4h, v25.4h // 10 * b0 + smull2 v20.4s, v1.8h, v25.8h // 10 * b1 + sub v17.4s, v17.4s, v21.4s // 58*c0 - 2*(a0+d0) + sub v18.4s, v18.4s, v22.4s // 58*c1 - 2*(a1+d1) + add v17.4s, v17.4s, v19.4s // 58*c0-2*(a0+d0) + 10*b0 + add v18.4s, v18.4s, v20.4s // 58*c1-2*(a1+d1) + 10*b1 +.endm + +.macro vpp_end + add v17.8h, v17.8h, v31.8h + sqshrun v17.8b, v17.8h, #6 +.endm + +.macro FILTER_LUMA_VPP w, h, v + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 // src -= (8 / 2 - 1) * srcStride + mov x5, #\h + mov w12, #32 + dup v31.8h, w12 + qpel_start_\v 
+.loop_luma_vpp_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_luma_vpp_w8_\v\()_\w\()x\h: + add x6, x0, x9 +.if \w == 8 || \w == 24 + qpel_load_32b \v + qpel_filter_\v\()_32b + vpp_end + str d17, x7, #8 + add x9, x9, #8 +.elseif \w == 12 + qpel_load_32b \v + qpel_filter_\v\()_32b + vpp_end + str d17, x7, #8 + add x6, x0, #8 + qpel_load_32b \v + qpel_filter_\v\()_32b + vpp_end + fmov w6, s17 + str w6, x7, #4 + add x9, x9, #12 +.else + qpel_load_64b \v + qpel_filter_\v\()_64b + vpp_end + add v18.8h, v18.8h, v31.8h + sqshrun2 v17.16b, v18.8h, #6 + str q17, x7, #16 + add x9, x9, #16 +.endif + cmp x9, #\w + blt .loop_luma_vpp_w8_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_luma_vpp_\v\()_\w\()x\h + ret +.endm + +.macro vps_end + sub v17.8h, v17.8h, v31.8h +.endm + +.macro FILTER_VPS w, h, v + lsl x3, x3, #1 + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 // src -= (8 / 2 - 1) * srcStride + mov x5, #\h + mov w12, #8192 + dup v31.8h, w12 + qpel_start_\v +.loop_ps_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_ps_w8_\v\()_\w\()x\h: + add x6, x0, x9 +.if \w == 8 || \w == 24 + qpel_load_32b \v + qpel_filter_\v\()_32b + vps_end + str q17, x7, #16 + add x9, x9, #8 +.elseif \w == 12 + qpel_load_32b \v + qpel_filter_\v\()_32b + vps_end + str q17, x7, #16 + add x6, x0, #8 + qpel_load_32b \v + qpel_filter_\v\()_32b + vps_end + str d17, x7, #8 + add x9, x9, #12 +.else + qpel_load_64b \v + qpel_filter_\v\()_64b + vps_end + sub v18.8h, v18.8h, v31.8h + stp q17, q18, x7, #32 + add x9, x9, #16 +.endif + cmp x9, #\w + blt .loop_ps_w8_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_ps_\v\()_\w\()x\h + ret +.endm + +.macro vsp_end + add v17.4s, v17.4s, v31.4s + add v18.4s, v18.4s, v31.4s + sqshrun v17.4h, v17.4s, #12 + sqshrun2 v17.8h, v18.4s, #12 + sqxtun v17.8b, v17.8h +.endm + +.macro FILTER_VSP w, h, v + lsl x1, x1, #1 + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 + mov x5, #\h + mov w12, #1 + lsl w12, w12, #19 + add w12, w12, #2048 + dup v31.4s, w12 + mov x12, #\w + lsl x12, x12, #1 + qpel_start_\v\()_1 +.loop_luma_vsp_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_luma_vsp_w8_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_load_64b \v + qpel_filter_\v\()_32b_1 + vsp_end + str d17, x7, #8 + add x9, x9, #16 +.if \w == 12 + add x6, x0, #16 + qpel_load_64b \v + qpel_filter_\v\()_32b_1 + vsp_end + str s17, x7, #4 + add x9, x9, #8 +.endif + cmp x9, x12 + blt .loop_luma_vsp_w8_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_luma_vsp_\v\()_\w\()x\h + ret +.endm + +.macro vss_end + sshr v17.4s, v17.4s, #6 + sshr v18.4s, v18.4s, #6 + uzp1 v17.8h, v17.8h, v18.8h +.endm + +.macro FILTER_VSS w, h, v + lsl x1, x1, #1 + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 + lsl x3, x3, #1 + mov x5, #\h + mov x12, #\w + lsl x12, x12, #1 + qpel_start_\v\()_1 +.loop_luma_vss_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_luma_vss_w8_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_load_64b \v + qpel_filter_\v\()_32b_1 + vss_end +.if \w == 4 + str s17, x7, #4 + add x9, x9, #4 +.else + str q17, x7, #16 + add x9, x9, #16 +.if \w == 12 + add x6, x0, x9 + qpel_load_64b \v + qpel_filter_\v\()_32b_1 + vss_end + str d17, x7, #8 + add x9, x9, #8 +.endif +.endif + cmp x9, x12 + blt .loop_luma_vss_w8_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_luma_vss_\v\()_\w\()x\h + ret +.endm + +.macro 
hpp_end + add v17.8h, v17.8h, v31.8h + sqshrun v17.8b, v17.8h, #6 +.endm + +.macro FILTER_HPP w, h, v + mov w6, #\h + sub x3, x3, #\w + mov w12, #32 + dup v31.8h, w12 + qpel_start_\v +.if \w == 4 +.rept \h + mov x11, x0 + sub x11, x11, #4 + vextin8 \v + qpel_filter_\v\()_32b + hpp_end + str s17, x2, #4 + add x0, x0, x1 + add x2, x2, x3 +.endr + ret +.else +.loop1_hpp_\v\()_\w\()x\h: + mov x7, #\w + mov x11, x0 + sub x11, x11, #4 +.loop2_hpp_\v\()_\w\()x\h: + vextin8 \v + qpel_filter_\v\()_32b + hpp_end + str d17, x2, #8 + sub x11, x11, #8 + sub x7, x7, #8 +.if \w == 12 + vextin8 \v + qpel_filter_\v\()_32b + hpp_end + str s17, x2, #4 + sub x7, x7, #4 +.endif + cbnz x7, .loop2_hpp_\v\()_\w\()x\h + sub x6, x6, #1 + add x0, x0, x1 + add x2, x2, x3 + cbnz x6, .loop1_hpp_\v\()_\w\()x\h + ret +.endif +.endm + +.macro hps_end + sub v17.8h, v17.8h, v31.8h +.endm + +.macro FILTER_HPS w, h, v + sub x3, x3, #\w + lsl x3, x3, #1 + mov w12, #8192 + dup v31.8h, w12 + qpel_start_\v +.if \w == 4 +.loop_hps_\v\()_\w\()x\h\(): + mov x11, x0 + sub x11, x11, #4 + vextin8 \v + qpel_filter_\v\()_32b + hps_end + str d17, x2, #8 + sub w6, w6, #1 + add x0, x0, x1 + add x2, x2, x3 + cbnz w6, .loop_hps_\v\()_\w\()x\h + ret +.else +.loop1_hps_\v\()_\w\()x\h\(): + mov w7, #\w + mov x11, x0 + sub x11, x11, #4 +.loop2_hps_\v\()_\w\()x\h\(): +.if \w == 8 || \w == 12 || \w == 24 + vextin8 \v + qpel_filter_\v\()_32b + hps_end + str q17, x2, #16 + sub w7, w7, #8 + sub x11, x11, #8 +.if \w == 12 + vextin8 \v + qpel_filter_\v\()_32b + hps_end + str d17, x2, #8 + sub w7, w7, #4 +.endif +.elseif \w == 16 || \w == 32 || \w == 48 || \w == 64 + vextin8_64 \v + qpel_filter_\v\()_64b + hps_end + sub v18.8h, v18.8h, v31.8h + stp q17, q18, x2, #32 + sub w7, w7, #16 + sub x11, x11, #16 +.endif + cbnz w7, .loop2_hps_\v\()_\w\()x\h + sub w6, w6, #1 + add x0, x0, x1 + add x2, x2, x3 + cbnz w6, .loop1_hps_\v\()_\w\()x\h + ret +.endif +.endm + +.macro FILTER_CHROMA_VPP w, h, v + qpel_start_chroma_\v + mov w12, #32 + dup v31.8h, w12 + sub x0, x0, x1 + mov x5, #\h +.loop_chroma_vpp_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_chroma_vpp_w8_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_32b \v + qpel_filter_chroma_\v\()_32b + vpp_end + add x9, x9, #8 +.if \w == 2 + fmov w12, s17 + strh w12, x7, #2 +.elseif \w == 4 + str s17, x7, #4 +.elseif \w == 6 + str s17, x7, #4 + umov w12, v17.h2 + strh w12, x7, #2 +.elseif \w == 12 + str d17, x7, #8 + add x6, x0, x9 + qpel_chroma_load_32b \v + qpel_filter_chroma_\v\()_32b + vpp_end + str s17, x7, #4 + add x9, x9, #8 +.else + str d17, x7, #8 +.endif + cmp x9, #\w + blt .loop_chroma_vpp_w8_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_chroma_vpp_\v\()_\w\()x\h + ret +.endm + +.macro FILTER_CHROMA_VPS w, h, v + qpel_start_chroma_\v + mov w12, #8192 + dup v31.8h, w12 + lsl x3, x3, #1 + sub x0, x0, x1 + mov x5, #\h +.loop_vps_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_vps_w8_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_32b \v + qpel_filter_chroma_\v\()_32b + vps_end + add x9, x9, #8 +.if \w == 2 + str s17, x7, #4 +.elseif \w == 4 + str d17, x7, #8 +.elseif \w == 6 + str d17, x7, #8 + st1 {v17.s}2, x7, #4 +.elseif \w == 12 + str q17, x7, #16 + add x6, x0, x9 + qpel_chroma_load_32b \v + qpel_filter_chroma_\v\()_32b + vps_end + str d17, x7, #8 + add x9, x9, #8 +.else + str q17, x7, #16 +.endif + cmp x9, #\w + blt .loop_vps_w8_\v\()_\w\()x\h + + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_vps_\v\()_\w\()x\h + ret +.endm + +.macro 
FILTER_CHROMA_VSP w, h, v + lsl x1, x1, #1 + sub x0, x0, x1 + mov x5, #\h + mov w12, #1 + lsl w12, w12, #19 + add w12, w12, #2048 + dup v31.4s, w12 + mov x12, #\w + lsl x12, x12, #1 + qpel_start_chroma_\v\()_1 +.loop_vsp_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_vsp_w8_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vsp_end + add x9, x9, #16 +.if \w == 4 + str s17, x7, #4 +.elseif \w == 12 + str d17, x7, #8 + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vsp_end + str s17, x7, #4 + add x9, x9, #8 +.else + str d17, x7, #8 +.endif + cmp x9, x12 + blt .loop_vsp_w8_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_vsp_\v\()_\w\()x\h + ret +.endm + +.macro FILTER_CHROMA_VSS w, h, v + lsl x1, x1, #1 + sub x0, x0, x1 + lsl x3, x3, #1 + mov x5, #\h + mov x12, #\w + lsl x12, x12, #1 + qpel_start_chroma_\v\()_1 +.loop_vss_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.if \w == 4 +.rept 2 + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vss_end + str s17, x7, #4 + add x9, x9, #4 +.endr +.else +.loop_vss_w8_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vss_end + str q17, x7, #16 + add x9, x9, #16 +.if \w == 12 + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vss_end + str d17, x7, #8 + add x9, x9, #8 +.endif + cmp x9, x12 + blt .loop_vss_w8_\v\()_\w\()x\h +.endif + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_vss_\v\()_\w\()x\h + ret +.endm + +.macro FILTER_CHROMA_HPP w, h, v + qpel_start_chroma_\v + mov w12, #32 + dup v31.8h, w12 + mov w6, #\h + sub x3, x3, #\w +.if \w == 2 || \w == 4 || \w == 6 || \w == 12 +.loop4_chroma_hpp_\v\()_\w\()x\h: + mov x11, x0 + sub x11, x11, #2 + vextin8_chroma \v + qpel_filter_chroma_\v\()_32b + hpp_end +.if \w == 2 + fmov w12, s17 + strh w12, x2, #2 +.elseif \w == 4 + str s17, x2, #4 +.elseif \w == 6 + str s17, x2, #4 + umov w12, v17.h2 + strh w12, x2, #2 +.elseif \w == 12 + str d17, x2, #8 + sub x11, x11, #8 + vextin8_chroma \v + qpel_filter_chroma_\v\()_32b + hpp_end + str s17, x2, #4 +.endif + sub w6, w6, #1 + add x0, x0, x1 + add x2, x2, x3 + cbnz w6, .loop4_chroma_hpp_\v\()_\w\()x\h + ret +.else +.loop2_chroma_hpp_\v\()_\w\()x\h: + mov x7, #\w + lsr x7, x7, #3 + mov x11, x0 + sub x11, x11, #2 +.loop3_chroma_hpp_\v\()_\w\()x\h: +.if \w == 8 || \w == 24 + vextin8_chroma \v + qpel_filter_chroma_\v\()_32b + hpp_end + str d17, x2, #8 + sub x7, x7, #1 + sub x11, x11, #8 +.elseif \w == 16 || \w == 32 || \w == 48 || \w == 64 + vextin8_chroma_64 \v + qpel_filter_chroma_\v\()_64b + hpp_end + add v18.8h, v18.8h, v31.8h + sqshrun2 v17.16b, v18.8h, #6 + str q17, x2, #16 + sub x7, x7, #2 + sub x11, x11, #16 +.endif + cbnz x7, .loop3_chroma_hpp_\v\()_\w\()x\h + sub w6, w6, #1 + add x0, x0, x1 + add x2, x2, x3 + cbnz w6, .loop2_chroma_hpp_\v\()_\w\()x\h + ret +.endif +.endm + +.macro CHROMA_HPS_2_4_6_12 w, v + mov x11, x0 + sub x11, x11, #2 + vextin8_chroma \v + qpel_filter_chroma_\v\()_32b + hps_end + sub x11, x11, #8 +.if \w == 2 + str s17, x2, #4 +.elseif \w == 4 + str d17, x2, #8 +.elseif \w == 6 + str d17, x2, #8 + st1 {v17.s}2, x2, #4 +.elseif \w == 12 + str q17, x2, #16 + vextin8_chroma \v + qpel_filter_chroma_\v\()_32b + sub v17.8h, v17.8h, v31.8h + str d17, x2, #8 +.endif + add x0, x0, x1 + add x2, x2, x3 +.endm + +.macro FILTER_CHROMA_HPS w, h, v + qpel_start_chroma_\v + mov w12, #8192 + dup v31.8h, w12 + sub x3, x3, #\w + lsl x3, x3, 
#1 + +.if \w == 2 || \w == 4 || \w == 6 || \w == 12 + cmp x5, #0 + beq 0f + sub x0, x0, x1 +.rept 3 + CHROMA_HPS_2_4_6_12 \w, \v +.endr +0: +.rept \h + CHROMA_HPS_2_4_6_12 \w, \v +.endr + ret +.else + mov w10, #\h + cmp x5, #0 + beq 9f + sub x0, x0, x1 + add w10, w10, #3 +9: + mov w6, w10 +.loop1_chroma_hps_\v\()_\w\()x\h\(): + mov x7, #\w + lsr x7, x7, #3 + mov x11, x0 + sub x11, x11, #2 +.loop2_chroma_hps_\v\()_\w\()x\h\(): +.if \w == 8 || \w == 24 + vextin8_chroma \v + qpel_filter_chroma_\v\()_32b + hps_end + str q17, x2, #16 + sub x7, x7, #1 + sub x11, x11, #8 +.elseif \w == 16 || \w == 32 || \w == 48 || \w == 64 + vextin8_chroma_64 \v + qpel_filter_chroma_\v\()_64b + hps_end + sub v18.8h, v18.8h, v31.8h + stp q17, q18, x2, #32 + sub x7, x7, #2 + sub x11, x11, #16 +.endif + cbnz x7, .loop2_chroma_hps_\v\()_\w\()x\h\() + sub w6, w6, #1 + add x0, x0, x1 + add x2, x2, x3 + cbnz w6, .loop1_chroma_hps_\v\()_\w\()x\h\() + ret +.endif +.endm + +const g_lumaFilter, align=8 +.word 0,0,0,0,0,0,64,64,0,0,0,0,0,0,0,0 +.word -1,-1,4,4,-10,-10,58,58,17,17,-5,-5,1,1,0,0 +.word -1,-1,4,4,-11,-11,40,40,40,40,-11,-11,4,4,-1,-1 +.word 0,0,1,1,-5,-5,17,17,58,58,-10,-10,4,4,-1,-1 +endconst
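The qpel_start_*/qpel_filter_* macros and the g_lumaFilter table above encode HEVC's 8-tap luma interpolation filters (taps a..h in the comments). The plain-C++ sketch below is illustrative only: it shows what one output sample of the vertical "pp" variant computes, with the +32 >> 6 rounding that the assembly performs via sqshrun; saturation/clipping and the ps/sp/ss rounding differences are omitted.

    #include <cstdint>

    // HEVC 8-tap luma coefficients, matching the constants loaded by
    // qpel_start_0..3 and stored in the g_lumaFilter table above
    // (index 0 is the full-pel "copy" filter).
    static const int luma_filter[4][8] = {
        {  0, 0,   0, 64,  0,   0, 0,  0 },
        { -1, 4, -10, 58, 17,  -5, 1,  0 },
        { -1, 4, -11, 40, 40, -11, 4, -1 },
        {  0, 1,  -5, 17, 58, -10, 4, -1 },
    };

    // One vertical "pp" output sample: taps a..h are 8 consecutive rows centred
    // on the current one (src -= 3*stride in the assembly), then round and shift.
    static int filter_vert_pp(const std::uint8_t *src, int srcStride, int coeffIdx)
    {
        int sum = 0;
        for (int t = 0; t < 8; t++)
            sum += luma_filter[coeffIdx][t] * src[(t - 3) * srcStride];
        return (sum + 32) >> 6;   // clipping to the pixel range omitted here
    }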
View file
x265_3.6.tar.gz/source/common/aarch64/ipfilter-sve2.S
Added
@@ -0,0 +1,1282 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// Functions in this file: +// ***** luma_vpp ***** +// ***** luma_vps ***** +// ***** luma_vsp ***** +// ***** luma_vss ***** +// ***** luma_hpp ***** +// ***** luma_hps ***** +// ***** chroma_vpp ***** +// ***** chroma_vps ***** +// ***** chroma_vsp ***** +// ***** chroma_vss ***** +// ***** chroma_hpp ***** +// ***** chroma_hps ***** + +#include "asm-sve.S" +#include "ipfilter-common.S" + +.arch armv8-a+sve2 + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +.macro qpel_load_32b_sve2 v +.if \v == 0 + add x6, x6, x11 // do not load 3 values that are not used in qpel_filter_0 + ld1b {z3.h}, p0/z, x6 + add x6, x6, x1 +.elseif \v == 1 || \v == 2 || \v == 3 +.if \v != 3 // not used in qpel_filter_3 + ld1b {z0.h}, p0/z, x6 + add x6, x6, x1 +.else + add x6, x6, x1 +.endif + ld1b {z1.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z2.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z3.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z4.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z5.h}, p0/z, x6 + add x6, x6, x1 +.if \v != 1 // not used in qpel_filter_1 + ld1b {z6.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z7.h}, p0/z, x6 +.else + ld1b {z6.h}, p0/z, x6 +.endif +.endif +.endm + +.macro qpel_load_64b_sve2_gt_16 v +.if \v == 0 + add x6, x6, x11 // do not load 3 values that are not used in qpel_filter_0 + ld1b {z3.h}, p2/z, x6 + add x6, x6, x1 +.elseif \v == 1 || \v == 2 || \v == 3 +.if \v != 3 // not used in qpel_filter_3 + ld1b {z0.h}, p2/z, x6 + add x6, x6, x1 +.else + add x6, x6, x1 +.endif + ld1b {z1.h}, p2/z, x6 + add x6, x6, x1 + ld1b {z2.h}, p2/z, x6 + add x6, x6, x1 + ld1b {z3.h}, p2/z, x6 + add x6, x6, x1 + ld1b {z4.h}, p2/z, x6 + add x6, x6, x1 + ld1b {z5.h}, p2/z, x6 + add x6, x6, x1 +.if \v != 1 // not used in qpel_filter_1 + ld1b {z6.h}, p2/z, x6 + add x6, x6, x1 + ld1b {z7.h}, p2/z, x6 +.else + ld1b {z6.h}, p2/z, x6 +.endif +.endif +.endm + +.macro qpel_chroma_load_32b_sve2 v +.if \v == 0 + // qpel_filter_chroma_0 only uses values in v1 + add x6, x6, x1 + ld1b {z1.h}, p0/z, x6 +.else + ld1b {z0.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z1.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z2.h}, p0/z, x6 + add x6, x6, x1 + ld1b {z3.h}, p0/z, x6 +.endif +.endm + +.macro qpel_start_sve2_0 + mov z24.h, #64 +.endm + +.macro qpel_filter_sve2_0_32b + mul z17.h, z3.h, z24.h // 64*d +.endm + +.macro qpel_filter_sve2_0_64b + qpel_filter_sve2_0_32b + mul z18.h, z11.h, 
z24.h +.endm + +.macro qpel_start_sve2_1 + mov z24.h, #58 + mov z25.h, #10 + mov z26.h, #17 + mov z27.h, #5 +.endm + +.macro qpel_filter_sve2_1_32b + mul z19.h, z2.h, z25.h // c*10 + mul z17.h, z3.h, z24.h // d*58 + mul z21.h, z4.h, z26.h // e*17 + mul z23.h, z5.h, z27.h // f*5 + sub z17.h, z17.h, z19.h // d*58 - c*10 + lsl z18.h, z1.h, #2 // b*4 + add z17.h, z17.h, z21.h // d*58 - c*10 + e*17 + sub z21.h, z6.h, z0.h // g - a + add z17.h, z17.h, z18.h // d*58 - c*10 + e*17 + b*4 + sub z21.h, z21.h, z23.h // g - a - f*5 + add z17.h, z17.h, z21.h // d*58 - c*10 + e*17 + b*4 + g - a - f*5 +.endm + +.macro qpel_filter_sve2_1_64b + qpel_filter_sve2_1_32b + mul z20.h, z10.h, z25.h // c*10 + mul z18.h, z11.h, z24.h // d*58 + mul z21.h, z12.h, z26.h // e*17 + mul z23.h, z13.h, z27.h // f*5 + sub z18.h, z18.h, z20.h // d*58 - c*10 + lsl z28.h, z30.h, #2 // b*4 + add z18.h, z18.h, z21.h // d*58 - c*10 + e*17 + sub z21.h, z14.h, z29.h // g - a + add z18.h, z18.h, z28.h // d*58 - c*10 + e*17 + b*4 + sub z21.h, z21.h, z23.h // g - a - f*5 + add z18.h, z18.h, z21.h // d*58 - c*10 + e*17 + b*4 + g - a - f*5 +.endm + +.macro qpel_start_sve2_2 + mov z24.h, #11 + mov z25.h, #40 +.endm + +.macro qpel_filter_sve2_2_32b + add z17.h, z3.h, z4.h // d + e + add z19.h, z2.h, z5.h // c + f + add z23.h, z1.h, z6.h // b + g + add z21.h, z0.h, z7.h // a + h + mul z17.h, z17.h, z25.h // 40 * (d + e) + mul z19.h, z19.h, z24.h // 11 * (c + f) + lsl z23.h, z23.h, #2 // (b + g) * 4 + add z19.h, z19.h, z21.h // 11 * (c + f) + a + h + add z17.h, z17.h, z23.h // 40 * (d + e) + (b + g) * 4 + sub z17.h, z17.h, z19.h // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h +.endm + +.macro qpel_filter_sve2_2_64b + qpel_filter_sve2_2_32b + add z27.h, z11.h, z12.h // d + e + add z16.h, z10.h, z13.h // c + f + add z23.h, z30.h, z14.h // b + g + add z21.h, z29.h, z15.h // a + h + mul z27.h, z27.h, z25.h // 40 * (d + e) + mul z16.h, z16.h, z24.h // 11 * (c + f) + lsl z23.h, z23.h, #2 // (b + g) * 4 + add z16.h, z16.h, z21.h // 11 * (c + f) + a + h + add z27.h, z27.h, z23.h // 40 * (d + e) + (b + g) * 4 + sub z18.h, z27.h, z16.h // 40 * (d + e) + (b + g) * 4 - 11 * (c + f) - a - h +.endm + +.macro qpel_start_sve2_3 + mov z24.h, #17 + mov z25.h, #5 + mov z26.h, #58 + mov z27.h, #10 +.endm + +.macro qpel_filter_sve2_3_32b + mul z19.h, z2.h, z25.h // c * 5 + mul z17.h, z3.h, z24.h // d * 17 + mul z21.h, z4.h, z26.h // e * 58 + mul z23.h, z5.h, z27.h // f * 10 + sub z17.h, z17.h, z19.h // d * 17 - c * 5 + lsl z19.h, z6.h, #2 // g * 4 + add z17.h, z17.h, z21.h // d * 17 - c * 5 + e * 58 + sub z21.h, z1.h, z7.h // b - h + add z17.h, z17.h, z19.h // d * 17 - c * 5 + e * 58 + g * 4 + sub z21.h, z21.h, z23.h // b - h - f * 10 + add z17.h, z17.h, z21.h // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10 +.endm + +.macro qpel_filter_sve2_3_64b + qpel_filter_sve2_3_32b + mul z16.h, z10.h, z25.h // c * 5 + mul z18.h, z11.h, z24.h // d * 17 + mul z21.h, z12.h, z26.h // e * 58 + mul z23.h, z13.h, z27.h // f * 10 + sub z18.h, z18.h, z16.h // d * 17 - c * 5 + lsl z16.h, z14.h, #2 // g * 4 + add z18.h, z18.h, z21.h // d * 17 - c * 5 + e * 58 + sub z21.h, z30.h, z15.h // b - h + add z18.h, z18.h, z16.h // d * 17 - c * 5 + e * 58 + g * 4 + sub z21.h, z21.h, z23.h // b - h - f * 10 + add z18.h, z18.h, z21.h // d * 17 - c * 5 + e * 58 + g * 4 + b - h - f * 10 +.endm + +.macro qpel_start_chroma_sve2_0 + mov z29.h, #64 +.endm + +.macro qpel_filter_chroma_sve2_0_32b + mul z17.h, z1.h, z29.h // 64*b +.endm + +.macro qpel_start_chroma_sve2_1 + mov z29.h, #58 + mov 
z30.h, #10 +.endm + +.macro qpel_filter_chroma_sve2_1_32b + mul z17.h, z1.h, z29.h // 58 * b + mul z19.h, z2.h, z30.h // 10 * c + add z22.h, z0.h, z3.h // a + d + lsl z22.h, z22.h, #1 // 2 * (a+d) + sub z17.h, z17.h, z22.h // 58*b - 2*(a+d) + add z17.h, z17.h, z19.h // 58*b-2*(a+d) + 10*c +.endm + +.macro qpel_start_chroma_sve2_2 + mov z30.h, #54 +.endm + +.macro qpel_filter_chroma_sve2_2_32b + mul z17.h, z1.h, z30.h // 54 * b + lsl z19.h, z0.h, #2 // 4 * a + lsl z21.h, z2.h, #4 // 16 * c + lsl z23.h, z3.h, #1 // 2 * d + add z17.h, z17.h, z21.h // 54*b + 16*c + add z19.h, z19.h, z23.h // 4*a + 2*d + sub z17.h, z17.h, z19.h // 54*b+16*c - (4*a+2*d) +.endm + +.macro qpel_start_chroma_sve2_3 + mov z28.h, #46 + mov z29.h, #28 + mov z30.h, #6 +.endm + +.macro qpel_filter_chroma_sve2_3_32b + mul z17.h, z1.h, z28.h // 46 * b + mul z19.h, z2.h, z29.h // 28 * c + lsl z21.h, z3.h, #2 // 4 * d + mul z23.h, z0.h, z30.h // 6 * a + add z17.h, z17.h, z19.h // 46*b + 28*c + add z21.h, z21.h, z23.h // 4*d + 6*a + sub z17.h, z17.h, z21.h // 46*b+28*c - (4*d+6*a) +.endm + +.macro qpel_start_chroma_sve2_4 + mov z29.h, #36 +.endm + +.macro qpel_filter_chroma_sve2_4_32b + add z20.h, z0.h, z3.h // a + d + add z17.h, z1.h, z2.h // b + c + lsl z20.h, z20.h, #2 // 4 * (a+d) + mul z17.h, z17.h, z29.h // 36 * (b+c) + sub z17.h, z17.h, z20.h // 36*(b+c) - 4*(a+d) +.endm + +.macro qpel_start_chroma_sve2_5 + mov z28.h, #28 + mov z29.h, #46 + mov z30.h, #6 +.endm + +.macro qpel_filter_chroma_sve2_5_32b + mul z17.h, z1.h, z28.h // 28 * b + mul z19.h, z2.h, z29.h // 46 * c + lsl z21.h, z0.h, #2 // 4 * a + mul z23.h, z3.h, z30.h // 6 * d + add z17.h, z17.h, z19.h // 28*b + 46*c + add z21.h, z21.h, z23.h // 4*a + 6*d + sub z17.h, z17.h, z21.h // 28*b+46*c - (4*a+6*d) +.endm + +.macro qpel_start_chroma_sve2_6 + mov z30.h, #54 +.endm + +.macro qpel_filter_chroma_sve2_6_32b + mul z17.h, z2.h, z30.h // 54 * c + lsl z19.h, z0.h, #1 // 2 * a + lsl z21.h, z1.h, #4 // 16 * b + lsl z23.h, z3.h, #2 // 4 * d + add z17.h, z17.h, z21.h // 54*c + 16*b + add z19.h, z19.h, z23.h // 2*a + 4*d + sub z17.h, z17.h, z19.h // 54*c+16*b - (2*a+4*d) +.endm + +.macro qpel_start_chroma_sve2_7 + mov z29.h, #58 + mov z30.h, #10 +.endm + +.macro qpel_filter_chroma_sve2_7_32b + add z20.h, z0.h, z3.h // a + d + mul z17.h, z2.h, z29.h // 58 * c + lsl z20.h, z20.h, #1 // 2 * (a+d) + mul z19.h, z1.h, z30.h // 10 * b + sub z17.h, z17.h, z20.h // 58*c - 2*(a+d) + add z17.h, z17.h, z19.h // 58*c-2*(a+d) + 10*b +.endm + +.macro vpp_end_sve2 + add z17.h, z17.h, z31.h + sqshrun v17.8b, v17.8h, #6 +.endm + +.macro FILTER_LUMA_VPP_SVE2 w, h, v + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 // src -= (8 / 2 - 1) * srcStride + mov x5, #\h + mov z31.h, #32 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_FILTER_LUMA_VPP_\v\()_\w\()x\h + qpel_start_\v +.loop_luma_vpp_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_luma_vpp_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 +.if \w == 8 || \w == 24 + qpel_load_32b \v + qpel_filter_\v\()_32b + vpp_end + str d17, x7, #8 + add x9, x9, #8 +.elseif \w == 12 + qpel_load_32b \v + qpel_filter_\v\()_32b + vpp_end + str d17, x7, #8 + add x6, x0, #8 + qpel_load_32b \v + qpel_filter_\v\()_32b + vpp_end + fmov w6, s17 + str w6, x7, #4 + add x9, x9, #12 +.else + qpel_load_64b \v + qpel_filter_\v\()_64b + vpp_end + add v18.8h, v18.8h, v31.8h + sqshrun2 v17.16b, v18.8h, #6 + str q17, x7, #16 + add x9, x9, #16 +.endif + cmp x9, #\w + blt .loop_luma_vpp_w8_sve2_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, 
x3 + sub x5, x5, #1 + cbnz x5, .loop_luma_vpp_sve2_\v\()_\w\()x\h + ret +.vl_gt_16_FILTER_LUMA_VPP_\v\()_\w\()x\h: + ptrue p0.h, vl8 + ptrue p2.h, vl16 + qpel_start_sve2_\v +.gt_16_loop_luma_vpp_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.gt_16_loop_luma_vpp_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 +.if \w == 8 || \w == 24 + qpel_load_32b_sve2 \v + qpel_filter_sve2_\v\()_32b + vpp_end_sve2 + str d17, x7, #8 + add x9, x9, #8 +.elseif \w == 12 + qpel_load_32b_sve2 \v + qpel_filter_sve2_\v\()_32b + vpp_end_sve2 + str d17, x7, #8 + add x6, x0, #8 + qpel_load_32b_sve2 \v + qpel_filter_sve2_\v\()_32b + vpp_end_sve2 + fmov w6, s17 + str w6, x7, #4 + add x9, x9, #12 +.else + qpel_load_64b_sve2_gt_16 \v + qpel_filter_sve2_\v\()_32b + vpp_end_sve2 + add z18.h, z18.h, z31.h + sqshrun2 v17.16b, v18.8h, #6 + str q17, x7, #16 + add x9, x9, #16 +.endif + cmp x9, #\w + blt .gt_16_loop_luma_vpp_w8_sve2_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .gt_16_loop_luma_vpp_sve2_\v\()_\w\()x\h + ret +.endm + +// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPP_SVE2 w, h +function x265_interp_8tap_vert_pp_\w\()x\h\()_sve2 + cmp x4, #0 + b.eq 0f + cmp x4, #1 + b.eq 1f + cmp x4, #2 + b.eq 2f + cmp x4, #3 + b.eq 3f +0: + FILTER_LUMA_VPP_SVE2 \w, \h, 0 +1: + FILTER_LUMA_VPP_SVE2 \w, \h, 1 +2: + FILTER_LUMA_VPP_SVE2 \w, \h, 2 +3: + FILTER_LUMA_VPP_SVE2 \w, \h, 3 +endfunc +.endm + +LUMA_VPP_SVE2 8, 4 +LUMA_VPP_SVE2 8, 8 +LUMA_VPP_SVE2 8, 16 +LUMA_VPP_SVE2 8, 32 +LUMA_VPP_SVE2 12, 16 +LUMA_VPP_SVE2 16, 4 +LUMA_VPP_SVE2 16, 8 +LUMA_VPP_SVE2 16, 16 +LUMA_VPP_SVE2 16, 32 +LUMA_VPP_SVE2 16, 64 +LUMA_VPP_SVE2 16, 12 +LUMA_VPP_SVE2 24, 32 +LUMA_VPP_SVE2 32, 8 +LUMA_VPP_SVE2 32, 16 +LUMA_VPP_SVE2 32, 32 +LUMA_VPP_SVE2 32, 64 +LUMA_VPP_SVE2 32, 24 +LUMA_VPP_SVE2 48, 64 +LUMA_VPP_SVE2 64, 16 +LUMA_VPP_SVE2 64, 32 +LUMA_VPP_SVE2 64, 64 +LUMA_VPP_SVE2 64, 48 + +// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPS_4xN_SVE2 h +function x265_interp_8tap_vert_ps_4x\h\()_sve2 + lsl x3, x3, #1 + lsl x5, x4, #6 + lsl x4, x1, #2 + sub x4, x4, x1 + sub x0, x0, x4 + + mov z28.s, #8192 + mov x4, #\h + movrel x12, g_lumaFilter + add x12, x12, x5 + ptrue p0.s, vl4 + ld1rd {z16.d}, p0/z, x12 + ld1rd {z17.d}, p0/z, x12, #8 + ld1rd {z18.d}, p0/z, x12, #16 + ld1rd {z19.d}, p0/z, x12, #24 + ld1rd {z20.d}, p0/z, x12, #32 + ld1rd {z21.d}, p0/z, x12, #40 + ld1rd {z22.d}, p0/z, x12, #48 + ld1rd {z23.d}, p0/z, x12, #56 + +.loop_vps_sve2_4x\h: + mov x6, x0 + + ld1b {z0.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z1.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z2.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z3.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z4.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z5.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z6.s}, p0/z, x6 + add x6, x6, x1 + ld1b {z7.s}, p0/z, x6 + add x6, x6, x1 + + mul z0.s, z0.s, z16.s + mla z0.s, p0/m, z1.s, z17.s + mla z0.s, p0/m, z2.s, z18.s + mla z0.s, p0/m, z3.s, z19.s + mla z0.s, p0/m, z4.s, z20.s + mla z0.s, p0/m, z5.s, z21.s + mla z0.s, p0/m, z6.s, z22.s + mla z0.s, p0/m, z7.s, z23.s + + sub z0.s, z0.s, z28.s + sqxtn v0.4h, v0.4s + st1 {v0.8b}, x2, x3 + + add x0, x0, x1 + sub x4, x4, #1 + cbnz x4, .loop_vps_sve2_4x\h + ret +endfunc +.endm + +LUMA_VPS_4xN_SVE2 4 +LUMA_VPS_4xN_SVE2 8 +LUMA_VPS_4xN_SVE2 16 + +// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VSP_4xN_SVE2 h 
+function x265_interp_8tap_vert_sp_4x\h\()_sve2 + lsl x5, x4, #6 + lsl x1, x1, #1 + lsl x4, x1, #2 + sub x4, x4, x1 + sub x0, x0, x4 + + mov w12, #1 + lsl w12, w12, #19 + add w12, w12, #2048 + dup v24.4s, w12 + mov x4, #\h + movrel x12, g_lumaFilter + add x12, x12, x5 + + ptrue p0.s, vl4 + ld1rd {z16.d}, p0/z, x12 + ld1rd {z17.d}, p0/z, x12, #8 + ld1rd {z18.d}, p0/z, x12, #16 + ld1rd {z19.d}, p0/z, x12, #24 + ld1rd {z20.d}, p0/z, x12, #32 + ld1rd {z21.d}, p0/z, x12, #40 + ld1rd {z22.d}, p0/z, x12, #48 + ld1rd {z23.d}, p0/z, x12, #56 + +.loop_vsp_sve2_4x\h: + mov x6, x0 + + ld1 {v0.8b}, x6, x1 + ld1 {v1.8b}, x6, x1 + ld1 {v2.8b}, x6, x1 + ld1 {v3.8b}, x6, x1 + ld1 {v4.8b}, x6, x1 + ld1 {v5.8b}, x6, x1 + ld1 {v6.8b}, x6, x1 + ld1 {v7.8b}, x6, x1 + + sunpklo z0.s, z0.h + sunpklo z1.s, z1.h + mul z0.s, z0.s, z16.s + sunpklo z2.s, z2.h + mla z0.s, p0/m, z1.s, z17.s + sunpklo z3.s, z3.h + mla z0.s, p0/m, z2.s, z18.s + sunpklo z4.s, z4.h + mla z0.s, p0/m, z3.s, z19.s + sunpklo z5.s, z5.h + mla z0.s, p0/m, z4.s, z20.s + sunpklo z6.s, z6.h + mla z0.s, p0/m, z5.s, z21.s + sunpklo z7.s, z7.h + mla z0.s, p0/m, z6.s, z22.s + + mla z0.s, p0/m, z7.s, z23.s + + add z0.s, z0.s, z24.s + sqshrun v0.4h, v0.4s, #12 + sqxtun v0.8b, v0.8h + st1 {v0.s}0, x2, x3 + + add x0, x0, x1 + sub x4, x4, #1 + cbnz x4, .loop_vsp_sve2_4x\h + ret +endfunc +.endm + +LUMA_VSP_4xN_SVE2 4 +LUMA_VSP_4xN_SVE2 8 +LUMA_VSP_4xN_SVE2 16 + +.macro vps_end_sve2 + sub z17.h, z17.h, z31.h +.endm + +.macro FILTER_VPS_SVE2 w, h, v + lsl x3, x3, #1 + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 // src -= (8 / 2 - 1) * srcStride + mov x5, #\h + mov z31.h, #8192 + rdvl x14, #1 + cmp x14, #16 + bgt .vl_gt_16_FILTER_VPS_\v\()_\w\()x\h + qpel_start_\v +.loop_ps_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_ps_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 +.if \w == 8 || \w == 24 + qpel_load_32b \v + qpel_filter_\v\()_32b + vps_end + str q17, x7, #16 + add x9, x9, #8 +.elseif \w == 12 + qpel_load_32b \v + qpel_filter_\v\()_32b + vps_end + str q17, x7, #16 + add x6, x0, #8 + qpel_load_32b \v + qpel_filter_\v\()_32b + vps_end + str d17, x7, #8 + add x9, x9, #12 +.else + qpel_load_64b \v + qpel_filter_\v\()_64b + vps_end + sub v18.8h, v18.8h, v31.8h + stp q17, q18, x7, #32 + add x9, x9, #16 +.endif + cmp x9, #\w + blt .loop_ps_w8_sve2_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_ps_sve2_\v\()_\w\()x\h + ret +.vl_gt_16_FILTER_VPS_\v\()_\w\()x\h: + ptrue p0.h, vl8 + ptrue p2.h, vl16 + qpel_start_sve2_\v +.gt_16_loop_ps_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.gt_16_loop_ps_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 +.if \w == 8 || \w == 24 + qpel_load_32b_sve2 \v + qpel_filter_sve2_\v\()_32b + vps_end_sve2 + str q17, x7, #16 + add x9, x9, #8 +.elseif \w == 12 + qpel_load_32b_sve2 \v + qpel_filter_sve2_\v\()_32b + vps_end_sve2 + str q17, x7, #16 + add x6, x0, #8 + qpel_load_32b_sve2 \v + qpel_filter_sve2_\v\()_32b + vps_end_sve2 + str d17, x7, #8 + add x9, x9, #12 +.else + qpel_load_64b_sve2_gt_16 \v + qpel_filter_sve2_\v\()_32b + vps_end_sve2 + sub z18.h, z18.h, z31.h + stp q17, q18, x7, #32 + add x9, x9, #16 +.endif + cmp x9, #\w + blt .gt_16_loop_ps_w8_sve2_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .gt_16_loop_ps_sve2_\v\()_\w\()x\h + ret +.endm + +// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPS_SVE2 w, h +function 
x265_interp_8tap_vert_ps_\w\()x\h\()_sve2 + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f +0: + FILTER_VPS_SVE2 \w, \h, 0 +1: + FILTER_VPS_SVE2 \w, \h, 1 +2: + FILTER_VPS_SVE2 \w, \h, 2 +3: + FILTER_VPS_SVE2 \w, \h, 3 +endfunc +.endm + +LUMA_VPS_SVE2 8, 4 +LUMA_VPS_SVE2 8, 8 +LUMA_VPS_SVE2 8, 16 +LUMA_VPS_SVE2 8, 32 +LUMA_VPS_SVE2 12, 16 +LUMA_VPS_SVE2 16, 4 +LUMA_VPS_SVE2 16, 8 +LUMA_VPS_SVE2 16, 16 +LUMA_VPS_SVE2 16, 32 +LUMA_VPS_SVE2 16, 64 +LUMA_VPS_SVE2 16, 12 +LUMA_VPS_SVE2 24, 32 +LUMA_VPS_SVE2 32, 8 +LUMA_VPS_SVE2 32, 16 +LUMA_VPS_SVE2 32, 32 +LUMA_VPS_SVE2 32, 64 +LUMA_VPS_SVE2 32, 24 +LUMA_VPS_SVE2 48, 64 +LUMA_VPS_SVE2 64, 16 +LUMA_VPS_SVE2 64, 32 +LUMA_VPS_SVE2 64, 64 +LUMA_VPS_SVE2 64, 48 + +// ***** luma_vss ***** +.macro vss_end_sve2 + asr z17.s, z17.s, #6 + asr z18.s, z18.s, #6 + uzp1 v17.8h, v17.8h, v18.8h +.endm + +.macro FILTER_VSS_SVE2 w, h, v + lsl x1, x1, #1 + lsl x10, x1, #2 // x10 = 4 * x1 + sub x11, x10, x1 // x11 = 3 * x1 + sub x0, x0, x11 + lsl x3, x3, #1 + mov x5, #\h + mov x12, #\w + lsl x12, x12, #1 + qpel_start_\v\()_1 +.loop_luma_vss_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_luma_vss_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_load_64b \v + qpel_filter_\v\()_32b_1 + vss_end_sve2 +.if \w == 4 + str s17, x7, #4 + add x9, x9, #4 +.else + str q17, x7, #16 + add x9, x9, #16 +.if \w == 12 + add x6, x0, x9 + qpel_load_64b \v + qpel_filter_\v\()_32b_1 + vss_end_sve2 + str d17, x7, #8 + add x9, x9, #8 +.endif +.endif + cmp x9, x12 + blt .loop_luma_vss_w8_sve2_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_luma_vss_sve2_\v\()_\w\()x\h + ret +.endm + +// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VSS_SVE2 w, h +function x265_interp_8tap_vert_ss_\w\()x\h\()_sve2 + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f +0: + FILTER_VSS_SVE2 \w, \h, 0 +1: + FILTER_VSS_SVE2 \w, \h, 1 +2: + FILTER_VSS_SVE2 \w, \h, 2 +3: + FILTER_VSS_SVE2 \w, \h, 3 +endfunc +.endm + +LUMA_VSS_SVE2 4, 4 +LUMA_VSS_SVE2 4, 8 +LUMA_VSS_SVE2 4, 16 +LUMA_VSS_SVE2 8, 4 +LUMA_VSS_SVE2 8, 8 +LUMA_VSS_SVE2 8, 16 +LUMA_VSS_SVE2 8, 32 +LUMA_VSS_SVE2 12, 16 +LUMA_VSS_SVE2 16, 4 +LUMA_VSS_SVE2 16, 8 +LUMA_VSS_SVE2 16, 16 +LUMA_VSS_SVE2 16, 32 +LUMA_VSS_SVE2 16, 64 +LUMA_VSS_SVE2 16, 12 +LUMA_VSS_SVE2 32, 8 +LUMA_VSS_SVE2 32, 16 +LUMA_VSS_SVE2 32, 32 +LUMA_VSS_SVE2 32, 64 +LUMA_VSS_SVE2 32, 24 +LUMA_VSS_SVE2 64, 16 +LUMA_VSS_SVE2 64, 32 +LUMA_VSS_SVE2 64, 64 +LUMA_VSS_SVE2 64, 48 +LUMA_VSS_SVE2 24, 32 +LUMA_VSS_SVE2 48, 64 + +// ***** luma_hps ***** + +.macro FILTER_CHROMA_VPP_SVE2 w, h, v + ptrue p0.h, vl8 + qpel_start_chroma_sve2_\v + mov z31.h, #32 + sub x0, x0, x1 + mov x5, #\h +.loop_chroma_vpp_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_chroma_vpp_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_32b_sve2 \v + qpel_filter_chroma_sve2_\v\()_32b + vpp_end_sve2 + add x9, x9, #8 +.if \w == 2 + fmov w12, s17 + strh w12, x7, #2 +.elseif \w == 4 + str s17, x7, #4 +.elseif \w == 6 + str s17, x7, #4 + umov w12, v17.h2 + strh w12, x7, #2 +.elseif \w == 12 + str d17, x7, #8 + add x6, x0, x9 + qpel_chroma_load_32b_sve2 \v + qpel_filter_chroma_sve2_\v\()_32b + vpp_end_sve2 + str s17, x7, #4 + add x9, x9, #8 +.else + str d17, x7, #8 +.endif + cmp x9, #\w + blt .loop_chroma_vpp_w8_sve2_\v\()_\w\()x\h + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, 
.loop_chroma_vpp_sve2_\v\()_\w\()x\h + ret +.endm + +// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VPP_SVE2 w, h +function x265_interp_4tap_vert_pp_\w\()x\h\()_sve2 + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VPP_SVE2 \w, \h, 0 +1: + FILTER_CHROMA_VPP_SVE2 \w, \h, 1 +2: + FILTER_CHROMA_VPP_SVE2 \w, \h, 2 +3: + FILTER_CHROMA_VPP_SVE2 \w, \h, 3 +4: + FILTER_CHROMA_VPP_SVE2 \w, \h, 4 +5: + FILTER_CHROMA_VPP_SVE2 \w, \h, 5 +6: + FILTER_CHROMA_VPP_SVE2 \w, \h, 6 +7: + FILTER_CHROMA_VPP_SVE2 \w, \h, 7 +endfunc +.endm + +CHROMA_VPP_SVE2 2, 4 +CHROMA_VPP_SVE2 2, 8 +CHROMA_VPP_SVE2 2, 16 +CHROMA_VPP_SVE2 4, 2 +CHROMA_VPP_SVE2 4, 4 +CHROMA_VPP_SVE2 4, 8 +CHROMA_VPP_SVE2 4, 16 +CHROMA_VPP_SVE2 4, 32 +CHROMA_VPP_SVE2 6, 8 +CHROMA_VPP_SVE2 6, 16 +CHROMA_VPP_SVE2 8, 2 +CHROMA_VPP_SVE2 8, 4 +CHROMA_VPP_SVE2 8, 6 +CHROMA_VPP_SVE2 8, 8 +CHROMA_VPP_SVE2 8, 16 +CHROMA_VPP_SVE2 8, 32 +CHROMA_VPP_SVE2 8, 12 +CHROMA_VPP_SVE2 8, 64 +CHROMA_VPP_SVE2 12, 16 +CHROMA_VPP_SVE2 12, 32 +CHROMA_VPP_SVE2 16, 4 +CHROMA_VPP_SVE2 16, 8 +CHROMA_VPP_SVE2 16, 12 +CHROMA_VPP_SVE2 16, 16 +CHROMA_VPP_SVE2 16, 32 +CHROMA_VPP_SVE2 16, 64 +CHROMA_VPP_SVE2 16, 24 +CHROMA_VPP_SVE2 32, 8 +CHROMA_VPP_SVE2 32, 16 +CHROMA_VPP_SVE2 32, 24 +CHROMA_VPP_SVE2 32, 32 +CHROMA_VPP_SVE2 32, 64 +CHROMA_VPP_SVE2 32, 48 +CHROMA_VPP_SVE2 24, 32 +CHROMA_VPP_SVE2 24, 64 +CHROMA_VPP_SVE2 64, 16 +CHROMA_VPP_SVE2 64, 32 +CHROMA_VPP_SVE2 64, 48 +CHROMA_VPP_SVE2 64, 64 +CHROMA_VPP_SVE2 48, 64 + +.macro FILTER_CHROMA_VPS_SVE2 w, h, v + ptrue p0.h, vl8 + qpel_start_chroma_sve2_\v + mov z31.h, #8192 + lsl x3, x3, #1 + sub x0, x0, x1 + mov x5, #\h +.loop_vps_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.loop_vps_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_32b_sve2 \v + qpel_filter_chroma_sve2_\v\()_32b + vps_end_sve2 + add x9, x9, #8 +.if \w == 2 + str s17, x7, #4 +.elseif \w == 4 + str d17, x7, #8 +.elseif \w == 6 + str d17, x7, #8 + st1 {v17.s}2, x7, #4 +.elseif \w == 12 + str q17, x7, #16 + add x6, x0, x9 + qpel_chroma_load_32b_sve2 \v + qpel_filter_chroma_sve2_\v\()_32b + vps_end_sve2 + str d17, x7, #8 + add x9, x9, #8 +.else + str q17, x7, #16 +.endif + cmp x9, #\w + blt .loop_vps_w8_sve2_\v\()_\w\()x\h + + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_vps_sve2_\v\()_\w\()x\h + ret +.endm + +// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VPS_SVE2 w, h +function x265_interp_4tap_vert_ps_\w\()x\h\()_sve2 + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VPS_SVE2 \w, \h, 0 +1: + FILTER_CHROMA_VPS_SVE2 \w, \h, 1 +2: + FILTER_CHROMA_VPS_SVE2 \w, \h, 2 +3: + FILTER_CHROMA_VPS_SVE2 \w, \h, 3 +4: + FILTER_CHROMA_VPS_SVE2 \w, \h, 4 +5: + FILTER_CHROMA_VPS_SVE2 \w, \h, 5 +6: + FILTER_CHROMA_VPS_SVE2 \w, \h, 6 +7: + FILTER_CHROMA_VPS_SVE2 \w, \h, 7 +endfunc +.endm + +CHROMA_VPS_SVE2 2, 4 +CHROMA_VPS_SVE2 2, 8 +CHROMA_VPS_SVE2 2, 16 +CHROMA_VPS_SVE2 4, 2 +CHROMA_VPS_SVE2 4, 4 +CHROMA_VPS_SVE2 4, 8 +CHROMA_VPS_SVE2 4, 16 +CHROMA_VPS_SVE2 4, 32 +CHROMA_VPS_SVE2 6, 8 +CHROMA_VPS_SVE2 6, 16 +CHROMA_VPS_SVE2 8, 2 +CHROMA_VPS_SVE2 8, 4 +CHROMA_VPS_SVE2 8, 6 +CHROMA_VPS_SVE2 8, 8 
+CHROMA_VPS_SVE2 8, 16 +CHROMA_VPS_SVE2 8, 32 +CHROMA_VPS_SVE2 8, 12 +CHROMA_VPS_SVE2 8, 64 +CHROMA_VPS_SVE2 12, 16 +CHROMA_VPS_SVE2 12, 32 +CHROMA_VPS_SVE2 16, 4 +CHROMA_VPS_SVE2 16, 8 +CHROMA_VPS_SVE2 16, 12 +CHROMA_VPS_SVE2 16, 16 +CHROMA_VPS_SVE2 16, 32 +CHROMA_VPS_SVE2 16, 64 +CHROMA_VPS_SVE2 16, 24 +CHROMA_VPS_SVE2 32, 8 +CHROMA_VPS_SVE2 32, 16 +CHROMA_VPS_SVE2 32, 24 +CHROMA_VPS_SVE2 32, 32 +CHROMA_VPS_SVE2 32, 64 +CHROMA_VPS_SVE2 32, 48 +CHROMA_VPS_SVE2 24, 32 +CHROMA_VPS_SVE2 24, 64 +CHROMA_VPS_SVE2 64, 16 +CHROMA_VPS_SVE2 64, 32 +CHROMA_VPS_SVE2 64, 48 +CHROMA_VPS_SVE2 64, 64 +CHROMA_VPS_SVE2 48, 64 + +.macro qpel_start_chroma_sve2_0_1 + mov z24.h, #64 +.endm + +.macro qpel_start_chroma_sve2_1_1 + mov z24.h, #58 + mov z25.h, #10 +.endm + +.macro qpel_start_chroma_sve2_2_1 + mov z25.h, #54 +.endm + +.macro qpel_start_chroma_sve2_3_1 + mov z25.h, #46 + mov z26.h, #28 + mov z27.h, #6 +.endm + +.macro qpel_start_chroma_sve2_4_1 + mov z24.h, #36 +.endm + +.macro qpel_start_chroma_sve2_5_1 + mov z25.h, #28 + mov z26.h, #46 + mov z27.h, #6 +.endm + +.macro qpel_start_chroma_sve2_6_1 + mov z25.h, #54 +.endm + +.macro qpel_start_chroma_sve2_7_1 + mov z24.h, #58 + mov z25.h, #10 +.endm + +.macro FILTER_CHROMA_VSS_SVE2 w, h, v + lsl x1, x1, #1 + sub x0, x0, x1 + lsl x3, x3, #1 + mov x5, #\h + mov x12, #\w + lsl x12, x12, #1 + qpel_start_chroma_sve2_\v\()_1 +.loop_vss_sve2_\v\()_\w\()x\h: + mov x7, x2 + mov x9, #0 +.if \w == 4 +.rept 2 + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vss_end_sve2 + str s17, x7, #4 + add x9, x9, #4 +.endr +.else +.loop_vss_w8_sve2_\v\()_\w\()x\h: + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vss_end_sve2 + str q17, x7, #16 + add x9, x9, #16 +.if \w == 12 + add x6, x0, x9 + qpel_chroma_load_64b \v + qpel_filter_chroma_\v\()_32b_1 + vss_end_sve2 + str d17, x7, #8 + add x9, x9, #8 +.endif + cmp x9, x12 + blt .loop_vss_w8_sve2_\v\()_\w\()x\h +.endif + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_vss_sve2_\v\()_\w\()x\h + ret +.endm + +// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VSS_SVE2 w, h +function x265_interp_4tap_vert_ss_\w\()x\h\()_sve2 + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VSS_SVE2 \w, \h, 0 +1: + FILTER_CHROMA_VSS_SVE2 \w, \h, 1 +2: + FILTER_CHROMA_VSS_SVE2 \w, \h, 2 +3: + FILTER_CHROMA_VSS_SVE2 \w, \h, 3 +4: + FILTER_CHROMA_VSS_SVE2 \w, \h, 4 +5: + FILTER_CHROMA_VSS_SVE2 \w, \h, 5 +6: + FILTER_CHROMA_VSS_SVE2 \w, \h, 6 +7: + FILTER_CHROMA_VSS_SVE2 \w, \h, 7 +endfunc +.endm + +CHROMA_VSS_SVE2 4, 4 +CHROMA_VSS_SVE2 4, 8 +CHROMA_VSS_SVE2 4, 16 +CHROMA_VSS_SVE2 4, 32 +CHROMA_VSS_SVE2 8, 2 +CHROMA_VSS_SVE2 8, 4 +CHROMA_VSS_SVE2 8, 6 +CHROMA_VSS_SVE2 8, 8 +CHROMA_VSS_SVE2 8, 16 +CHROMA_VSS_SVE2 8, 32 +CHROMA_VSS_SVE2 8, 12 +CHROMA_VSS_SVE2 8, 64 +CHROMA_VSS_SVE2 12, 16 +CHROMA_VSS_SVE2 12, 32 +CHROMA_VSS_SVE2 16, 4 +CHROMA_VSS_SVE2 16, 8 +CHROMA_VSS_SVE2 16, 12 +CHROMA_VSS_SVE2 16, 16 +CHROMA_VSS_SVE2 16, 32 +CHROMA_VSS_SVE2 16, 64 +CHROMA_VSS_SVE2 16, 24 +CHROMA_VSS_SVE2 32, 8 +CHROMA_VSS_SVE2 32, 16 +CHROMA_VSS_SVE2 32, 24 +CHROMA_VSS_SVE2 32, 32 +CHROMA_VSS_SVE2 32, 64 +CHROMA_VSS_SVE2 32, 48 +CHROMA_VSS_SVE2 24, 32 +CHROMA_VSS_SVE2 24, 64 +CHROMA_VSS_SVE2 64, 16 +CHROMA_VSS_SVE2 64, 32 +CHROMA_VSS_SVE2 64, 48 +CHROMA_VSS_SVE2 64, 64 
+CHROMA_VSS_SVE2 48, 64
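
The SVE2 macros above all evaluate the same 4-tap HEVC chroma filter as the C fallback; the per-phase constants in qpel_start_chroma_sve2_1 through _7 are the non-trivial taps, the remaining factors of 2, 4 and 6 come from shifts, and the ss path finishes with the arithmetic shift right by 6 seen in vss_end_sve2. As a reference point, here is a minimal scalar sketch of the vertical ss computation those macros vectorise; the function name, the explicit width/height parameters and the local coefficient table are illustrative only, not the exact x265 C sources.

    #include <cstdint>
    #include <cstddef>

    // HEVC 4-tap chroma taps per quarter-pel phase; these match the constants
    // used by the qpel_start_chroma_sve2_0..7 macros above.
    static const int16_t chromaTaps[8][4] = {
        {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
        { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 }
    };

    // Scalar model of a vertical ss interpolation block (16-bit in, 16-bit out).
    void chromaVertSS_ref(const int16_t* src, intptr_t srcStride,
                          int16_t* dst, intptr_t dstStride,
                          int coeffIdx, int width, int height)
    {
        const int16_t* c = chromaTaps[coeffIdx];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int sum = c[0] * src[x - srcStride]       // tap a, one row above
                        + c[1] * src[x]                   // tap b, current row
                        + c[2] * src[x + srcStride]       // tap c
                        + c[3] * src[x + 2 * srcStride];  // tap d
                dst[x] = (int16_t)(sum >> 6);             // matches the asr #6 in vss_end_sve2
            }
            src += srcStride;
            dst += dstStride;
        }
    }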
View file
x265_3.6.tar.gz/source/common/aarch64/ipfilter.S
Added
@@ -0,0 +1,1054 @@ +/***************************************************************************** + * Copyright (C) 2021 MulticoreWare, Inc + * + * Authors: Sebastian Pop <spop@amazon.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// Functions in this file: +// ***** luma_vpp ***** +// ***** luma_vps ***** +// ***** luma_vsp ***** +// ***** luma_vss ***** +// ***** luma_hpp ***** +// ***** luma_hps ***** +// ***** chroma_vpp ***** +// ***** chroma_vps ***** +// ***** chroma_vsp ***** +// ***** chroma_vss ***** +// ***** chroma_hpp ***** +// ***** chroma_hps ***** + +#include "asm.S" +#include "ipfilter-common.S" + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +// ***** luma_vpp ***** +// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPP_4xN h +function x265_interp_8tap_vert_pp_4x\h\()_neon + movrel x10, g_luma_s16 + sub x0, x0, x1 + sub x0, x0, x1, lsl #1 // src -= 3 * srcStride + lsl x4, x4, #4 + ldr q0, x10, x4 // q0 = luma interpolate coeff + dup v24.8h, v0.h0 + dup v25.8h, v0.h1 + trn1 v24.2d, v24.2d, v25.2d + dup v26.8h, v0.h2 + dup v27.8h, v0.h3 + trn1 v26.2d, v26.2d, v27.2d + dup v28.8h, v0.h4 + dup v29.8h, v0.h5 + trn1 v28.2d, v28.2d, v29.2d + dup v30.8h, v0.h6 + dup v31.8h, v0.h7 + trn1 v30.2d, v30.2d, v31.2d + + // prepare to load 8 lines + ld1 {v0.s}0, x0, x1 + ld1 {v0.s}1, x0, x1 + ushll v0.8h, v0.8b, #0 + ld1 {v1.s}0, x0, x1 + ld1 {v1.s}1, x0, x1 + ushll v1.8h, v1.8b, #0 + ld1 {v2.s}0, x0, x1 + ld1 {v2.s}1, x0, x1 + ushll v2.8h, v2.8b, #0 + ld1 {v3.s}0, x0, x1 + ld1 {v3.s}1, x0, x1 + ushll v3.8h, v3.8b, #0 + + mov x9, #\h +.loop_4x\h: + ld1 {v4.s}0, x0, x1 + ld1 {v4.s}1, x0, x1 + ushll v4.8h, v4.8b, #0 + + // row0-1 + mul v16.8h, v0.8h, v24.8h + ext v21.16b, v0.16b, v1.16b, #8 + mul v17.8h, v21.8h, v24.8h + mov v0.16b, v1.16b + + // row2-3 + mla v16.8h, v1.8h, v26.8h + ext v21.16b, v1.16b, v2.16b, #8 + mla v17.8h, v21.8h, v26.8h + mov v1.16b, v2.16b + + // row4-5 + mla v16.8h, v2.8h, v28.8h + ext v21.16b, v2.16b, v3.16b, #8 + mla v17.8h, v21.8h, v28.8h + mov v2.16b, v3.16b + + // row6-7 + mla v16.8h, v3.8h, v30.8h + ext v21.16b, v3.16b, v4.16b, #8 + mla v17.8h, v21.8h, v30.8h + mov v3.16b, v4.16b + + // sum row0-7 + trn1 v20.2d, v16.2d, v17.2d + trn2 v21.2d, v16.2d, v17.2d + add v16.8h, v20.8h, v21.8h + + sqrshrun v16.8b, v16.8h, #6 + st1 {v16.s}0, x2, x3 + st1 {v16.s}1, x2, x3 + + sub x9, x9, #2 + cbnz x9, .loop_4x\h + ret +endfunc +.endm + +LUMA_VPP_4xN 4 +LUMA_VPP_4xN 8 +LUMA_VPP_4xN 16 + +// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, 
pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPP w, h +function x265_interp_8tap_vert_pp_\w\()x\h\()_neon + cmp x4, #0 + b.eq 0f + cmp x4, #1 + b.eq 1f + cmp x4, #2 + b.eq 2f + cmp x4, #3 + b.eq 3f +0: + FILTER_LUMA_VPP \w, \h, 0 +1: + FILTER_LUMA_VPP \w, \h, 1 +2: + FILTER_LUMA_VPP \w, \h, 2 +3: + FILTER_LUMA_VPP \w, \h, 3 +endfunc +.endm + +LUMA_VPP 8, 4 +LUMA_VPP 8, 8 +LUMA_VPP 8, 16 +LUMA_VPP 8, 32 +LUMA_VPP 12, 16 +LUMA_VPP 16, 4 +LUMA_VPP 16, 8 +LUMA_VPP 16, 16 +LUMA_VPP 16, 32 +LUMA_VPP 16, 64 +LUMA_VPP 16, 12 +LUMA_VPP 24, 32 +LUMA_VPP 32, 8 +LUMA_VPP 32, 16 +LUMA_VPP 32, 32 +LUMA_VPP 32, 64 +LUMA_VPP 32, 24 +LUMA_VPP 48, 64 +LUMA_VPP 64, 16 +LUMA_VPP 64, 32 +LUMA_VPP 64, 64 +LUMA_VPP 64, 48 + +// ***** luma_vps ***** +// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPS_4xN h +function x265_interp_8tap_vert_ps_4x\h\()_neon + lsl x3, x3, #1 + lsl x5, x4, #6 + lsl x4, x1, #2 + sub x4, x4, x1 + sub x0, x0, x4 + + mov w6, #8192 + dup v28.4s, w6 + mov x4, #\h + movrel x12, g_lumaFilter + add x12, x12, x5 + ld1r {v16.2d}, x12, #8 + ld1r {v17.2d}, x12, #8 + ld1r {v18.2d}, x12, #8 + ld1r {v19.2d}, x12, #8 + ld1r {v20.2d}, x12, #8 + ld1r {v21.2d}, x12, #8 + ld1r {v22.2d}, x12, #8 + ld1r {v23.2d}, x12, #8 + +.loop_vps_4x\h: + mov x6, x0 + + ld1 {v0.s}0, x6, x1 + ld1 {v1.s}0, x6, x1 + ld1 {v2.s}0, x6, x1 + ld1 {v3.s}0, x6, x1 + ld1 {v4.s}0, x6, x1 + ld1 {v5.s}0, x6, x1 + ld1 {v6.s}0, x6, x1 + ld1 {v7.s}0, x6, x1 + uxtl v0.8h, v0.8b + uxtl v0.4s, v0.4h + + uxtl v1.8h, v1.8b + uxtl v1.4s, v1.4h + mul v0.4s, v0.4s, v16.4s + + uxtl v2.8h, v2.8b + uxtl v2.4s, v2.4h + mla v0.4s, v1.4s, v17.4s + + uxtl v3.8h, v3.8b + uxtl v3.4s, v3.4h + mla v0.4s, v2.4s, v18.4s + + uxtl v4.8h, v4.8b + uxtl v4.4s, v4.4h + mla v0.4s, v3.4s, v19.4s + + uxtl v5.8h, v5.8b + uxtl v5.4s, v5.4h + mla v0.4s, v4.4s, v20.4s + + uxtl v6.8h, v6.8b + uxtl v6.4s, v6.4h + mla v0.4s, v5.4s, v21.4s + + uxtl v7.8h, v7.8b + uxtl v7.4s, v7.4h + mla v0.4s, v6.4s, v22.4s + + mla v0.4s, v7.4s, v23.4s + + sub v0.4s, v0.4s, v28.4s + sqxtn v0.4h, v0.4s + st1 {v0.8b}, x2, x3 + + add x0, x0, x1 + sub x4, x4, #1 + cbnz x4, .loop_vps_4x\h + ret +endfunc +.endm + +LUMA_VPS_4xN 4 +LUMA_VPS_4xN 8 +LUMA_VPS_4xN 16 + +// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VPS w, h +function x265_interp_8tap_vert_ps_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f +0: + FILTER_VPS \w, \h, 0 +1: + FILTER_VPS \w, \h, 1 +2: + FILTER_VPS \w, \h, 2 +3: + FILTER_VPS \w, \h, 3 +endfunc +.endm + +LUMA_VPS 8, 4 +LUMA_VPS 8, 8 +LUMA_VPS 8, 16 +LUMA_VPS 8, 32 +LUMA_VPS 12, 16 +LUMA_VPS 16, 4 +LUMA_VPS 16, 8 +LUMA_VPS 16, 16 +LUMA_VPS 16, 32 +LUMA_VPS 16, 64 +LUMA_VPS 16, 12 +LUMA_VPS 24, 32 +LUMA_VPS 32, 8 +LUMA_VPS 32, 16 +LUMA_VPS 32, 32 +LUMA_VPS 32, 64 +LUMA_VPS 32, 24 +LUMA_VPS 48, 64 +LUMA_VPS 64, 16 +LUMA_VPS 64, 32 +LUMA_VPS 64, 64 +LUMA_VPS 64, 48 + +// ***** luma_vsp ***** +// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VSP_4xN h +function x265_interp_8tap_vert_sp_4x\h\()_neon + lsl x5, x4, #6 + lsl x1, x1, #1 + lsl x4, x1, #2 + sub x4, x4, x1 + sub x0, x0, x4 + + mov w12, #1 + lsl w12, w12, #19 + add w12, w12, #2048 + dup v24.4s, w12 + mov x4, #\h + movrel x12, g_lumaFilter + add x12, x12, x5 + ld1r {v16.2d}, x12, #8 + ld1r {v17.2d}, x12, #8 + ld1r {v18.2d}, x12, #8 
+ ld1r {v19.2d}, x12, #8 + ld1r {v20.2d}, x12, #8 + ld1r {v21.2d}, x12, #8 + ld1r {v22.2d}, x12, #8 + ld1r {v23.2d}, x12, #8 +.loop_vsp_4x\h: + mov x6, x0 + + ld1 {v0.8b}, x6, x1 + ld1 {v1.8b}, x6, x1 + ld1 {v2.8b}, x6, x1 + ld1 {v3.8b}, x6, x1 + ld1 {v4.8b}, x6, x1 + ld1 {v5.8b}, x6, x1 + ld1 {v6.8b}, x6, x1 + ld1 {v7.8b}, x6, x1 + + sshll v0.4s, v0.4h, #0 + sshll v1.4s, v1.4h, #0 + mul v0.4s, v0.4s, v16.4s + sshll v2.4s, v2.4h, #0 + mla v0.4s, v1.4s, v17.4s + sshll v3.4s, v3.4h, #0 + mla v0.4s, v2.4s, v18.4s + sshll v4.4s, v4.4h, #0 + mla v0.4s, v3.4s, v19.4s + sshll v5.4s, v5.4h, #0 + mla v0.4s, v4.4s, v20.4s + sshll v6.4s, v6.4h, #0 + mla v0.4s, v5.4s, v21.4s + sshll v7.4s, v7.4h, #0 + mla v0.4s, v6.4s, v22.4s + + mla v0.4s, v7.4s, v23.4s + + add v0.4s, v0.4s, v24.4s + sqshrun v0.4h, v0.4s, #12 + sqxtun v0.8b, v0.8h + st1 {v0.s}0, x2, x3 + + add x0, x0, x1 + sub x4, x4, #1 + cbnz x4, .loop_vsp_4x\h + ret +endfunc +.endm + +LUMA_VSP_4xN 4 +LUMA_VSP_4xN 8 +LUMA_VSP_4xN 16 + +// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VSP w, h +function x265_interp_8tap_vert_sp_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f +0: + FILTER_VSP \w, \h, 0 +1: + FILTER_VSP \w, \h, 1 +2: + FILTER_VSP \w, \h, 2 +3: + FILTER_VSP \w, \h, 3 +endfunc +.endm + +LUMA_VSP 8, 4 +LUMA_VSP 8, 8 +LUMA_VSP 8, 16 +LUMA_VSP 8, 32 +LUMA_VSP 12, 16 +LUMA_VSP 16, 4 +LUMA_VSP 16, 8 +LUMA_VSP 16, 16 +LUMA_VSP 16, 32 +LUMA_VSP 16, 64 +LUMA_VSP 16, 12 +LUMA_VSP 32, 8 +LUMA_VSP 32, 16 +LUMA_VSP 32, 32 +LUMA_VSP 32, 64 +LUMA_VSP 32, 24 +LUMA_VSP 64, 16 +LUMA_VSP 64, 32 +LUMA_VSP 64, 64 +LUMA_VSP 64, 48 +LUMA_VSP 24, 32 +LUMA_VSP 48, 64 + +// ***** luma_vss ***** +// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_VSS w, h +function x265_interp_8tap_vert_ss_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f +0: + FILTER_VSS \w, \h, 0 +1: + FILTER_VSS \w, \h, 1 +2: + FILTER_VSS \w, \h, 2 +3: + FILTER_VSS \w, \h, 3 +endfunc +.endm + +LUMA_VSS 4, 4 +LUMA_VSS 4, 8 +LUMA_VSS 4, 16 +LUMA_VSS 8, 4 +LUMA_VSS 8, 8 +LUMA_VSS 8, 16 +LUMA_VSS 8, 32 +LUMA_VSS 12, 16 +LUMA_VSS 16, 4 +LUMA_VSS 16, 8 +LUMA_VSS 16, 16 +LUMA_VSS 16, 32 +LUMA_VSS 16, 64 +LUMA_VSS 16, 12 +LUMA_VSS 32, 8 +LUMA_VSS 32, 16 +LUMA_VSS 32, 32 +LUMA_VSS 32, 64 +LUMA_VSS 32, 24 +LUMA_VSS 64, 16 +LUMA_VSS 64, 32 +LUMA_VSS 64, 64 +LUMA_VSS 64, 48 +LUMA_VSS 24, 32 +LUMA_VSS 48, 64 + +// ***** luma_hpp ***** +// void interp_horiz_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro LUMA_HPP w, h +function x265_interp_horiz_pp_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f +0: + FILTER_HPP \w, \h, 0 +1: + FILTER_HPP \w, \h, 1 +2: + FILTER_HPP \w, \h, 2 +3: + FILTER_HPP \w, \h, 3 +endfunc +.endm + +LUMA_HPP 4, 4 +LUMA_HPP 4, 8 +LUMA_HPP 4, 16 +LUMA_HPP 8, 4 +LUMA_HPP 8, 8 +LUMA_HPP 8, 16 +LUMA_HPP 8, 32 +LUMA_HPP 12, 16 +LUMA_HPP 16, 4 +LUMA_HPP 16, 8 +LUMA_HPP 16, 12 +LUMA_HPP 16, 16 +LUMA_HPP 16, 32 +LUMA_HPP 16, 64 +LUMA_HPP 24, 32 +LUMA_HPP 32, 8 +LUMA_HPP 32, 16 +LUMA_HPP 32, 24 +LUMA_HPP 32, 32 +LUMA_HPP 32, 64 +LUMA_HPP 48, 64 +LUMA_HPP 64, 16 +LUMA_HPP 64, 32 +LUMA_HPP 64, 48 +LUMA_HPP 64, 64 + +// ***** luma_hps ***** +// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, 
int coeffIdx, int isRowExt) +.macro LUMA_HPS w, h +function x265_interp_horiz_ps_\w\()x\h\()_neon + mov w10, #\h + cmp w5, #0 + b.eq 6f + sub x0, x0, x1, lsl #2 + add x0, x0, x1 + add w10, w10, #7 +6: + mov w6, w10 + cmp w4, #0 + b.eq 0f + cmp w4, #1 + b.eq 1f + cmp w4, #2 + b.eq 2f + cmp w4, #3 + b.eq 3f +0: + FILTER_HPS \w, \h, 0 +1: + FILTER_HPS \w, \h, 1 +2: + FILTER_HPS \w, \h, 2 +3: + FILTER_HPS \w, \h, 3 +endfunc +.endm + +LUMA_HPS 4, 4 +LUMA_HPS 4, 8 +LUMA_HPS 4, 16 +LUMA_HPS 8, 4 +LUMA_HPS 8, 8 +LUMA_HPS 8, 16 +LUMA_HPS 8, 32 +LUMA_HPS 12, 16 +LUMA_HPS 16, 4 +LUMA_HPS 16, 8 +LUMA_HPS 16, 12 +LUMA_HPS 16, 16 +LUMA_HPS 16, 32 +LUMA_HPS 16, 64 +LUMA_HPS 24, 32 +LUMA_HPS 32, 8 +LUMA_HPS 32, 16 +LUMA_HPS 32, 24 +LUMA_HPS 32, 32 +LUMA_HPS 32, 64 +LUMA_HPS 48, 64 +LUMA_HPS 64, 16 +LUMA_HPS 64, 32 +LUMA_HPS 64, 48 +LUMA_HPS 64, 64 + +// ***** chroma_vpp ***** +// void interp_vert_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VPP w, h +function x265_interp_4tap_vert_pp_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VPP \w, \h, 0 +1: + FILTER_CHROMA_VPP \w, \h, 1 +2: + FILTER_CHROMA_VPP \w, \h, 2 +3: + FILTER_CHROMA_VPP \w, \h, 3 +4: + FILTER_CHROMA_VPP \w, \h, 4 +5: + FILTER_CHROMA_VPP \w, \h, 5 +6: + FILTER_CHROMA_VPP \w, \h, 6 +7: + FILTER_CHROMA_VPP \w, \h, 7 +endfunc +.endm + +CHROMA_VPP 2, 4 +CHROMA_VPP 2, 8 +CHROMA_VPP 2, 16 +CHROMA_VPP 4, 2 +CHROMA_VPP 4, 4 +CHROMA_VPP 4, 8 +CHROMA_VPP 4, 16 +CHROMA_VPP 4, 32 +CHROMA_VPP 6, 8 +CHROMA_VPP 6, 16 +CHROMA_VPP 8, 2 +CHROMA_VPP 8, 4 +CHROMA_VPP 8, 6 +CHROMA_VPP 8, 8 +CHROMA_VPP 8, 16 +CHROMA_VPP 8, 32 +CHROMA_VPP 8, 12 +CHROMA_VPP 8, 64 +CHROMA_VPP 12, 16 +CHROMA_VPP 12, 32 +CHROMA_VPP 16, 4 +CHROMA_VPP 16, 8 +CHROMA_VPP 16, 12 +CHROMA_VPP 16, 16 +CHROMA_VPP 16, 32 +CHROMA_VPP 16, 64 +CHROMA_VPP 16, 24 +CHROMA_VPP 32, 8 +CHROMA_VPP 32, 16 +CHROMA_VPP 32, 24 +CHROMA_VPP 32, 32 +CHROMA_VPP 32, 64 +CHROMA_VPP 32, 48 +CHROMA_VPP 24, 32 +CHROMA_VPP 24, 64 +CHROMA_VPP 64, 16 +CHROMA_VPP 64, 32 +CHROMA_VPP 64, 48 +CHROMA_VPP 64, 64 +CHROMA_VPP 48, 64 + +// ***** chroma_vps ***** +// void interp_vert_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VPS w, h +function x265_interp_4tap_vert_ps_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VPS \w, \h, 0 +1: + FILTER_CHROMA_VPS \w, \h, 1 +2: + FILTER_CHROMA_VPS \w, \h, 2 +3: + FILTER_CHROMA_VPS \w, \h, 3 +4: + FILTER_CHROMA_VPS \w, \h, 4 +5: + FILTER_CHROMA_VPS \w, \h, 5 +6: + FILTER_CHROMA_VPS \w, \h, 6 +7: + FILTER_CHROMA_VPS \w, \h, 7 +endfunc +.endm + +CHROMA_VPS 2, 4 +CHROMA_VPS 2, 8 +CHROMA_VPS 2, 16 +CHROMA_VPS 4, 2 +CHROMA_VPS 4, 4 +CHROMA_VPS 4, 8 +CHROMA_VPS 4, 16 +CHROMA_VPS 4, 32 +CHROMA_VPS 6, 8 +CHROMA_VPS 6, 16 +CHROMA_VPS 8, 2 +CHROMA_VPS 8, 4 +CHROMA_VPS 8, 6 +CHROMA_VPS 8, 8 +CHROMA_VPS 8, 16 +CHROMA_VPS 8, 32 +CHROMA_VPS 8, 12 +CHROMA_VPS 8, 64 +CHROMA_VPS 12, 16 +CHROMA_VPS 12, 32 +CHROMA_VPS 16, 4 +CHROMA_VPS 16, 8 +CHROMA_VPS 16, 12 +CHROMA_VPS 16, 16 +CHROMA_VPS 16, 32 +CHROMA_VPS 16, 64 +CHROMA_VPS 16, 24 +CHROMA_VPS 32, 8 +CHROMA_VPS 32, 16 +CHROMA_VPS 32, 24 +CHROMA_VPS 32, 32 +CHROMA_VPS 32, 64 +CHROMA_VPS 32, 48 +CHROMA_VPS 24, 32 
+CHROMA_VPS 24, 64 +CHROMA_VPS 64, 16 +CHROMA_VPS 64, 32 +CHROMA_VPS 64, 48 +CHROMA_VPS 64, 64 +CHROMA_VPS 48, 64 + +// ***** chroma_vsp ***** +// void interp_vert_sp_c(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VSP w, h +function x265_interp_4tap_vert_sp_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VSP \w, \h, 0 +1: + FILTER_CHROMA_VSP \w, \h, 1 +2: + FILTER_CHROMA_VSP \w, \h, 2 +3: + FILTER_CHROMA_VSP \w, \h, 3 +4: + FILTER_CHROMA_VSP \w, \h, 4 +5: + FILTER_CHROMA_VSP \w, \h, 5 +6: + FILTER_CHROMA_VSP \w, \h, 6 +7: + FILTER_CHROMA_VSP \w, \h, 7 +endfunc +.endm + +CHROMA_VSP 4, 4 +CHROMA_VSP 4, 8 +CHROMA_VSP 4, 16 +CHROMA_VSP 4, 32 +CHROMA_VSP 8, 2 +CHROMA_VSP 8, 4 +CHROMA_VSP 8, 6 +CHROMA_VSP 8, 8 +CHROMA_VSP 8, 16 +CHROMA_VSP 8, 32 +CHROMA_VSP 8, 12 +CHROMA_VSP 8, 64 +CHROMA_VSP 12, 16 +CHROMA_VSP 12, 32 +CHROMA_VSP 16, 4 +CHROMA_VSP 16, 8 +CHROMA_VSP 16, 12 +CHROMA_VSP 16, 16 +CHROMA_VSP 16, 32 +CHROMA_VSP 16, 64 +CHROMA_VSP 16, 24 +CHROMA_VSP 32, 8 +CHROMA_VSP 32, 16 +CHROMA_VSP 32, 24 +CHROMA_VSP 32, 32 +CHROMA_VSP 32, 64 +CHROMA_VSP 32, 48 +CHROMA_VSP 24, 32 +CHROMA_VSP 24, 64 +CHROMA_VSP 64, 16 +CHROMA_VSP 64, 32 +CHROMA_VSP 64, 48 +CHROMA_VSP 64, 64 +CHROMA_VSP 48, 64 + +// ***** chroma_vss ***** +// void interp_vert_ss_c(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_VSS w, h +function x265_interp_4tap_vert_ss_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_VSS \w, \h, 0 +1: + FILTER_CHROMA_VSS \w, \h, 1 +2: + FILTER_CHROMA_VSS \w, \h, 2 +3: + FILTER_CHROMA_VSS \w, \h, 3 +4: + FILTER_CHROMA_VSS \w, \h, 4 +5: + FILTER_CHROMA_VSS \w, \h, 5 +6: + FILTER_CHROMA_VSS \w, \h, 6 +7: + FILTER_CHROMA_VSS \w, \h, 7 +endfunc +.endm + +CHROMA_VSS 4, 4 +CHROMA_VSS 4, 8 +CHROMA_VSS 4, 16 +CHROMA_VSS 4, 32 +CHROMA_VSS 8, 2 +CHROMA_VSS 8, 4 +CHROMA_VSS 8, 6 +CHROMA_VSS 8, 8 +CHROMA_VSS 8, 16 +CHROMA_VSS 8, 32 +CHROMA_VSS 8, 12 +CHROMA_VSS 8, 64 +CHROMA_VSS 12, 16 +CHROMA_VSS 12, 32 +CHROMA_VSS 16, 4 +CHROMA_VSS 16, 8 +CHROMA_VSS 16, 12 +CHROMA_VSS 16, 16 +CHROMA_VSS 16, 32 +CHROMA_VSS 16, 64 +CHROMA_VSS 16, 24 +CHROMA_VSS 32, 8 +CHROMA_VSS 32, 16 +CHROMA_VSS 32, 24 +CHROMA_VSS 32, 32 +CHROMA_VSS 32, 64 +CHROMA_VSS 32, 48 +CHROMA_VSS 24, 32 +CHROMA_VSS 24, 64 +CHROMA_VSS 64, 16 +CHROMA_VSS 64, 32 +CHROMA_VSS 64, 48 +CHROMA_VSS 64, 64 +CHROMA_VSS 48, 64 + +// ***** chroma_hpp ***** +// void interp_horiz_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro CHROMA_HPP w, h +function x265_interp_4tap_horiz_pp_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_HPP \w, \h, 0 +1: + FILTER_CHROMA_HPP \w, \h, 1 +2: + FILTER_CHROMA_HPP \w, \h, 2 +3: + FILTER_CHROMA_HPP \w, \h, 3 +4: + FILTER_CHROMA_HPP \w, \h, 4 +5: + FILTER_CHROMA_HPP \w, \h, 5 +6: + FILTER_CHROMA_HPP \w, \h, 6 +7: + FILTER_CHROMA_HPP \w, \h, 7 +endfunc +.endm + +CHROMA_HPP 2, 4 +CHROMA_HPP 2, 8 +CHROMA_HPP 2, 16 +CHROMA_HPP 4, 2 +CHROMA_HPP 4, 4 +CHROMA_HPP 4, 8 +CHROMA_HPP 4, 16 +CHROMA_HPP 4, 32 
+CHROMA_HPP 6, 8 +CHROMA_HPP 6, 16 +CHROMA_HPP 8, 2 +CHROMA_HPP 8, 4 +CHROMA_HPP 8, 6 +CHROMA_HPP 8, 8 +CHROMA_HPP 8, 12 +CHROMA_HPP 8, 16 +CHROMA_HPP 8, 32 +CHROMA_HPP 8, 64 +CHROMA_HPP 12, 16 +CHROMA_HPP 12, 32 +CHROMA_HPP 16, 4 +CHROMA_HPP 16, 8 +CHROMA_HPP 16, 12 +CHROMA_HPP 16, 16 +CHROMA_HPP 16, 24 +CHROMA_HPP 16, 32 +CHROMA_HPP 16, 64 +CHROMA_HPP 24, 32 +CHROMA_HPP 24, 64 +CHROMA_HPP 32, 8 +CHROMA_HPP 32, 16 +CHROMA_HPP 32, 24 +CHROMA_HPP 32, 32 +CHROMA_HPP 32, 48 +CHROMA_HPP 32, 64 +CHROMA_HPP 48, 64 +CHROMA_HPP 64, 16 +CHROMA_HPP 64, 32 +CHROMA_HPP 64, 48 +CHROMA_HPP 64, 64 + +// ***** chroma_hps ***** +// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +.macro CHROMA_HPS w, h +function x265_interp_4tap_horiz_ps_\w\()x\h\()_neon + cmp x4, #0 + beq 0f + cmp x4, #1 + beq 1f + cmp x4, #2 + beq 2f + cmp x4, #3 + beq 3f + cmp x4, #4 + beq 4f + cmp x4, #5 + beq 5f + cmp x4, #6 + beq 6f + cmp x4, #7 + beq 7f +0: + FILTER_CHROMA_HPS \w, \h, 0 +1: + FILTER_CHROMA_HPS \w, \h, 1 +2: + FILTER_CHROMA_HPS \w, \h, 2 +3: + FILTER_CHROMA_HPS \w, \h, 3 +4: + FILTER_CHROMA_HPS \w, \h, 4 +5: + FILTER_CHROMA_HPS \w, \h, 5 +6: + FILTER_CHROMA_HPS \w, \h, 6 +7: + FILTER_CHROMA_HPS \w, \h, 7 +endfunc +.endm + +CHROMA_HPS 2, 4 +CHROMA_HPS 2, 8 +CHROMA_HPS 2, 16 +CHROMA_HPS 4, 2 +CHROMA_HPS 4, 4 +CHROMA_HPS 4, 8 +CHROMA_HPS 4, 16 +CHROMA_HPS 4, 32 +CHROMA_HPS 6, 8 +CHROMA_HPS 6, 16 +CHROMA_HPS 8, 2 +CHROMA_HPS 8, 4 +CHROMA_HPS 8, 6 +CHROMA_HPS 8, 8 +CHROMA_HPS 8, 12 +CHROMA_HPS 8, 16 +CHROMA_HPS 8, 32 +CHROMA_HPS 8, 64 +CHROMA_HPS 12, 16 +CHROMA_HPS 12, 32 +CHROMA_HPS 16, 4 +CHROMA_HPS 16, 8 +CHROMA_HPS 16, 12 +CHROMA_HPS 16, 16 +CHROMA_HPS 16, 24 +CHROMA_HPS 16, 32 +CHROMA_HPS 16, 64 +CHROMA_HPS 24, 32 +CHROMA_HPS 24, 64 +CHROMA_HPS 32, 8 +CHROMA_HPS 32, 16 +CHROMA_HPS 32, 24 +CHROMA_HPS 32, 32 +CHROMA_HPS 32, 48 +CHROMA_HPS 32, 64 +CHROMA_HPS 48, 64 +CHROMA_HPS 64, 16 +CHROMA_HPS 64, 32 +CHROMA_HPS 64, 48 +CHROMA_HPS 64, 64 + +const g_luma_s16, align=8 +// a, b, c, d, e, f, g, h +.hword 0, 0, 0, 64, 0, 0, 0, 0 +.hword -1, 4, -10, 58, 17, -5, 1, 0 +.hword -1, 4, -11, 40, 40, -11, 4, -1 +.hword 0, 1, -5, 17, 58, -10, 4, -1 +endconst
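
The g_luma_s16 table just above holds the four HEVC 8-tap luma coefficient rows (full-pel plus the three quarter-pel phases) that the LUMA_* macros in this file consume. As a rough scalar model of one output pixel of the vertical pp filter, assuming 8-bit pixels, where the sqrshrun ..., #6 instruction performs the +32 rounding, the >>6 shift and the clamp in a single step; names below are illustrative:

    #include <cstdint>
    #include <cstddef>

    // Same coefficient rows as g_luma_s16 above.
    static const int16_t lumaTaps[4][8] = {
        {  0, 0,   0, 64,  0,   0, 0,  0 },
        { -1, 4, -10, 58, 17,  -5, 1,  0 },
        { -1, 4, -11, 40, 40, -11, 4, -1 },
        {  0, 1,  -5, 17, 58, -10, 4, -1 }
    };

    static inline uint8_t clipU8(int v) { return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); }

    // One output sample of the 8-tap vertical pp filter; src points at the
    // sample the filter is centred on (the 4th tap), which is why the
    // assembly rewinds the source pointer by 3 * srcStride first.
    uint8_t lumaVertPP_ref(const uint8_t* src, intptr_t srcStride, int coeffIdx)
    {
        const int16_t* c = lumaTaps[coeffIdx];
        int sum = 0;
        for (int i = 0; i < 8; i++)
            sum += c[i] * src[(i - 3) * srcStride];
        return clipU8((sum + 32) >> 6);   // what sqrshrun v, #6 computes
    }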
View file
x265_3.6.tar.gz/source/common/aarch64/loopfilter-prim.cpp
Added
@@ -0,0 +1,291 @@
+#include "loopfilter-prim.h"
+
+#define PIXEL_MIN 0
+
+
+
+#if !(HIGH_BIT_DEPTH) && defined(HAVE_NEON)
+#include<arm_neon.h>
+
+namespace
+{
+
+
+/* get the sign of input variable (TODO: this is a dup, make common) */
+static inline int8_t signOf(int x)
+{
+    return (x >> 31) | ((int)((((uint32_t) - x)) >> 31));
+}
+
+static inline int8x8_t sign_diff_neon(const uint8x8_t in0, const uint8x8_t in1)
+{
+    int16x8_t in = vsubl_u8(in0, in1);
+    return vmovn_s16(vmaxq_s16(vminq_s16(in, vdupq_n_s16(1)), vdupq_n_s16(-1)));
+}
+
+static void calSign_neon(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
+{
+    int x = 0;
+    for (; (x + 8) <= endX; x += 8)
+    {
+        *(int8x8_t *)&dst[x] = sign_diff_neon(*(uint8x8_t *)&src1[x], *(uint8x8_t *)&src2[x]);
+    }
+
+    for (; x < endX; x++)
+    {
+        dst[x] = signOf(src1[x] - src2[x]);
+    }
+}
+
+static void processSaoCUE0_neon(pixel *rec, int8_t *offsetEo, int width, int8_t *signLeft, intptr_t stride)
+{
+
+
+    int y;
+    int8_t signRight, signLeft0;
+    int8_t edgeType;
+
+    for (y = 0; y < 2; y++)
+    {
+        signLeft0 = signLeft[y];
+        int x = 0;
+
+        if (width >= 8)
+        {
+            int8x8_t vsignRight;
+            int8x8x2_t shifter;
+            shifter.val[1][0] = signLeft0;
+            static const int8x8_t index = {8, 0, 1, 2, 3, 4, 5, 6};
+            int8x8_t tbl = *(int8x8_t *)offsetEo;
+            for (; (x + 8) <= width; x += 8)
+            {
+                uint8x8_t in = *(uint8x8_t *)&rec[x];
+                vsignRight = sign_diff_neon(in, *(uint8x8_t *)&rec[x + 1]);
+                shifter.val[0] = vneg_s8(vsignRight);
+                int8x8_t tmp = shifter.val[0];
+                int8x8_t edge = vtbl2_s8(shifter, index);
+                int8x8_t vedgeType = vadd_s8(vadd_s8(vsignRight, edge), vdup_n_s8(2));
+                shifter.val[1][0] = tmp[7];
+                int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
+                t1 = vaddw_u8(t1, in);
+                t1 = vmaxq_s16(t1, vdupq_n_s16(0));
+                t1 = vminq_s16(t1, vdupq_n_s16(255));
+                *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
+            }
+            signLeft0 = shifter.val[1][0];
+        }
+        for (; x < width; x++)
+        {
+            signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
+            edgeType = signRight + signLeft0 + 2;
+            signLeft0 = -signRight;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
+    }
+}
+
+static void processSaoCUE1_neon(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int width)
+{
+    int x = 0;
+    int8_t signDown;
+    int edgeType;
+
+    if (width >= 8)
+    {
+        int8x8_t tbl = *(int8x8_t *)offsetEo;
+        for (; (x + 8) <= width; x += 8)
+        {
+            uint8x8_t in0 = *(uint8x8_t *)&rec[x];
+            uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride];
+            int8x8_t vsignDown = sign_diff_neon(in0, in1);
+            int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&upBuff1[x]), vdup_n_s8(2));
+            *(int8x8_t *)&upBuff1[x] = vneg_s8(vsignDown);
+            int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
+            t1 = vaddw_u8(t1, in0);
+            *(uint8x8_t *)&rec[x] = vqmovun_s16(t1);
+        }
+    }
+    for (; x < width; x++)
+    {
+        signDown = signOf(rec[x] - rec[x + stride]);
+        edgeType = signDown + upBuff1[x] + 2;
+        upBuff1[x] = -signDown;
+        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+    }
+}
+
+static void processSaoCUE1_2Rows_neon(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int width)
+{
+    int y;
+    int8_t signDown;
+    int edgeType;
+
+    for (y = 0; y < 2; y++)
+    {
+        int x = 0;
+        if (width >= 8)
+        {
+            int8x8_t tbl = *(int8x8_t *)offsetEo;
+            for (; (x + 8) <= width; x += 8)
+            {
+                uint8x8_t in0 = *(uint8x8_t *)&rec[x];
+                uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride];
+                int8x8_t vsignDown = sign_diff_neon(in0, in1);
+                int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&upBuff1[x]), vdup_n_s8(2));
+                *(int8x8_t *)&upBuff1[x] = vneg_s8(vsignDown);
+                int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
+                t1 = vaddw_u8(t1, in0);
+                t1 = vmaxq_s16(t1, vdupq_n_s16(0));
+                t1 = vminq_s16(t1, vdupq_n_s16(255));
+                *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
+
+            }
+        }
+        for (; x < width; x++)
+        {
+            signDown = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown + upBuff1[x] + 2;
+            upBuff1[x] = -signDown;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
+    }
+}
+
+static void processSaoCUE2_neon(pixel *rec, int8_t *bufft, int8_t *buff1, int8_t *offsetEo, int width, intptr_t stride)
+{
+    int x;
+
+    if (abs(buff1 - bufft) < 16)
+    {
+        for (x = 0; x < width; x++)
+        {
+            int8_t signDown = signOf(rec[x] - rec[x + stride + 1]);
+            int edgeType = signDown + buff1[x] + 2;
+            bufft[x + 1] = -signDown;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+    }
+    else
+    {
+        int8x8_t tbl = *(int8x8_t *)offsetEo;
+        x = 0;
+        for (; (x + 8) <= width; x += 8)
+        {
+            uint8x8_t in0 = *(uint8x8_t *)&rec[x];
+            uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride + 1];
+            int8x8_t vsignDown = sign_diff_neon(in0, in1);
+            int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&buff1[x]), vdup_n_s8(2));
+            *(int8x8_t *)&bufft[x + 1] = vneg_s8(vsignDown);
+            int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
+            t1 = vaddw_u8(t1, in0);
+            t1 = vmaxq_s16(t1, vdupq_n_s16(0));
+            t1 = vminq_s16(t1, vdupq_n_s16(255));
+            *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
+        }
+        for (; x < width; x++)
+        {
+            int8_t signDown = signOf(rec[x] - rec[x + stride + 1]);
+            int edgeType = signDown + buff1[x] + 2;
+            bufft[x + 1] = -signDown;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+
+    }
+}
+
+
+static void processSaoCUE3_neon(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX)
+{
+    int8_t signDown;
+    int8_t edgeType;
+    int8x8_t tbl = *(int8x8_t *)offsetEo;
+
+    int x = startX + 1;
+    for (; (x + 8) <= endX; x += 8)
+    {
+        uint8x8_t in0 = *(uint8x8_t *)&rec[x];
+        uint8x8_t in1 = *(uint8x8_t *)&rec[x + stride];
+        int8x8_t vsignDown = sign_diff_neon(in0, in1);
+        int8x8_t vedgeType = vadd_s8(vadd_s8(vsignDown, *(int8x8_t *)&upBuff1[x]), vdup_n_s8(2));
+        *(int8x8_t *)&upBuff1[x - 1] = vneg_s8(vsignDown);
+        int16x8_t t1 = vmovl_s8(vtbl1_s8(tbl, vedgeType));
+        t1 = vaddw_u8(t1, in0);
+        t1 = vmaxq_s16(t1, vdupq_n_s16(0));
+        t1 = vminq_s16(t1, vdupq_n_s16(255));
+        *(uint8x8_t *)&rec[x] = vmovn_u16(t1);
+
+    }
+    for (; x < endX; x++)
+    {
+        signDown = signOf(rec[x] - rec[x + stride]);
+        edgeType = signDown + upBuff1[x] + 2;
+        upBuff1[x - 1] = -signDown;
+        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+    }
+}
+
+static void processSaoCUB0_neon(pixel *rec, const int8_t *offset, int ctuWidth, int ctuHeight, intptr_t stride)
+{
+#define SAO_BO_BITS 5
+    const int boShift = X265_DEPTH - SAO_BO_BITS;
+    int x, y;
+    int8x8x4_t table;
+    table = *(int8x8x4_t *)offset;
+
+    for (y = 0; y < ctuHeight; y++)
+    {
+
+        for (x = 0; (x + 8) <= ctuWidth; x += 8)
+        {
+            int8x8_t in = *(int8x8_t *)&rec[x];
+            int8x8_t offsets = vtbl4_s8(table, vshr_n_u8(in, boShift));
+            int16x8_t tmp = vmovl_s8(offsets);
+            tmp = vaddw_u8(tmp, in);
+            tmp = vmaxq_s16(tmp, vdupq_n_s16(0));
+            tmp = vminq_s16(tmp, vdupq_n_s16(255));
+            *(uint8x8_t *)&rec[x] = vmovn_u16(tmp);
+        }
+        for (; x < ctuWidth; x++)
+        {
+            rec[x] = x265_clip(rec[x] + offset[rec[x] >> boShift]);
+        }
+        rec += stride;
+    }
+}
+
+}
+
+
+
+namespace X265_NS
+{
+void setupLoopFilterPrimitives_neon(EncoderPrimitives &p)
+{
+    p.saoCuOrgE0 = processSaoCUE0_neon;
+    p.saoCuOrgE1 = processSaoCUE1_neon;
+    p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows_neon;
+    p.saoCuOrgE2[0] = processSaoCUE2_neon;
+    p.saoCuOrgE2[1] = processSaoCUE2_neon;
+    p.saoCuOrgE3[0] = processSaoCUE3_neon;
+    p.saoCuOrgE3[1] = processSaoCUE3_neon;
+    p.saoCuOrgB0 = processSaoCUB0_neon;
+    p.sign = calSign_neon;
+
+}
+
+
+#else //HIGH_BIT_DEPTH
+
+
+namespace X265_NS
+{
+void setupLoopFilterPrimitives_neon(EncoderPrimitives &)
+{
+}
+
+#endif
+
+
+}
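
One detail worth calling out in the file above: sign_diff_neon widens the two 8-bit rows, subtracts, and clamps every lane to [-1, 1], which yields the same -1/0/+1 classification as the scalar signOf used in the tail loops. The snippet below is a hypothetical, self-contained check of that equivalence; it is not part of x265, needs an AArch64/NEON target to build, and adds an explicit vreinterpret because assigning the uint16x8_t result of vsubl_u8 straight to int16x8_t relies on lax vector conversions.

    #include <arm_neon.h>
    #include <cassert>
    #include <cstdint>

    static inline int8_t signOf_ref(int x) { return (int8_t)((x > 0) - (x < 0)); }

    static inline int8x8_t sign_diff(uint8x8_t a, uint8x8_t b)
    {
        // Unsigned subtraction wraps, but reinterpreting as signed 16-bit
        // recovers the true difference because the inputs are only 8 bits wide.
        int16x8_t d = vreinterpretq_s16_u16(vsubl_u8(a, b));
        return vmovn_s16(vmaxq_s16(vminq_s16(d, vdupq_n_s16(1)), vdupq_n_s16(-1)));
    }

    int main()
    {
        const uint8_t a[8] = { 0, 10, 200, 7, 7, 255, 1, 128 };
        const uint8_t b[8] = { 5, 10, 100, 8, 6,   0, 1, 127 };
        int8_t out[8];
        vst1_s8(out, sign_diff(vld1_u8(a), vld1_u8(b)));
        for (int i = 0; i < 8; i++)
            assert(out[i] == signOf_ref((int)a[i] - (int)b[i]));
        return 0;
    }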
View file
x265_3.6.tar.gz/source/common/aarch64/loopfilter-prim.h
Added
@@ -0,0 +1,16 @@ +#ifndef _LOOPFILTER_NEON_H__ +#define _LOOPFILTER_NEON_H__ + +#include "common.h" +#include "primitives.h" + +#define PIXEL_MIN 0 + +namespace X265_NS +{ +void setupLoopFilterPrimitives_neon(EncoderPrimitives &p); + +}; + + +#endif
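
The header exposes a single setup hook. Purely as a hypothetical illustration of how a caller might use it (the real dispatch code in x265 is not part of this diff and may be organised differently):

    #include "primitives.h"
    #include "loopfilter-prim.h"

    namespace X265_NS {

    // Hypothetical wrapper: called once at initialisation, after the C
    // primitives are installed, so the NEON SAO routines overwrite only the
    // entries that loopfilter-prim.cpp actually provides.
    void installAarch64LoopFilter_example(EncoderPrimitives& p)
    {
        setupLoopFilterPrimitives_neon(p);
    }

    }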
View file
x265_3.6.tar.gz/source/common/aarch64/mc-a-common.S
Added
@@ -0,0 +1,48 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// This file contains the macros written using NEON instruction set +// that are also used by the SVE2 functions + +.arch armv8-a + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.macro addAvg_start + lsl x3, x3, #1 + lsl x4, x4, #1 + mov w11, #0x40 + dup v30.16b, w11 +.endm + +.macro addavg_1 v0, v1 + add \v0\().8h, \v0\().8h, \v1\().8h + saddl v16.4s, \v0\().4h, v30.4h + saddl2 v17.4s, \v0\().8h, v30.8h + shrn \v0\().4h, v16.4s, #7 + shrn2 \v0\().8h, v17.4s, #7 +.endm
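
A note on the constants in addAvg_start/addavg_1 above: v30 is filled with the byte 0x40, so read as 16-bit lanes each element is 0x4040 = 2 * 8192 + 64, i.e. twice the bias x265 keeps on its 16-bit intermediates plus the rounding term for the final shift. A scalar sketch of the arithmetic the macro implements for 8-bit output follows; the 8192 internal offset is an assumption about the intermediate format, and the function name and width/height parameters are illustrative.

    #include <cstdint>
    #include <cstddef>

    static inline uint8_t clipU8(int v) { return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); }

    // Scalar model of addAvg: combine two biased 16-bit predictions into pixels.
    void addAvg_ref(const int16_t* src0, const int16_t* src1, uint8_t* dst,
                    intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                    int width, int height)
    {
        const int shift  = 7;                               // matches the #7 in addavg_1
        const int offset = 2 * 8192 + (1 << (shift - 1));   // the 0x4040 loaded into v30
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
                dst[x] = clipU8((src0[x] + src1[x] + offset) >> shift);
            src0 += src0Stride;
            src1 += src1Stride;
            dst  += dstStride;
        }
    }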
View file
x265_3.6.tar.gz/source/common/aarch64/mc-a-sve2.S
Added
@@ -0,0 +1,924 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "mc-a-common.S" + +.arch armv8-a+sve2 + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +function PFX(pixel_avg_pp_12x16_sve2) + sub x1, x1, #4 + sub x3, x3, #4 + sub x5, x5, #4 + ptrue p0.s, vl1 + ptrue p1.b, vl8 + mov x11, #4 +.rept 16 + ld1w {z0.s}, p0/z, x2 + ld1b {z1.b}, p1/z, x2, x11 + ld1w {z2.s}, p0/z, x4 + ld1b {z3.b}, p1/z, x4, x11 + add x2, x2, #4 + add x2, x2, x3 + add x4, x4, #4 + add x4, x4, x5 + urhadd z0.b, p1/m, z0.b, z2.b + urhadd z1.b, p1/m, z1.b, z3.b + st1b {z0.b}, p1, x0 + st1b {z1.b}, p1, x0, x11 + add x0, x0, #4 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_avg_pp_24x32_sve2) + mov w12, #4 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_avg_pp_24x32 + sub x1, x1, #16 + sub x3, x3, #16 + sub x5, x5, #16 +.lpavg_24x32_sve2: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b}, x2, #16 + ld1 {v1.8b}, x2, x3 + ld1 {v2.16b}, x4, #16 + ld1 {v3.8b}, x4, x5 + urhadd v0.16b, v0.16b, v2.16b + urhadd v1.8b, v1.8b, v3.8b + st1 {v0.16b}, x0, #16 + st1 {v1.8b}, x0, x1 +.endr + cbnz w12, .lpavg_24x32_sve2 + ret +.vl_gt_16_pixel_avg_pp_24x32: + mov x10, #24 + mov x11, #0 + whilelt p0.b, x11, x10 +.vl_gt_16_loop_pixel_avg_pp_24x32: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x4 + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + st1b {z0.b}, p0, x0 + add x0, x0, x1 +.endr + cbnz w12, .vl_gt_16_loop_pixel_avg_pp_24x32 + ret +endfunc + +.macro pixel_avg_pp_32xN_sve2 h +function PFX(pixel_avg_pp_32x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_avg_pp_32_\h +.rept \h + ld1 {v0.16b-v1.16b}, x2, x3 + ld1 {v2.16b-v3.16b}, x4, x5 + urhadd v0.16b, v0.16b, v2.16b + urhadd v1.16b, v1.16b, v3.16b + st1 {v0.16b-v1.16b}, x0, x1 +.endr + ret +.vl_gt_16_pixel_avg_pp_32_\h: + ptrue p0.b, vl32 +.rept \h + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x4 + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + st1b {z0.b}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc +.endm + +pixel_avg_pp_32xN_sve2 8 +pixel_avg_pp_32xN_sve2 16 +pixel_avg_pp_32xN_sve2 24 + +.macro pixel_avg_pp_32xN1_sve2 h +function PFX(pixel_avg_pp_32x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_avg_pp_32xN1_\h + mov w12, #\h / 8 +.lpavg_sve2_32x\h\(): + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v1.16b}, x2, x3 + ld1 
{v2.16b-v3.16b}, x4, x5 + urhadd v0.16b, v0.16b, v2.16b + urhadd v1.16b, v1.16b, v3.16b + st1 {v0.16b-v1.16b}, x0, x1 +.endr + cbnz w12, .lpavg_sve2_32x\h + ret +.vl_gt_16_pixel_avg_pp_32xN1_\h: + ptrue p0.b, vl32 + mov w12, #\h / 8 +.eq_32_loop_pixel_avg_pp_32xN1_\h\(): + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x4 + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + st1b {z0.b}, p0, x0 + add x0, x0, x1 +.endr + cbnz w12, .eq_32_loop_pixel_avg_pp_32xN1_\h + ret +endfunc +.endm + +pixel_avg_pp_32xN1_sve2 32 +pixel_avg_pp_32xN1_sve2 64 + +function PFX(pixel_avg_pp_48x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_avg_pp_48x64 + mov w12, #8 +.lpavg_48x64_sve2: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v2.16b}, x2, x3 + ld1 {v3.16b-v5.16b}, x4, x5 + urhadd v0.16b, v0.16b, v3.16b + urhadd v1.16b, v1.16b, v4.16b + urhadd v2.16b, v2.16b, v5.16b + st1 {v0.16b-v2.16b}, x0, x1 +.endr + cbnz w12, .lpavg_48x64_sve2 + ret +.vl_gt_16_pixel_avg_pp_48x64: + cmp x9, #32 + bgt .vl_gt_32_pixel_avg_pp_48x64 + ptrue p0.b, vl32 + ptrue p1.b, vl16 + mov w12, #8 +.vl_eq_32_pixel_avg_pp_48x64: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, p1/z, x2, #1, mul vl + ld1b {z2.b}, p0/z, x4 + ld1b {z3.b}, p1/z, x4, #1, mul vl + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + urhadd z1.b, p1/m, z1.b, z3.b + st1b {z0.b}, p0, x0 + st1b {z1.b}, p1, x0, #1, mul vl + add x0, x0, x1 +.endr + cbnz w12, .vl_eq_32_pixel_avg_pp_48x64 + ret +.vl_gt_32_pixel_avg_pp_48x64: + mov x10, #48 + mov x11, #0 + whilelt p0.b, x11, x10 + mov w12, #8 +.loop_gt_32_pixel_avg_pp_48x64: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x4 + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + st1b {z0.b}, p0, x0 + add x0, x0, x1 +.endr + cbnz w12, .loop_gt_32_pixel_avg_pp_48x64 + ret +endfunc + +.macro pixel_avg_pp_64xN_sve2 h +function PFX(pixel_avg_pp_64x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_avg_pp_64x\h + mov w12, #\h / 4 +.lpavg_sve2_64x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v3.16b}, x2, x3 + ld1 {v4.16b-v7.16b}, x4, x5 + urhadd v0.16b, v0.16b, v4.16b + urhadd v1.16b, v1.16b, v5.16b + urhadd v2.16b, v2.16b, v6.16b + urhadd v3.16b, v3.16b, v7.16b + st1 {v0.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .lpavg_sve2_64x\h + ret +.vl_gt_16_pixel_avg_pp_64x\h\(): + cmp x9, #48 + bgt .vl_gt_48_pixel_avg_pp_64x\h + ptrue p0.b, vl32 + mov w12, #\h / 4 +.vl_eq_32_pixel_avg_pp_64x\h\(): + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, p0/z, x2, #1, mul vl + ld1b {z2.b}, p0/z, x4 + ld1b {z3.b}, p0/z, x4, #1, mul vl + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + urhadd z1.b, p0/m, z1.b, z3.b + st1b {z0.b}, p0, x0 + st1b {z1.b}, p0, x0, #1, mul vl + add x0, x0, x1 +.endr + cbnz w12, .vl_eq_32_pixel_avg_pp_64x\h + ret +.vl_gt_48_pixel_avg_pp_64x\h\(): + ptrue p0.b, vl64 + mov w12, #\h / 4 +.vl_eq_64_pixel_avg_pp_64x\h\(): + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x4 + add x2, x2, x3 + add x4, x4, x5 + urhadd z0.b, p0/m, z0.b, z2.b + st1b {z0.b}, p0, x0 + add x0, x0, x1 +.endr + cbnz w12, .vl_eq_64_pixel_avg_pp_64x\h + ret +endfunc +.endm + +pixel_avg_pp_64xN_sve2 16 +pixel_avg_pp_64xN_sve2 32 +pixel_avg_pp_64xN_sve2 48 +pixel_avg_pp_64xN_sve2 64 + +// void addAvg(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) + +.macro addAvg_2xN_sve2 h +function 
PFX(addAvg_2x\h\()_sve2) + ptrue p0.s, vl2 + ptrue p1.h, vl4 + ptrue p2.h, vl2 +.rept \h / 2 + ld1rw {z0.s}, p0/z, x0 + ld1rw {z1.s}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + ld1rw {z2.s}, p0/z, x0 + ld1rw {z3.s}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p1/m, z0.h, z1.h + add z2.h, p1/m, z2.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p2, x2 + add x2, x2, x5 + st1b {z2.h}, p2, x2 + add x2, x2, x5 +.endr + ret +endfunc +.endm + +addAvg_2xN_sve2 4 +addAvg_2xN_sve2 8 +addAvg_2xN_sve2 16 + +.macro addAvg_6xN_sve2 h +function PFX(addAvg_6x\h\()_sve2) + mov w12, #\h / 2 + ptrue p0.b, vl16 + ptrue p2.h, vl6 +.loop_sve2_addavg_6x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + ld1b {z2.b}, p0/z, x0 + ld1b {z3.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + add z2.h, p0/m, z2.h, z3.h + sqrshrnb z0.b, z0.h, #7 + sqrshrnb z2.b, z2.h, #7 + add z0.b, z0.b, #0x80 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p2, x2 + add x2, x2, x5 + st1b {z2.h}, p2, x2 + add x2, x2, x5 + cbnz w12, .loop_sve2_addavg_6x\h + ret +endfunc +.endm + +addAvg_6xN_sve2 8 +addAvg_6xN_sve2 16 + +.macro addAvg_8xN_sve2 h +function PFX(addAvg_8x\h\()_sve2) + ptrue p0.b, vl16 +.rept \h / 2 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + ld1b {z2.b}, p0/z, x0 + ld1b {z3.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + add z2.h, p0/m, z2.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + st1b {z2.h}, p0, x2 + add x2, x2, x5 +.endr + ret +endfunc +.endm + +.macro addAvg_8xN1_sve2 h +function PFX(addAvg_8x\h\()_sve2) + mov w12, #\h / 2 + ptrue p0.b, vl16 +.loop_sve2_addavg_8x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + ld1b {z2.b}, p0/z, x0 + ld1b {z3.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + add z2.h, p0/m, z2.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + st1b {z2.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_sve2_addavg_8x\h + ret +endfunc +.endm + +addAvg_8xN_sve2 2 +addAvg_8xN_sve2 4 +addAvg_8xN_sve2 6 +addAvg_8xN_sve2 8 +addAvg_8xN_sve2 12 +addAvg_8xN_sve2 16 +addAvg_8xN1_sve2 32 +addAvg_8xN1_sve2 64 + +.macro addAvg_12xN_sve2 h +function PFX(addAvg_12x\h\()_sve2) + mov w12, #\h + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_addAvg_12x\h + ptrue p0.b, vl16 + ptrue p1.b, vl8 +.loop_sve2_addavg_12x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + ld1b {z2.b}, p1/z, x0, #1, mul vl + ld1b {z3.b}, p1/z, x1, #1, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + add z2.h, p1/m, z2.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z2.h}, p1, x2, #1, mul vl + add x2, x2, x5 + cbnz w12, .loop_sve2_addavg_12x\h + ret +.vl_gt_16_addAvg_12x\h\(): + mov x10, #24 + mov x11, #0 + whilelt p0.b, x11, x10 +.loop_sve2_gt_16_addavg_12x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, 
x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_sve2_gt_16_addavg_12x\h + ret +endfunc +.endm + +addAvg_12xN_sve2 16 +addAvg_12xN_sve2 32 + +.macro addAvg_16xN_sve2 h +function PFX(addAvg_16x\h\()_sve2) + mov w12, #\h + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_addAvg_16x\h + ptrue p0.b, vl16 +.loop_eq_16_sve2_addavg_16x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + ld1b {z2.b}, p0/z, x0, #1, mul vl + ld1b {z3.b}, p0/z, x1, #1, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + add z2.h, p0/m, z2.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z2.h}, p0, x2, #1, mul vl + add x2, x2, x5 + cbnz w12, .loop_eq_16_sve2_addavg_16x\h + ret +.vl_gt_16_addAvg_16x\h\(): + cmp x9, #32 + bgt .vl_gt_32_addAvg_16x\h + ptrue p0.b, vl32 +.loop_gt_16_sve2_addavg_16x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + st1b {z0.h}, p1, x2 + add x2, x2, x5 + cbnz w12, .loop_gt_16_sve2_addavg_16x\h + ret +.vl_gt_32_addAvg_16x\h\(): + mov x10, #48 + mov x11, #0 + whilelt p0.b, x11, x10 +.loop_gt_32_sve2_addavg_16x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_gt_32_sve2_addavg_16x\h + ret +endfunc +.endm + +addAvg_16xN_sve2 4 +addAvg_16xN_sve2 8 +addAvg_16xN_sve2 12 +addAvg_16xN_sve2 16 +addAvg_16xN_sve2 24 +addAvg_16xN_sve2 32 +addAvg_16xN_sve2 64 + +.macro addAvg_24xN_sve2 h +function PFX(addAvg_24x\h\()_sve2) + mov w12, #\h + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_addAvg_24x\h + addAvg_start +.loop_eq_16_sve2_addavg_24x\h\(): + sub w12, w12, #1 + ld1 {v0.16b-v2.16b}, x0, x3 + ld1 {v3.16b-v5.16b}, x1, x4 + addavg_1 v0, v3 + addavg_1 v1, v4 + addavg_1 v2, v5 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + sqxtun v2.8b, v2.8h + st1 {v0.8b-v2.8b}, x2, x5 + cbnz w12, .loop_eq_16_sve2_addavg_24x\h + ret +.vl_gt_16_addAvg_24x\h\(): + cmp x9, #48 + bgt .vl_gt_48_addAvg_24x\h + ptrue p0.b, vl32 + ptrue p1.b, vl16 +.loop_gt_16_sve2_addavg_24x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p1/z, x0, #1, mul vl + ld1b {z2.b}, p0/z, x1 + ld1b {z3.b}, p1/z, x1, #1, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z2.h + add z1.h, p1/m, z1.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z1.b, z1.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p1, x2, #1, mul vl + add x2, x2, x5 + cbnz w12, .loop_gt_16_sve2_addavg_24x\h + ret +.vl_gt_48_addAvg_24x\h\(): + mov x10, #48 + mov x11, #0 + whilelt p0.b, x11, x10 +.loop_gt_48_sve2_addavg_24x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z2.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z2.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_gt_48_sve2_addavg_24x\h + ret +endfunc +.endm + +addAvg_24xN_sve2 32 +addAvg_24xN_sve2 64 + +.macro addAvg_32xN_sve2 h +function PFX(addAvg_32x\h\()_sve2) + mov w12, #\h + rdvl x9, #1 + cmp x9, #16 + bgt 
.vl_gt_16_addAvg_32x\h + ptrue p0.b, vl16 +.loop_eq_16_sve2_addavg_32x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x0, #1, mul vl + ld1b {z2.b}, p0/z, x0, #2, mul vl + ld1b {z3.b}, p0/z, x0, #3, mul vl + ld1b {z4.b}, p0/z, x1 + ld1b {z5.b}, p0/z, x1, #1, mul vl + ld1b {z6.b}, p0/z, x1, #2, mul vl + ld1b {z7.b}, p0/z, x1, #3, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + add z1.h, p0/m, z1.h, z5.h + add z2.h, p0/m, z2.h, z6.h + add z3.h, p0/m, z3.h, z7.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z1.b, z1.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + sqrshrnb z3.b, z3.h, #7 + add z3.b, z3.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p0, x2, #1, mul vl + st1b {z2.h}, p0, x2, #2, mul vl + st1b {z3.h}, p0, x2, #3, mul vl + add x2, x2, x5 + cbnz w12, .loop_eq_16_sve2_addavg_32x\h + ret +.vl_gt_16_addAvg_32x\h\(): + cmp x9, #48 + bgt .vl_gt_48_addAvg_32x\h + ptrue p0.b, vl32 +.loop_gt_eq_32_sve2_addavg_32x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x0, #1, mul vl + ld1b {z2.b}, p0/z, x1 + ld1b {z3.b}, p0/z, x1, #1, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z2.h + add z1.h, p0/m, z1.h, z3.h + sqrshrnb z0.b, z0.h, #7 + add z1.b, z1.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z0.b, z0.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p0, x2, #1, mul vl + add x2, x2, x5 + cbnz w12, .loop_gt_eq_32_sve2_addavg_32x\h + ret +.vl_gt_48_addAvg_32x\h\(): + ptrue p0.b, vl64 +.loop_eq_64_sve2_addavg_32x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z1.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_eq_64_sve2_addavg_32x\h + ret +endfunc +.endm + +addAvg_32xN_sve2 8 +addAvg_32xN_sve2 16 +addAvg_32xN_sve2 24 +addAvg_32xN_sve2 32 +addAvg_32xN_sve2 48 +addAvg_32xN_sve2 64 + +function PFX(addAvg_48x64_sve2) + mov w12, #64 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_addAvg_48x64 + addAvg_start + sub x3, x3, #64 + sub x4, x4, #64 +.loop_eq_16_sve2_addavg_48x64: + sub w12, w12, #1 + ld1 {v0.8h-v3.8h}, x0, #64 + ld1 {v4.8h-v7.8h}, x1, #64 + ld1 {v20.8h-v21.8h}, x0, x3 + ld1 {v22.8h-v23.8h}, x1, x4 + addavg_1 v0, v4 + addavg_1 v1, v5 + addavg_1 v2, v6 + addavg_1 v3, v7 + addavg_1 v20, v22 + addavg_1 v21, v23 + sqxtun v0.8b, v0.8h + sqxtun2 v0.16b, v1.8h + sqxtun v1.8b, v2.8h + sqxtun2 v1.16b, v3.8h + sqxtun v2.8b, v20.8h + sqxtun2 v2.16b, v21.8h + st1 {v0.16b-v2.16b}, x2, x5 + cbnz w12, .loop_eq_16_sve2_addavg_48x64 + ret +.vl_gt_16_addAvg_48x64: + cmp x9, #48 + bgt .vl_gt_48_addAvg_48x64 + ptrue p0.b, vl32 +.loop_gt_eq_32_sve2_addavg_48x64: + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x0, #1, mul vl + ld1b {z2.b}, p0/z, x0, #2, mul vl + ld1b {z4.b}, p0/z, x1 + ld1b {z5.b}, p0/z, x1, #1, mul vl + ld1b {z6.b}, p0/z, x1, #2, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + add z1.h, p0/m, z1.h, z5.h + add z2.h, p0/m, z2.h, z6.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z1.b, z1.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p0, x2, #1, mul vl + st1b {z2.h}, p0, x2, #2, mul vl + add x2, x2, x5 + cbnz w12, .loop_gt_eq_32_sve2_addavg_48x64 + ret +.vl_gt_48_addAvg_48x64: + cmp x9, #112 + bgt .vl_gt_112_addAvg_48x64 + ptrue p0.b, vl64 + ptrue 
p1.b, vl32 +.loop_gt_48_sve2_addavg_48x64: + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p1/z, x0, #1, mul vl + ld1b {z4.b}, p0/z, x1 + ld1b {z5.b}, p1/z, x1, #1, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + add z1.h, p1/m, z1.h, z5.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z1.b, z1.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p1, x2, #1, mul vl + add x2, x2, x5 + cbnz w12, .loop_gt_48_sve2_addavg_48x64 + ret +.vl_gt_112_addAvg_48x64: + mov x10, #96 + mov x11, #0 + whilelt p0.b, x11, x10 +.loop_gt_112_sve2_addavg_48x64: + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z4.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + st1b {z0.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_gt_112_sve2_addavg_48x64 + ret +endfunc + +.macro addAvg_64xN_sve2 h +function PFX(addAvg_64x\h\()_sve2) + mov w12, #\h + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_addAvg_64x\h + addAvg_start + sub x3, x3, #64 + sub x4, x4, #64 +.loop_eq_16_sve2_addavg_64x\h\(): + sub w12, w12, #1 + ld1 {v0.8h-v3.8h}, x0, #64 + ld1 {v4.8h-v7.8h}, x1, #64 + ld1 {v20.8h-v23.8h}, x0, x3 + ld1 {v24.8h-v27.8h}, x1, x4 + addavg_1 v0, v4 + addavg_1 v1, v5 + addavg_1 v2, v6 + addavg_1 v3, v7 + addavg_1 v20, v24 + addavg_1 v21, v25 + addavg_1 v22, v26 + addavg_1 v23, v27 + sqxtun v0.8b, v0.8h + sqxtun2 v0.16b, v1.8h + sqxtun v1.8b, v2.8h + sqxtun2 v1.16b, v3.8h + sqxtun v2.8b, v20.8h + sqxtun2 v2.16b, v21.8h + sqxtun v3.8b, v22.8h + sqxtun2 v3.16b, v23.8h + st1 {v0.16b-v3.16b}, x2, x5 + cbnz w12, .loop_eq_16_sve2_addavg_64x\h + ret +.vl_gt_16_addAvg_64x\h\(): + cmp x9, #48 + bgt .vl_gt_48_addAvg_64x\h + ptrue p0.b, vl32 +.loop_gt_eq_32_sve2_addavg_64x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x0, #1, mul vl + ld1b {z2.b}, p0/z, x0, #2, mul vl + ld1b {z3.b}, p0/z, x0, #3, mul vl + ld1b {z4.b}, p0/z, x1 + ld1b {z5.b}, p0/z, x1, #1, mul vl + ld1b {z6.b}, p0/z, x1, #2, mul vl + ld1b {z7.b}, p0/z, x1, #3, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + add z1.h, p0/m, z1.h, z5.h + add z2.h, p0/m, z2.h, z6.h + add z3.h, p0/m, z3.h, z7.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z1.b, z1.b, #0x80 + sqrshrnb z2.b, z2.h, #7 + add z2.b, z2.b, #0x80 + sqrshrnb z3.b, z3.h, #7 + add z3.b, z3.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p0, x2, #1, mul vl + st1b {z2.h}, p0, x2, #2, mul vl + st1b {z3.h}, p0, x2, #3, mul vl + add x2, x2, x5 + cbnz w12, .loop_gt_eq_32_sve2_addavg_64x\h + ret +.vl_gt_48_addAvg_64x\h\(): + cmp x9, #112 + bgt .vl_gt_112_addAvg_64x\h + ptrue p0.b, vl64 +.loop_gt_eq_48_sve2_addavg_64x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x0, #1, mul vl + ld1b {z4.b}, p0/z, x1 + ld1b {z5.b}, p0/z, x1, #1, mul vl + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + add z1.h, p0/m, z1.h, z5.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + sqrshrnb z1.b, z1.h, #7 + add z1.b, z1.b, #0x80 + st1b {z0.h}, p0, x2 + st1b {z1.h}, p0, x2, #1, mul vl + add x2, x2, x5 + cbnz w12, .loop_gt_eq_48_sve2_addavg_64x\h + ret +.vl_gt_112_addAvg_64x\h\(): + ptrue p0.b, vl128 +.loop_gt_eq_128_sve2_addavg_64x\h\(): + sub w12, w12, #1 + ld1b {z0.b}, p0/z, x0 + ld1b {z4.b}, p0/z, x1 + add x0, x0, x3, lsl #1 + add x1, x1, x4, lsl #1 + add z0.h, p0/m, z0.h, z4.h + sqrshrnb z0.b, z0.h, #7 + add z0.b, z0.b, #0x80 + 
st1b {z0.h}, p0, x2 + add x2, x2, x5 + cbnz w12, .loop_gt_eq_128_sve2_addavg_64x\h + ret +endfunc +.endm + +addAvg_64xN_sve2 16 +addAvg_64xN_sve2 32 +addAvg_64xN_sve2 48 +addAvg_64xN_sve2 64
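Note: every addAvg_*_sve2 variant above computes the same per-element result; the SVE2 branches only pick a load/store width from the runtime vector length (rdvl). A minimal scalar sketch of that result for the 8-bit case, where the #7 rounding shift and the final +0x80 re-bias in the assembly apply; the function name, loop bounds and the 8192 bias constant are written out for illustration and are not the upstream C primitive verbatim:

// Scalar model of the addAvg kernels: average two 14-bit weighted-prediction
// intermediates (each already carrying a -8192 bias) back into 8-bit pixels.
// Illustrative sketch only; names and constants are assumptions, not x265's
// own C reference.
#include <algorithm>
#include <cstdint>

static void addAvg_scalar(const int16_t* src0, const int16_t* src1, uint8_t* dst,
                          intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                          int bx, int by)
{
    const int shift  = 7;                              // internal precision (14) + 1 - bit depth (8)
    const int offset = (1 << (shift - 1)) + 2 * 8192;  // rounding term + removal of the two -8192 biases

    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
        {
            int v = (src0[x] + src1[x] + offset) >> shift;
            dst[x] = (uint8_t)std::min(255, std::max(0, v));
        }
        src0 += src0Stride;
        src1 += src1Stride;
        dst  += dstStride;
    }
}

The assembly reaches the same value by rounding-shifting the biased sum (sqrshrnb #7, which also saturates) and then adding 0x80, rather than adding the full offset up front.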
View file
x265_3.5.tar.gz/source/common/aarch64/mc-a.S -> x265_3.6.tar.gz/source/common/aarch64/mc-a.S
Changed
@@ -1,7 +1,8 @@ /***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc + * Copyright (C) 2020-2021 MulticoreWare, Inc * * Authors: Hongbin Liu <liuhongbin1@huawei.com> + * Sebastian Pop <spop@amazon.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -22,15 +23,20 @@ *****************************************************************************/ #include "asm.S" +#include "mc-a-common.S" +#ifdef __APPLE__ +.section __RODATA,__rodata +#else .section .rodata +#endif .align 4 .text .macro pixel_avg_pp_4xN_neon h -function x265_pixel_avg_pp_4x\h\()_neon +function PFX(pixel_avg_pp_4x\h\()_neon) .rept \h ld1 {v0.s}0, x2, x3 ld1 {v1.s}0, x4, x5 @@ -46,7 +52,7 @@ pixel_avg_pp_4xN_neon 16 .macro pixel_avg_pp_8xN_neon h -function x265_pixel_avg_pp_8x\h\()_neon +function PFX(pixel_avg_pp_8x\h\()_neon) .rept \h ld1 {v0.8b}, x2, x3 ld1 {v1.8b}, x4, x5 @@ -61,3 +67,491 @@ pixel_avg_pp_8xN_neon 8 pixel_avg_pp_8xN_neon 16 pixel_avg_pp_8xN_neon 32 + +function PFX(pixel_avg_pp_12x16_neon) + sub x1, x1, #4 + sub x3, x3, #4 + sub x5, x5, #4 +.rept 16 + ld1 {v0.s}0, x2, #4 + ld1 {v1.8b}, x2, x3 + ld1 {v2.s}0, x4, #4 + ld1 {v3.8b}, x4, x5 + urhadd v4.8b, v0.8b, v2.8b + urhadd v5.8b, v1.8b, v3.8b + st1 {v4.s}0, x0, #4 + st1 {v5.8b}, x0, x1 +.endr + ret +endfunc + +.macro pixel_avg_pp_16xN_neon h +function PFX(pixel_avg_pp_16x\h\()_neon) +.rept \h + ld1 {v0.16b}, x2, x3 + ld1 {v1.16b}, x4, x5 + urhadd v2.16b, v0.16b, v1.16b + st1 {v2.16b}, x0, x1 +.endr + ret +endfunc +.endm + +pixel_avg_pp_16xN_neon 4 +pixel_avg_pp_16xN_neon 8 +pixel_avg_pp_16xN_neon 12 +pixel_avg_pp_16xN_neon 16 +pixel_avg_pp_16xN_neon 32 + +function PFX(pixel_avg_pp_16x64_neon) + mov w12, #8 +.lpavg_16x64: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b}, x2, x3 + ld1 {v1.16b}, x4, x5 + urhadd v2.16b, v0.16b, v1.16b + st1 {v2.16b}, x0, x1 +.endr + cbnz w12, .lpavg_16x64 + ret +endfunc + +function PFX(pixel_avg_pp_24x32_neon) + sub x1, x1, #16 + sub x3, x3, #16 + sub x5, x5, #16 + mov w12, #4 +.lpavg_24x32: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b}, x2, #16 + ld1 {v1.8b}, x2, x3 + ld1 {v2.16b}, x4, #16 + ld1 {v3.8b}, x4, x5 + urhadd v0.16b, v0.16b, v2.16b + urhadd v1.8b, v1.8b, v3.8b + st1 {v0.16b}, x0, #16 + st1 {v1.8b}, x0, x1 +.endr + cbnz w12, .lpavg_24x32 + ret +endfunc + +.macro pixel_avg_pp_32xN_neon h +function PFX(pixel_avg_pp_32x\h\()_neon) +.rept \h + ld1 {v0.16b-v1.16b}, x2, x3 + ld1 {v2.16b-v3.16b}, x4, x5 + urhadd v0.16b, v0.16b, v2.16b + urhadd v1.16b, v1.16b, v3.16b + st1 {v0.16b-v1.16b}, x0, x1 +.endr + ret +endfunc +.endm + +pixel_avg_pp_32xN_neon 8 +pixel_avg_pp_32xN_neon 16 +pixel_avg_pp_32xN_neon 24 + +.macro pixel_avg_pp_32xN1_neon h +function PFX(pixel_avg_pp_32x\h\()_neon) + mov w12, #\h / 8 +.lpavg_32x\h\(): + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v1.16b}, x2, x3 + ld1 {v2.16b-v3.16b}, x4, x5 + urhadd v0.16b, v0.16b, v2.16b + urhadd v1.16b, v1.16b, v3.16b + st1 {v0.16b-v1.16b}, x0, x1 +.endr + cbnz w12, .lpavg_32x\h + ret +endfunc +.endm + +pixel_avg_pp_32xN1_neon 32 +pixel_avg_pp_32xN1_neon 64 + +function PFX(pixel_avg_pp_48x64_neon) + mov w12, #8 +.lpavg_48x64: + sub w12, w12, #1 +.rept 8 + ld1 {v0.16b-v2.16b}, x2, x3 + ld1 {v3.16b-v5.16b}, x4, x5 + urhadd v0.16b, v0.16b, v3.16b + urhadd v1.16b, v1.16b, v4.16b + urhadd v2.16b, v2.16b, v5.16b + st1 {v0.16b-v2.16b}, x0, x1 +.endr + cbnz w12, .lpavg_48x64 + ret +endfunc + +.macro pixel_avg_pp_64xN_neon h 
+function PFX(pixel_avg_pp_64x\h\()_neon) + mov w12, #\h / 4 +.lpavg_64x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v3.16b}, x2, x3 + ld1 {v4.16b-v7.16b}, x4, x5 + urhadd v0.16b, v0.16b, v4.16b + urhadd v1.16b, v1.16b, v5.16b + urhadd v2.16b, v2.16b, v6.16b + urhadd v3.16b, v3.16b, v7.16b + st1 {v0.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .lpavg_64x\h + ret +endfunc +.endm + +pixel_avg_pp_64xN_neon 16 +pixel_avg_pp_64xN_neon 32 +pixel_avg_pp_64xN_neon 48 +pixel_avg_pp_64xN_neon 64 + +// void addAvg(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) +.macro addAvg_2xN h +function PFX(addAvg_2x\h\()_neon) + addAvg_start +.rept \h / 2 + ldr w10, x0 + ldr w11, x1 + add x0, x0, x3 + add x1, x1, x4 + ldr w12, x0 + ldr w13, x1 + add x0, x0, x3 + add x1, x1, x4 + dup v0.2s, w10 + dup v1.2s, w11 + dup v2.2s, w12 + dup v3.2s, w13 + add v0.4h, v0.4h, v1.4h + add v2.4h, v2.4h, v3.4h + saddl v0.4s, v0.4h, v30.4h + saddl v2.4s, v2.4h, v30.4h + shrn v0.4h, v0.4s, #7 + shrn2 v0.8h, v2.4s, #7 + sqxtun v0.8b, v0.8h + st1 {v0.h}0, x2, x5 + st1 {v0.h}2, x2, x5 +.endr + ret +endfunc +.endm + +addAvg_2xN 4 +addAvg_2xN 8 +addAvg_2xN 16 + +.macro addAvg_4xN h +function PFX(addAvg_4x\h\()_neon) + addAvg_start +.rept \h / 2 + ld1 {v0.8b}, x0, x3 + ld1 {v1.8b}, x1, x4 + ld1 {v2.8b}, x0, x3 + ld1 {v3.8b}, x1, x4 + add v0.4h, v0.4h, v1.4h + add v2.4h, v2.4h, v3.4h + saddl v0.4s, v0.4h, v30.4h + saddl v2.4s, v2.4h, v30.4h + shrn v0.4h, v0.4s, #7 + shrn2 v0.8h, v2.4s, #7 + sqxtun v0.8b, v0.8h + st1 {v0.s}0, x2, x5 + st1 {v0.s}1, x2, x5 +.endr + ret +endfunc +.endm + +addAvg_4xN 2 +addAvg_4xN 4 +addAvg_4xN 8 +addAvg_4xN 16 +addAvg_4xN 32 + +.macro addAvg_6xN h +function PFX(addAvg_6x\h\()_neon) + addAvg_start + mov w12, #\h / 2 + sub x5, x5, #4 +.loop_addavg_6x\h: + sub w12, w12, #1 + ld1 {v0.16b}, x0, x3 + ld1 {v1.16b}, x1, x4 + ld1 {v2.16b}, x0, x3 + ld1 {v3.16b}, x1, x4 + add v0.8h, v0.8h, v1.8h + add v2.8h, v2.8h, v3.8h + saddl v16.4s, v0.4h, v30.4h + saddl2 v17.4s, v0.8h, v30.8h + saddl v18.4s, v2.4h, v30.4h + saddl2 v19.4s, v2.8h, v30.8h + shrn v0.4h, v16.4s, #7 + shrn2 v0.8h, v17.4s, #7 + shrn v1.4h, v18.4s, #7 + shrn2 v1.8h, v19.4s, #7 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + str s0, x2, #4 + st1 {v0.h}2, x2, x5 + str s1, x2, #4 + st1 {v1.h}2, x2, x5 + cbnz w12, .loop_addavg_6x\h + ret +endfunc +.endm + +addAvg_6xN 8 +addAvg_6xN 16 + +.macro addAvg_8xN h +function PFX(addAvg_8x\h\()_neon) + addAvg_start +.rept \h / 2 + ld1 {v0.16b}, x0, x3 + ld1 {v1.16b}, x1, x4 + ld1 {v2.16b}, x0, x3 + ld1 {v3.16b}, x1, x4 + add v0.8h, v0.8h, v1.8h + add v2.8h, v2.8h, v3.8h + saddl v16.4s, v0.4h, v30.4h + saddl2 v17.4s, v0.8h, v30.8h + saddl v18.4s, v2.4h, v30.4h + saddl2 v19.4s, v2.8h, v30.8h + shrn v0.4h, v16.4s, #7 + shrn2 v0.8h, v17.4s, #7 + shrn v1.4h, v18.4s, #7 + shrn2 v1.8h, v19.4s, #7 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + st1 {v0.8b}, x2, x5 + st1 {v1.8b}, x2, x5 +.endr + ret +endfunc +.endm + +.macro addAvg_8xN1 h +function PFX(addAvg_8x\h\()_neon) + addAvg_start + mov w12, #\h / 2 +.loop_addavg_8x\h: + sub w12, w12, #1 + ld1 {v0.16b}, x0, x3 + ld1 {v1.16b}, x1, x4 + ld1 {v2.16b}, x0, x3 + ld1 {v3.16b}, x1, x4 + add v0.8h, v0.8h, v1.8h + add v2.8h, v2.8h, v3.8h + saddl v16.4s, v0.4h, v30.4h + saddl2 v17.4s, v0.8h, v30.8h + saddl v18.4s, v2.4h, v30.4h + saddl2 v19.4s, v2.8h, v30.8h + shrn v0.4h, v16.4s, #7 + shrn2 v0.8h, v17.4s, #7 + shrn v1.4h, v18.4s, #7 + shrn2 v1.8h, v19.4s, #7 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + st1 {v0.8b}, 
x2, x5 + st1 {v1.8b}, x2, x5 + cbnz w12, .loop_addavg_8x\h + ret +endfunc +.endm + +addAvg_8xN 2 +addAvg_8xN 4 +addAvg_8xN 6 +addAvg_8xN 8 +addAvg_8xN 12 +addAvg_8xN 16 +addAvg_8xN1 32 +addAvg_8xN1 64 + +.macro addAvg_12xN h +function PFX(addAvg_12x\h\()_neon) + addAvg_start + sub x3, x3, #16 + sub x4, x4, #16 + sub x5, x5, #8 + mov w12, #\h +.loop_addAvg_12X\h\(): + sub w12, w12, #1 + ld1 {v0.16b}, x0, #16 + ld1 {v1.16b}, x1, #16 + ld1 {v2.8b}, x0, x3 + ld1 {v3.8b}, x1, x4 + add v0.8h, v0.8h, v1.8h + add v2.4h, v2.4h, v3.4h + saddl v16.4s, v0.4h, v30.4h + saddl2 v17.4s, v0.8h, v30.8h + saddl v18.4s, v2.4h, v30.4h + shrn v0.4h, v16.4s, #7 + shrn2 v0.8h, v17.4s, #7 + shrn v1.4h, v18.4s, #7 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + st1 {v0.8b}, x2, #8 + st1 {v1.s}0, x2, x5 + cbnz w12, .loop_addAvg_12X\h + ret +endfunc +.endm + +addAvg_12xN 16 +addAvg_12xN 32 + +.macro addAvg_16xN h +function PFX(addAvg_16x\h\()_neon) + addAvg_start + mov w12, #\h +.loop_addavg_16x\h: + sub w12, w12, #1 + ld1 {v0.8h-v1.8h}, x0, x3 + ld1 {v2.8h-v3.8h}, x1, x4 + addavg_1 v0, v2 + addavg_1 v1, v3 + sqxtun v0.8b, v0.8h + sqxtun2 v0.16b, v1.8h + st1 {v0.16b}, x2, x5 + cbnz w12, .loop_addavg_16x\h + ret +endfunc +.endm + +addAvg_16xN 4 +addAvg_16xN 8 +addAvg_16xN 12 +addAvg_16xN 16 +addAvg_16xN 24 +addAvg_16xN 32 +addAvg_16xN 64 + +.macro addAvg_24xN h +function PFX(addAvg_24x\h\()_neon) + addAvg_start + mov w12, #\h +.loop_addavg_24x\h\(): + sub w12, w12, #1 + ld1 {v0.16b-v2.16b}, x0, x3 + ld1 {v3.16b-v5.16b}, x1, x4 + addavg_1 v0, v3 + addavg_1 v1, v4 + addavg_1 v2, v5 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + sqxtun v2.8b, v2.8h + st1 {v0.8b-v2.8b}, x2, x5 + cbnz w12, .loop_addavg_24x\h + ret +endfunc +.endm + +addAvg_24xN 32 +addAvg_24xN 64 + +.macro addAvg_32xN h +function PFX(addAvg_32x\h\()_neon) + addAvg_start + mov w12, #\h +.loop_addavg_32x\h\(): + sub w12, w12, #1 + ld1 {v0.8h-v3.8h}, x0, x3 + ld1 {v4.8h-v7.8h}, x1, x4 + addavg_1 v0, v4 + addavg_1 v1, v5 + addavg_1 v2, v6 + addavg_1 v3, v7 + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + sqxtun v2.8b, v2.8h + sqxtun v3.8b, v3.8h + st1 {v0.8b-v3.8b}, x2, x5 + cbnz w12, .loop_addavg_32x\h + ret +endfunc +.endm + +addAvg_32xN 8 +addAvg_32xN 16 +addAvg_32xN 24 +addAvg_32xN 32 +addAvg_32xN 48 +addAvg_32xN 64 + +function PFX(addAvg_48x64_neon) + addAvg_start + sub x3, x3, #64 + sub x4, x4, #64 + mov w12, #64 +.loop_addavg_48x64: + sub w12, w12, #1 + ld1 {v0.8h-v3.8h}, x0, #64 + ld1 {v4.8h-v7.8h}, x1, #64 + ld1 {v20.8h-v21.8h}, x0, x3 + ld1 {v22.8h-v23.8h}, x1, x4 + addavg_1 v0, v4 + addavg_1 v1, v5 + addavg_1 v2, v6 + addavg_1 v3, v7 + addavg_1 v20, v22 + addavg_1 v21, v23 + sqxtun v0.8b, v0.8h + sqxtun2 v0.16b, v1.8h + sqxtun v1.8b, v2.8h + sqxtun2 v1.16b, v3.8h + sqxtun v2.8b, v20.8h + sqxtun2 v2.16b, v21.8h + st1 {v0.16b-v2.16b}, x2, x5 + cbnz w12, .loop_addavg_48x64 + ret +endfunc + +.macro addAvg_64xN h +function PFX(addAvg_64x\h\()_neon) + addAvg_start + mov w12, #\h + sub x3, x3, #64 + sub x4, x4, #64 +.loop_addavg_64x\h\(): + sub w12, w12, #1 + ld1 {v0.8h-v3.8h}, x0, #64 + ld1 {v4.8h-v7.8h}, x1, #64 + ld1 {v20.8h-v23.8h}, x0, x3 + ld1 {v24.8h-v27.8h}, x1, x4 + addavg_1 v0, v4 + addavg_1 v1, v5 + addavg_1 v2, v6 + addavg_1 v3, v7 + addavg_1 v20, v24 + addavg_1 v21, v25 + addavg_1 v22, v26 + addavg_1 v23, v27 + sqxtun v0.8b, v0.8h + sqxtun2 v0.16b, v1.8h + sqxtun v1.8b, v2.8h + sqxtun2 v1.16b, v3.8h + sqxtun v2.8b, v20.8h + sqxtun2 v2.16b, v21.8h + sqxtun v3.8b, v22.8h + sqxtun2 v3.16b, v23.8h + st1 {v0.16b-v3.16b}, x2, x5 + cbnz w12, 
.loop_addavg_64x\h + ret +endfunc +.endm + +addAvg_64xN 16 +addAvg_64xN 32 +addAvg_64xN 48 +addAvg_64xN 64
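The pixel_avg_pp_WxH kernels in this file (and their SVE2 counterparts above) map one-to-one onto the urhadd instruction: each destination pixel is the rounded average of the two source pixels. A scalar sketch of that behaviour, assuming 8-bit pixels; the name and argument layout are illustrative rather than the exact upstream primitive:

// Scalar model of pixel_avg_pp: dst = (src0 + src1 + 1) >> 1 per pixel,
// which is exactly what NEON/SVE2 urhadd computes lane-wise.
#include <cstdint>

static void pixel_avg_pp_scalar(uint8_t* dst, intptr_t dstStride,
                                const uint8_t* src0, intptr_t src0Stride,
                                const uint8_t* src1, intptr_t src1Stride,
                                int bx, int by)
{
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            dst[x] = (uint8_t)((src0[x] + src1[x] + 1) >> 1);

        dst  += dstStride;
        src0 += src0Stride;
        src1 += src1Stride;
    }
}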
View file
x265_3.6.tar.gz/source/common/aarch64/p2s-common.S
Added
@@ -0,0 +1,102 @@
+/*****************************************************************************
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
+ *
+ * Authors: David Chen <david.chen@myais.com.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+// This file contains the macros written using NEON instruction set
+// that are also used by the SVE2 functions
+
+.arch           armv8-a
+
+#ifdef __APPLE__
+.section __RODATA,__rodata
+#else
+.section .rodata
+#endif
+
+.align 4
+
+#if HIGH_BIT_DEPTH
+# if BIT_DEPTH == 10
+#  define P2S_SHIFT 4
+# elif BIT_DEPTH == 12
+#  define P2S_SHIFT 2
+# endif
+.macro p2s_start
+    add             x3, x3, x3
+    add             x1, x1, x1
+    movi            v31.8h, #0xe0, lsl #8
+.endm
+
+#else // if !HIGH_BIT_DEPTH
+# define P2S_SHIFT 6
+.macro p2s_start
+    add             x3, x3, x3
+    movi            v31.8h, #0xe0, lsl #8
+.endm
+#endif // HIGH_BIT_DEPTH
+
+.macro p2s_2x2
+#if HIGH_BIT_DEPTH
+    ld1             {v0.s}[0], [x0], x1
+    ld1             {v0.s}[1], [x0], x1
+    shl             v3.8h, v0.8h, #P2S_SHIFT
+#else
+    ldrh            w10, [x0]
+    add             x0, x0, x1
+    ldrh            w11, [x0]
+    orr             w10, w10, w11, lsl #16
+    add             x0, x0, x1
+    dup             v0.4s, w10
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
+#endif
+    add             v3.8h, v3.8h, v31.8h
+    st1             {v3.s}[0], [x2], x3
+    st1             {v3.s}[1], [x2], x3
+.endm
+
+.macro p2s_6x2
+#if HIGH_BIT_DEPTH
+    ld1             {v0.d}[0], [x0], #8
+    ld1             {v1.s}[0], [x0], x1
+    ld1             {v0.d}[1], [x0], #8
+    ld1             {v1.s}[1], [x0], x1
+    shl             v3.8h, v0.8h, #P2S_SHIFT
+    shl             v4.8h, v1.8h, #P2S_SHIFT
+#else
+    ldr             s0, [x0]
+    ldrh            w10, [x0, #4]
+    add             x0, x0, x1
+    ld1             {v0.s}[1], [x0]
+    ldrh            w11, [x0, #4]
+    add             x0, x0, x1
+    orr             w10, w10, w11, lsl #16
+    dup             v1.4s, w10
+    ushll           v3.8h, v0.8b, #P2S_SHIFT
+    ushll           v4.8h, v1.8b, #P2S_SHIFT
+#endif
+    add             v3.8h, v3.8h, v31.8h
+    add             v4.8h, v4.8h, v31.8h
+    st1             {v3.d}[0], [x2], #8
+    st1             {v4.s}[0], [x2], x3
+    st1             {v3.d}[1], [x2], #8
+    st1             {v4.s}[1], [x2], x3
+.endm
View file
x265_3.6.tar.gz/source/common/aarch64/p2s-sve.S
Added
@@ -0,0 +1,445 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "p2s-common.S" + +.arch armv8-a+sve + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +#if HIGH_BIT_DEPTH +# if BIT_DEPTH == 10 +# define P2S_SHIFT 4 +# elif BIT_DEPTH == 12 +# define P2S_SHIFT 2 +# endif + +.macro p2s_start_sve + add x3, x3, x3 + add x1, x1, x1 + mov z31.h, #0xe0, lsl #8 +.endm + +#else // if !HIGH_BIT_DEPTH +# define P2S_SHIFT 6 +.macro p2s_start_sve + add x3, x3, x3 + mov z31.h, #0xe0, lsl #8 +.endm + +#endif // HIGH_BIT_DEPTH + +// filterPixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride) +.macro p2s_2xN_sve h +function PFX(filterPixelToShort_2x\h\()_sve) + p2s_start_sve +.rept \h / 2 + p2s_2x2 +.endr + ret +endfunc +.endm + +p2s_2xN_sve 4 +p2s_2xN_sve 8 +p2s_2xN_sve 16 + +.macro p2s_6xN_sve h +function PFX(filterPixelToShort_6x\h\()_sve) + p2s_start_sve + sub x3, x3, #8 +#if HIGH_BIT_DEPTH + sub x1, x1, #8 +#endif +.rept \h / 2 + p2s_6x2 +.endr + ret +endfunc +.endm + +p2s_6xN_sve 8 +p2s_6xN_sve 16 + +function PFX(filterPixelToShort_4x2_sve) + p2s_start_sve +#if HIGH_BIT_DEPTH + ptrue p0.h, vl8 + index z1.d, #0, x1 + index z2.d, #0, x3 + ld1d {z3.d}, p0/z, x0, z1.d + lsl z3.h, p0/m, z3.h, #P2S_SHIFT + add z3.h, p0/m, z3.h, z31.h + st1d {z3.d}, p0, x2, z2.d +#else + ptrue p0.h, vl4 + ld1b {z0.h}, p0/z, x0 + add x0, x0, x1 + ld1b {z1.h}, p0/z, x0 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + st1h {z0.h}, p0, x2 + add x2, x2, x3 + st1h {z1.h}, p0, x2 +#endif + ret +endfunc + + +.macro p2s_8xN_sve h +function PFX(filterPixelToShort_8x\h\()_sve) + p2s_start_sve + ptrue p0.h, vl8 +.rept \h +#if HIGH_BIT_DEPTH + ld1d {z0.d}, p0/z, x0 + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + st1h {z0.h}, p0, x2 + add x2, x2, x3 +#else + ld1b {z0.h}, p0/z, x0 + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + st1h {z0.h}, p0, x2 + add x2, x2, x3 +#endif +.endr + ret +endfunc +.endm + +p2s_8xN_sve 2 + +.macro p2s_32xN_sve h +function PFX(filterPixelToShort_32x\h\()_sve) +#if HIGH_BIT_DEPTH + p2s_start_sve + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_filterPixelToShort_high_32x\h + ptrue p0.h, vl8 +.rept \h + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + ld1h {z2.h}, p0/z, x0, 
#2, mul vl + ld1h {z3.h}, p0/z, x0, #3, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + lsl z2.h, p0/m, z2.h, #P2S_SHIFT + lsl z3.h, p0/m, z3.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + add z2.h, p0/m, z2.h, z31.h + add z3.h, p0/m, z3.h, z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + st1h {z2.h}, p0, x2, #2, mul vl + st1h {z3.h}, p0, x2, #3, mul vl + add x2, x2, x3 +.endr + ret +.vl_gt_16_filterPixelToShort_high_32x\h\(): + cmp x9, #48 + bgt .vl_gt_48_filterPixelToShort_high_32x\h + ptrue p0.h, vl16 +.rept \h + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + add x2, x2, x3 +.endr + ret +.vl_gt_48_filterPixelToShort_high_32x\h\(): + ptrue p0.h, vl32 +.rept \h + ld1h {z0.h}, p0/z, x0 + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + st1h {z0.h}, p0, x2 + add x2, x2, x3 +.endr + ret +#else + p2s_start + mov x9, #\h +.loop_filter_sve_P2S_32x\h: + sub x9, x9, #1 + ld1 {v0.16b-v1.16b}, x0, x1 + ushll v22.8h, v0.8b, #P2S_SHIFT + ushll2 v23.8h, v0.16b, #P2S_SHIFT + ushll v24.8h, v1.8b, #P2S_SHIFT + ushll2 v25.8h, v1.16b, #P2S_SHIFT + add v22.8h, v22.8h, v31.8h + add v23.8h, v23.8h, v31.8h + add v24.8h, v24.8h, v31.8h + add v25.8h, v25.8h, v31.8h + st1 {v22.16b-v25.16b}, x2, x3 + cbnz x9, .loop_filter_sve_P2S_32x\h + ret +#endif +endfunc +.endm + +p2s_32xN_sve 8 +p2s_32xN_sve 16 +p2s_32xN_sve 24 +p2s_32xN_sve 32 +p2s_32xN_sve 48 +p2s_32xN_sve 64 + +.macro p2s_64xN_sve h +function PFX(filterPixelToShort_64x\h\()_sve) +#if HIGH_BIT_DEPTH + p2s_start_sve + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_filterPixelToShort_high_64x\h + ptrue p0.h, vl8 +.rept \h + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + ld1h {z2.h}, p0/z, x0, #2, mul vl + ld1h {z3.h}, p0/z, x0, #3, mul vl + ld1h {z4.h}, p0/z, x0, #4, mul vl + ld1h {z5.h}, p0/z, x0, #5, mul vl + ld1h {z6.h}, p0/z, x0, #6, mul vl + ld1h {z7.h}, p0/z, x0, #7, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + lsl z2.h, p0/m, z2.h, #P2S_SHIFT + lsl z3.h, p0/m, z3.h, #P2S_SHIFT + lsl z4.h, p0/m, z4.h, #P2S_SHIFT + lsl z5.h, p0/m, z5.h, #P2S_SHIFT + lsl z6.h, p0/m, z6.h, #P2S_SHIFT + lsl z7.h, p0/m, z7.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + add z2.h, p0/m, z2.h, z31.h + add z3.h, p0/m, z3.h, z31.h + add z4.h, p0/m, z4.h, z31.h + add z5.h, p0/m, z5.h, z31.h + add z6.h, p0/m, z6.h, z31.h + add z7.h, p0/m, z7.h, z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + st1h {z2.h}, p0, x2, #2, mul vl + st1h {z3.h}, p0, x2, #3, mul vl + st1h {z4.h}, p0, x2, #4, mul vl + st1h {z5.h}, p0, x2, #5, mul vl + st1h {z6.h}, p0, x2, #6, mul vl + st1h {z7.h}, p0, x2, #7, mul vl + add x2, x2, x3 +.endr + ret +.vl_gt_16_filterPixelToShort_high_64x\h\(): + cmp x9, #48 + bgt .vl_gt_48_filterPixelToShort_high_64x\h + ptrue p0.h, vl16 +.rept \h + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + ld1h {z2.h}, p0/z, x0, #2, mul vl + ld1h {z3.h}, p0/z, x0, #3, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + lsl z2.h, p0/m, z2.h, #P2S_SHIFT + lsl z3.h, p0/m, z3.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + add z2.h, p0/m, z2.h, z31.h + add z3.h, p0/m, z3.h, 
z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + st1h {z2.h}, p0, x2, #2, mul vl + st1h {z3.h}, p0, x2, #3, mul vl + add x2, x2, x3 +.endr + ret +.vl_gt_48_filterPixelToShort_high_64x\h\(): + cmp x9, #112 + bgt .vl_gt_112_filterPixelToShort_high_64x\h + ptrue p0.h, vl32 +.rept \h + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + add x2, x2, x3 +.endr + ret +.vl_gt_112_filterPixelToShort_high_64x\h\(): + ptrue p0.h, vl64 +.rept \h + ld1h {z0.h}, p0/z, x0 + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + st1h {z0.h}, p0, x2 + add x2, x2, x3 +.endr + ret +#else + p2s_start + sub x3, x3, #64 + mov x9, #\h +.loop_filter_sve_P2S_64x\h: + sub x9, x9, #1 + ld1 {v0.16b-v3.16b}, x0, x1 + ushll v16.8h, v0.8b, #P2S_SHIFT + ushll2 v17.8h, v0.16b, #P2S_SHIFT + ushll v18.8h, v1.8b, #P2S_SHIFT + ushll2 v19.8h, v1.16b, #P2S_SHIFT + ushll v20.8h, v2.8b, #P2S_SHIFT + ushll2 v21.8h, v2.16b, #P2S_SHIFT + ushll v22.8h, v3.8b, #P2S_SHIFT + ushll2 v23.8h, v3.16b, #P2S_SHIFT + add v16.8h, v16.8h, v31.8h + add v17.8h, v17.8h, v31.8h + add v18.8h, v18.8h, v31.8h + add v19.8h, v19.8h, v31.8h + add v20.8h, v20.8h, v31.8h + add v21.8h, v21.8h, v31.8h + add v22.8h, v22.8h, v31.8h + add v23.8h, v23.8h, v31.8h + st1 {v16.16b-v19.16b}, x2, #64 + st1 {v20.16b-v23.16b}, x2, x3 + cbnz x9, .loop_filter_sve_P2S_64x\h + ret +#endif +endfunc +.endm + +p2s_64xN_sve 16 +p2s_64xN_sve 32 +p2s_64xN_sve 48 +p2s_64xN_sve 64 + +function PFX(filterPixelToShort_48x64_sve) +#if HIGH_BIT_DEPTH + p2s_start_sve + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_filterPixelToShort_high_48x64 + ptrue p0.h, vl8 +.rept 64 + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + ld1h {z2.h}, p0/z, x0, #2, mul vl + ld1h {z3.h}, p0/z, x0, #3, mul vl + ld1h {z4.h}, p0/z, x0, #4, mul vl + ld1h {z5.h}, p0/z, x0, #5, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + lsl z2.h, p0/m, z2.h, #P2S_SHIFT + lsl z3.h, p0/m, z3.h, #P2S_SHIFT + lsl z4.h, p0/m, z4.h, #P2S_SHIFT + lsl z5.h, p0/m, z5.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + add z2.h, p0/m, z2.h, z31.h + add z3.h, p0/m, z3.h, z31.h + add z4.h, p0/m, z4.h, z31.h + add z5.h, p0/m, z5.h, z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + st1h {z2.h}, p0, x2, #2, mul vl + st1h {z3.h}, p0, x2, #3, mul vl + st1h {z4.h}, p0, x2, #4, mul vl + st1h {z5.h}, p0, x2, #5, mul vl + add x2, x2, x3 +.endr + ret +.vl_gt_16_filterPixelToShort_high_48x64: + ptrue p0.h, vl16 +.rept 64 + ld1h {z0.h}, p0/z, x0 + ld1h {z1.h}, p0/z, x0, #1, mul vl + ld1h {z2.h}, p0/z, x0, #2, mul vl + add x0, x0, x1 + lsl z0.h, p0/m, z0.h, #P2S_SHIFT + lsl z1.h, p0/m, z1.h, #P2S_SHIFT + lsl z2.h, p0/m, z2.h, #P2S_SHIFT + add z0.h, p0/m, z0.h, z31.h + add z1.h, p0/m, z1.h, z31.h + add z2.h, p0/m, z2.h, z31.h + st1h {z0.h}, p0, x2 + st1h {z1.h}, p0, x2, #1, mul vl + st1h {z2.h}, p0, x2, #2, mul vl + add x2, x2, x3 +.endr + ret +#else + p2s_start + sub x3, x3, #64 + mov x9, #64 +.loop_filterP2S_sve_48x64: + sub x9, x9, #1 + ld1 {v0.16b-v2.16b}, x0, x1 + ushll v16.8h, v0.8b, #P2S_SHIFT + ushll2 v17.8h, v0.16b, #P2S_SHIFT + ushll v18.8h, v1.8b, #P2S_SHIFT + ushll2 v19.8h, v1.16b, #P2S_SHIFT + ushll v20.8h, v2.8b, #P2S_SHIFT + ushll2 v21.8h, v2.16b, #P2S_SHIFT + add v16.8h, v16.8h, v31.8h + add 
v17.8h, v17.8h, v31.8h + add v18.8h, v18.8h, v31.8h + add v19.8h, v19.8h, v31.8h + add v20.8h, v20.8h, v31.8h + add v21.8h, v21.8h, v31.8h + st1 {v16.16b-v19.16b}, x2, #64 + st1 {v20.16b-v21.16b}, x2, x3 + cbnz x9, .loop_filterP2S_sve_48x64 + ret +#endif +endfunc
View file
x265_3.6.tar.gz/source/common/aarch64/p2s.S
Added
@@ -0,0 +1,386 @@ +/***************************************************************************** + * Copyright (C) 2021 MulticoreWare, Inc + * + * Authors: Sebastian Pop <spop@amazon.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" +#include "p2s-common.S" + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +// filterPixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride) +.macro p2s_2xN h +function PFX(filterPixelToShort_2x\h\()_neon) + p2s_start +.rept \h / 2 + p2s_2x2 +.endr + ret +endfunc +.endm + +p2s_2xN 4 +p2s_2xN 8 +p2s_2xN 16 + +.macro p2s_6xN h +function PFX(filterPixelToShort_6x\h\()_neon) + p2s_start + sub x3, x3, #8 +#if HIGH_BIT_DEPTH + sub x1, x1, #8 +#endif +.rept \h / 2 + p2s_6x2 +.endr + ret +endfunc +.endm + +p2s_6xN 8 +p2s_6xN 16 + +function PFX(filterPixelToShort_4x2_neon) + p2s_start +#if HIGH_BIT_DEPTH + ld1 {v0.d}0, x0, x1 + ld1 {v0.d}1, x0, x1 + shl v3.8h, v0.8h, #P2S_SHIFT +#else + ld1 {v0.s}0, x0, x1 + ld1 {v0.s}1, x0, x1 + ushll v3.8h, v0.8b, #P2S_SHIFT +#endif + add v3.8h, v3.8h, v31.8h + st1 {v3.d}0, x2, x3 + st1 {v3.d}1, x2, x3 + ret +endfunc + +function PFX(filterPixelToShort_4x4_neon) + p2s_start +#if HIGH_BIT_DEPTH + ld1 {v0.d}0, x0, x1 + ld1 {v0.d}1, x0, x1 + shl v3.8h, v0.8h, #P2S_SHIFT +#else + ld1 {v0.s}0, x0, x1 + ld1 {v0.s}1, x0, x1 + ushll v3.8h, v0.8b, #P2S_SHIFT +#endif + add v3.8h, v3.8h, v31.8h + st1 {v3.d}0, x2, x3 + st1 {v3.d}1, x2, x3 +#if HIGH_BIT_DEPTH + ld1 {v1.d}0, x0, x1 + ld1 {v1.d}1, x0, x1 + shl v4.8h, v1.8h, #P2S_SHIFT +#else + ld1 {v1.s}0, x0, x1 + ld1 {v1.s}1, x0, x1 + ushll v4.8h, v1.8b, #P2S_SHIFT +#endif + add v4.8h, v4.8h, v31.8h + st1 {v4.d}0, x2, x3 + st1 {v4.d}1, x2, x3 + ret +endfunc + +.macro p2s_4xN h +function PFX(filterPixelToShort_4x\h\()_neon) + p2s_start +.rept \h / 2 +#if HIGH_BIT_DEPTH + ld1 {v0.16b}, x0, x1 + shl v0.8h, v0.8h, #P2S_SHIFT +#else + ld1 {v0.8b}, x0, x1 + ushll v0.8h, v0.8b, #P2S_SHIFT +#endif + add v2.4h, v0.4h, v31.4h + st1 {v2.4h}, x2, x3 +#if HIGH_BIT_DEPTH + ld1 {v1.16b}, x0, x1 + shl v1.8h, v1.8h, #P2S_SHIFT +#else + ld1 {v1.8b}, x0, x1 + ushll v1.8h, v1.8b, #P2S_SHIFT +#endif + add v3.4h, v1.4h, v31.4h + st1 {v3.4h}, x2, x3 +.endr + ret +endfunc +.endm + +p2s_4xN 8 +p2s_4xN 16 +p2s_4xN 32 + +.macro p2s_8xN h +function PFX(filterPixelToShort_8x\h\()_neon) + p2s_start +.rept \h / 2 +#if HIGH_BIT_DEPTH + ld1 {v0.16b}, x0, x1 + ld1 {v1.16b}, x0, x1 + shl v0.8h, v0.8h, #P2S_SHIFT + shl v1.8h, v1.8h, #P2S_SHIFT +#else + ld1 {v0.8b}, x0, x1 + ld1 {v1.8b}, x0, x1 + ushll v0.8h, v0.8b, #P2S_SHIFT + ushll v1.8h, 
v1.8b, #P2S_SHIFT +#endif + add v2.8h, v0.8h, v31.8h + st1 {v2.8h}, x2, x3 + add v3.8h, v1.8h, v31.8h + st1 {v3.8h}, x2, x3 +.endr + ret +endfunc +.endm + +p2s_8xN 2 +p2s_8xN 4 +p2s_8xN 6 +p2s_8xN 8 +p2s_8xN 12 +p2s_8xN 16 +p2s_8xN 32 +p2s_8xN 64 + +.macro p2s_12xN h +function PFX(filterPixelToShort_12x\h\()_neon) + p2s_start + sub x3, x3, #16 +.rept \h +#if HIGH_BIT_DEPTH + ld1 {v0.16b-v1.16b}, x0, x1 + shl v2.8h, v0.8h, #P2S_SHIFT + shl v3.8h, v1.8h, #P2S_SHIFT +#else + ld1 {v0.16b}, x0, x1 + ushll v2.8h, v0.8b, #P2S_SHIFT + ushll2 v3.8h, v0.16b, #P2S_SHIFT +#endif + add v2.8h, v2.8h, v31.8h + add v3.8h, v3.8h, v31.8h + st1 {v2.16b}, x2, #16 + st1 {v3.8b}, x2, x3 +.endr + ret +endfunc +.endm + +p2s_12xN 16 +p2s_12xN 32 + +.macro p2s_16xN h +function PFX(filterPixelToShort_16x\h\()_neon) + p2s_start +.rept \h +#if HIGH_BIT_DEPTH + ld1 {v0.16b-v1.16b}, x0, x1 + shl v2.8h, v0.8h, #P2S_SHIFT + shl v3.8h, v1.8h, #P2S_SHIFT +#else + ld1 {v0.16b}, x0, x1 + ushll v2.8h, v0.8b, #P2S_SHIFT + ushll2 v3.8h, v0.16b, #P2S_SHIFT +#endif + add v2.8h, v2.8h, v31.8h + add v3.8h, v3.8h, v31.8h + st1 {v2.16b-v3.16b}, x2, x3 +.endr + ret +endfunc +.endm + +p2s_16xN 4 +p2s_16xN 8 +p2s_16xN 12 +p2s_16xN 16 +p2s_16xN 24 +p2s_16xN 32 +p2s_16xN 64 + +.macro p2s_24xN h +function PFX(filterPixelToShort_24x\h\()_neon) + p2s_start +.rept \h +#if HIGH_BIT_DEPTH + ld1 {v0.16b-v2.16b}, x0, x1 + shl v3.8h, v0.8h, #P2S_SHIFT + shl v4.8h, v1.8h, #P2S_SHIFT + shl v5.8h, v2.8h, #P2S_SHIFT +#else + ld1 {v0.8b-v2.8b}, x0, x1 + ushll v3.8h, v0.8b, #P2S_SHIFT + ushll v4.8h, v1.8b, #P2S_SHIFT + ushll v5.8h, v2.8b, #P2S_SHIFT +#endif + add v3.8h, v3.8h, v31.8h + add v4.8h, v4.8h, v31.8h + add v5.8h, v5.8h, v31.8h + st1 {v3.16b-v5.16b}, x2, x3 +.endr + ret +endfunc +.endm + +p2s_24xN 32 +p2s_24xN 64 + +.macro p2s_32xN h +function PFX(filterPixelToShort_32x\h\()_neon) + p2s_start + mov x9, #\h +.loop_filterP2S_32x\h: + sub x9, x9, #1 +#if HIGH_BIT_DEPTH + ld1 {v0.16b-v3.16b}, x0, x1 + shl v22.8h, v0.8h, #P2S_SHIFT + shl v23.8h, v1.8h, #P2S_SHIFT + shl v24.8h, v2.8h, #P2S_SHIFT + shl v25.8h, v3.8h, #P2S_SHIFT +#else + ld1 {v0.16b-v1.16b}, x0, x1 + ushll v22.8h, v0.8b, #P2S_SHIFT + ushll2 v23.8h, v0.16b, #P2S_SHIFT + ushll v24.8h, v1.8b, #P2S_SHIFT + ushll2 v25.8h, v1.16b, #P2S_SHIFT +#endif + add v22.8h, v22.8h, v31.8h + add v23.8h, v23.8h, v31.8h + add v24.8h, v24.8h, v31.8h + add v25.8h, v25.8h, v31.8h + st1 {v22.16b-v25.16b}, x2, x3 + cbnz x9, .loop_filterP2S_32x\h + ret +endfunc +.endm + +p2s_32xN 8 +p2s_32xN 16 +p2s_32xN 24 +p2s_32xN 32 +p2s_32xN 48 +p2s_32xN 64 + +.macro p2s_64xN h +function PFX(filterPixelToShort_64x\h\()_neon) + p2s_start +#if HIGH_BIT_DEPTH + sub x1, x1, #64 +#endif + sub x3, x3, #64 + mov x9, #\h +.loop_filterP2S_64x\h: + sub x9, x9, #1 +#if HIGH_BIT_DEPTH + ld1 {v0.16b-v3.16b}, x0, #64 + ld1 {v4.16b-v7.16b}, x0, x1 + shl v16.8h, v0.8h, #P2S_SHIFT + shl v17.8h, v1.8h, #P2S_SHIFT + shl v18.8h, v2.8h, #P2S_SHIFT + shl v19.8h, v3.8h, #P2S_SHIFT + shl v20.8h, v4.8h, #P2S_SHIFT + shl v21.8h, v5.8h, #P2S_SHIFT + shl v22.8h, v6.8h, #P2S_SHIFT + shl v23.8h, v7.8h, #P2S_SHIFT +#else + ld1 {v0.16b-v3.16b}, x0, x1 + ushll v16.8h, v0.8b, #P2S_SHIFT + ushll2 v17.8h, v0.16b, #P2S_SHIFT + ushll v18.8h, v1.8b, #P2S_SHIFT + ushll2 v19.8h, v1.16b, #P2S_SHIFT + ushll v20.8h, v2.8b, #P2S_SHIFT + ushll2 v21.8h, v2.16b, #P2S_SHIFT + ushll v22.8h, v3.8b, #P2S_SHIFT + ushll2 v23.8h, v3.16b, #P2S_SHIFT +#endif + add v16.8h, v16.8h, v31.8h + add v17.8h, v17.8h, v31.8h + add v18.8h, v18.8h, v31.8h + add v19.8h, v19.8h, v31.8h + add 
v20.8h, v20.8h, v31.8h + add v21.8h, v21.8h, v31.8h + add v22.8h, v22.8h, v31.8h + add v23.8h, v23.8h, v31.8h + st1 {v16.16b-v19.16b}, x2, #64 + st1 {v20.16b-v23.16b}, x2, x3 + cbnz x9, .loop_filterP2S_64x\h + ret +endfunc +.endm + +p2s_64xN 16 +p2s_64xN 32 +p2s_64xN 48 +p2s_64xN 64 + +function PFX(filterPixelToShort_48x64_neon) + p2s_start +#if HIGH_BIT_DEPTH + sub x1, x1, #64 +#endif + sub x3, x3, #64 + mov x9, #64 +.loop_filterP2S_48x64: + sub x9, x9, #1 +#if HIGH_BIT_DEPTH + ld1 {v0.16b-v3.16b}, x0, #64 + ld1 {v4.16b-v5.16b}, x0, x1 + shl v16.8h, v0.8h, #P2S_SHIFT + shl v17.8h, v1.8h, #P2S_SHIFT + shl v18.8h, v2.8h, #P2S_SHIFT + shl v19.8h, v3.8h, #P2S_SHIFT + shl v20.8h, v4.8h, #P2S_SHIFT + shl v21.8h, v5.8h, #P2S_SHIFT +#else + ld1 {v0.16b-v2.16b}, x0, x1 + ushll v16.8h, v0.8b, #P2S_SHIFT + ushll2 v17.8h, v0.16b, #P2S_SHIFT + ushll v18.8h, v1.8b, #P2S_SHIFT + ushll2 v19.8h, v1.16b, #P2S_SHIFT + ushll v20.8h, v2.8b, #P2S_SHIFT + ushll2 v21.8h, v2.16b, #P2S_SHIFT +#endif + add v16.8h, v16.8h, v31.8h + add v17.8h, v17.8h, v31.8h + add v18.8h, v18.8h, v31.8h + add v19.8h, v19.8h, v31.8h + add v20.8h, v20.8h, v31.8h + add v21.8h, v21.8h, v31.8h + st1 {v16.16b-v19.16b}, x2, #64 + st1 {v20.16b-v21.16b}, x2, x3 + cbnz x9, .loop_filterP2S_48x64 + ret +endfunc
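All of the filterPixelToShort (p2s) variants in p2s-common.S, p2s-sve.S and p2s.S perform the same element-wise conversion: shift the pixel up to the 14-bit internal precision and subtract the internal offset, which is what the v31/z31 constant (0xe000, i.e. -8192 as int16) provides. A scalar sketch for the 8-bit case (P2S_SHIFT == 6); names are illustrative, not the upstream C primitive:

// Scalar model of filterPixelToShort: up-shift to 14-bit internal precision
// and re-bias by -8192 (the movi v31.8h, #0xe0, lsl #8 constant).
#include <cstdint>

static void filterPixelToShort_scalar(const uint8_t* src, intptr_t srcStride,
                                      int16_t* dst, intptr_t dstStride,
                                      int width, int height)
{
    const int shift  = 6;      // P2S_SHIFT for 8-bit input
    const int offset = -8192;  // -(1 << 13), the internal offset

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) + offset);

        src += srcStride;
        dst += dstStride;
    }
}

For 10-bit and 12-bit builds the shift drops to 4 and 2 respectively, matching the P2S_SHIFT definitions above.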
View file
x265_3.6.tar.gz/source/common/aarch64/pixel-prim.cpp
Added
@@ -0,0 +1,2059 @@ +#include "common.h" +#include "slicetype.h" // LOWRES_COST_MASK +#include "primitives.h" +#include "x265.h" + +#include "pixel-prim.h" +#include "arm64-utils.h" +#if HAVE_NEON + +#include <arm_neon.h> + +using namespace X265_NS; + + + +namespace +{ + + +/* SATD SA8D variants - based on x264 */ +static inline void SUMSUB_AB(int16x8_t &sum, int16x8_t &sub, const int16x8_t a, const int16x8_t b) +{ + sum = vaddq_s16(a, b); + sub = vsubq_s16(a, b); +} + +static inline void transpose_8h(int16x8_t &t1, int16x8_t &t2, const int16x8_t s1, const int16x8_t s2) +{ + t1 = vtrn1q_s16(s1, s2); + t2 = vtrn2q_s16(s1, s2); +} + +static inline void transpose_4s(int16x8_t &t1, int16x8_t &t2, const int16x8_t s1, const int16x8_t s2) +{ + t1 = vtrn1q_s32(s1, s2); + t2 = vtrn2q_s32(s1, s2); +} + +#if (X265_DEPTH <= 10) +static inline void transpose_2d(int16x8_t &t1, int16x8_t &t2, const int16x8_t s1, const int16x8_t s2) +{ + t1 = vtrn1q_s64(s1, s2); + t2 = vtrn2q_s64(s1, s2); +} +#endif + + +static inline void SUMSUB_ABCD(int16x8_t &s1, int16x8_t &d1, int16x8_t &s2, int16x8_t &d2, + int16x8_t a, int16x8_t b, int16x8_t c, int16x8_t d) +{ + SUMSUB_AB(s1, d1, a, b); + SUMSUB_AB(s2, d2, c, d); +} + +static inline void HADAMARD4_V(int16x8_t &r1, int16x8_t &r2, int16x8_t &r3, int16x8_t &r4, + int16x8_t &t1, int16x8_t &t2, int16x8_t &t3, int16x8_t &t4) +{ + SUMSUB_ABCD(t1, t2, t3, t4, r1, r2, r3, r4); + SUMSUB_ABCD(r1, r3, r2, r4, t1, t3, t2, t4); +} + + +static int _satd_4x8_8x4_end_neon(int16x8_t v0, int16x8_t v1, int16x8_t v2, int16x8_t v3) + +{ + + int16x8_t v4, v5, v6, v7, v16, v17, v18, v19; + + + SUMSUB_AB(v16, v17, v0, v1); + SUMSUB_AB(v18, v19, v2, v3); + + SUMSUB_AB(v4 , v6 , v16, v18); + SUMSUB_AB(v5 , v7 , v17, v19); + + v0 = vtrn1q_s16(v4, v5); + v1 = vtrn2q_s16(v4, v5); + v2 = vtrn1q_s16(v6, v7); + v3 = vtrn2q_s16(v6, v7); + + SUMSUB_AB(v16, v17, v0, v1); + SUMSUB_AB(v18, v19, v2, v3); + + v0 = vtrn1q_s32(v16, v18); + v1 = vtrn2q_s32(v16, v18); + v2 = vtrn1q_s32(v17, v19); + v3 = vtrn2q_s32(v17, v19); + + v0 = vabsq_s16(v0); + v1 = vabsq_s16(v1); + v2 = vabsq_s16(v2); + v3 = vabsq_s16(v3); + + v0 = vmaxq_u16(v0, v1); + v1 = vmaxq_u16(v2, v3); + + v0 = vaddq_u16(v0, v1); + return vaddlvq_u16(v0); +} + +static inline int _satd_4x4_neon(int16x8_t v0, int16x8_t v1) +{ + int16x8_t v2, v3; + SUMSUB_AB(v2, v3, v0, v1); + + v0 = vzip1q_s64(v2, v3); + v1 = vzip2q_s64(v2, v3); + SUMSUB_AB(v2, v3, v0, v1); + + v0 = vtrn1q_s16(v2, v3); + v1 = vtrn2q_s16(v2, v3); + SUMSUB_AB(v2, v3, v0, v1); + + v0 = vtrn1q_s32(v2, v3); + v1 = vtrn2q_s32(v2, v3); + + v0 = vabsq_s16(v0); + v1 = vabsq_s16(v1); + v0 = vmaxq_u16(v0, v1); + + return vaddlvq_s16(v0); +} + +static void _satd_8x4v_8x8h_neon(int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3, int16x8_t &v20, + int16x8_t &v21, int16x8_t &v22, int16x8_t &v23) +{ + int16x8_t v16, v17, v18, v19, v4, v5, v6, v7; + + SUMSUB_AB(v16, v18, v0, v2); + SUMSUB_AB(v17, v19, v1, v3); + + HADAMARD4_V(v20, v21, v22, v23, v0, v1, v2, v3); + + transpose_8h(v0, v1, v16, v17); + transpose_8h(v2, v3, v18, v19); + transpose_8h(v4, v5, v20, v21); + transpose_8h(v6, v7, v22, v23); + + SUMSUB_AB(v16, v17, v0, v1); + SUMSUB_AB(v18, v19, v2, v3); + SUMSUB_AB(v20, v21, v4, v5); + SUMSUB_AB(v22, v23, v6, v7); + + transpose_4s(v0, v2, v16, v18); + transpose_4s(v1, v3, v17, v19); + transpose_4s(v4, v6, v20, v22); + transpose_4s(v5, v7, v21, v23); + + v0 = vabsq_s16(v0); + v1 = vabsq_s16(v1); + v2 = vabsq_s16(v2); + v3 = vabsq_s16(v3); + v4 = vabsq_s16(v4); + v5 = vabsq_s16(v5); + 
v6 = vabsq_s16(v6); + v7 = vabsq_s16(v7); + + v0 = vmaxq_u16(v0, v2); + v1 = vmaxq_u16(v1, v3); + v2 = vmaxq_u16(v4, v6); + v3 = vmaxq_u16(v5, v7); + +} + +#if HIGH_BIT_DEPTH + +#if (X265_DEPTH > 10) +static inline void transpose_2d(int32x4_t &t1, int32x4_t &t2, const int32x4_t s1, const int32x4_t s2) +{ + t1 = vtrn1q_s64(s1, s2); + t2 = vtrn2q_s64(s1, s2); +} + +static inline void ISUMSUB_AB(int32x4_t &sum, int32x4_t &sub, const int32x4_t a, const int32x4_t b) +{ + sum = vaddq_s32(a, b); + sub = vsubq_s32(a, b); +} + +static inline void ISUMSUB_AB_FROM_INT16(int32x4_t &suml, int32x4_t &sumh, int32x4_t &subl, int32x4_t &subh, + const int16x8_t a, const int16x8_t b) +{ + suml = vaddl_s16(vget_low_s16(a), vget_low_s16(b)); + sumh = vaddl_high_s16(a, b); + subl = vsubl_s16(vget_low_s16(a), vget_low_s16(b)); + subh = vsubl_high_s16(a, b); +} + +#endif + +static inline void _sub_8x8_fly(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2, + int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3, + int16x8_t &v20, int16x8_t &v21, int16x8_t &v22, int16x8_t &v23) +{ + uint16x8_t r0, r1, r2, r3; + uint16x8_t t0, t1, t2, t3; + int16x8_t v16, v17; + int16x8_t v18, v19; + + r0 = *(uint16x8_t *)(pix1 + 0 * stride_pix1); + r1 = *(uint16x8_t *)(pix1 + 1 * stride_pix1); + r2 = *(uint16x8_t *)(pix1 + 2 * stride_pix1); + r3 = *(uint16x8_t *)(pix1 + 3 * stride_pix1); + + t0 = *(uint16x8_t *)(pix2 + 0 * stride_pix2); + t1 = *(uint16x8_t *)(pix2 + 1 * stride_pix2); + t2 = *(uint16x8_t *)(pix2 + 2 * stride_pix2); + t3 = *(uint16x8_t *)(pix2 + 3 * stride_pix2); + + v16 = vsubq_u16(r0, t0); + v17 = vsubq_u16(r1, t1); + v18 = vsubq_u16(r2, t2); + v19 = vsubq_u16(r3, t3); + + r0 = *(uint16x8_t *)(pix1 + 4 * stride_pix1); + r1 = *(uint16x8_t *)(pix1 + 5 * stride_pix1); + r2 = *(uint16x8_t *)(pix1 + 6 * stride_pix1); + r3 = *(uint16x8_t *)(pix1 + 7 * stride_pix1); + + t0 = *(uint16x8_t *)(pix2 + 4 * stride_pix2); + t1 = *(uint16x8_t *)(pix2 + 5 * stride_pix2); + t2 = *(uint16x8_t *)(pix2 + 6 * stride_pix2); + t3 = *(uint16x8_t *)(pix2 + 7 * stride_pix2); + + v20 = vsubq_u16(r0, t0); + v21 = vsubq_u16(r1, t1); + v22 = vsubq_u16(r2, t2); + v23 = vsubq_u16(r3, t3); + + SUMSUB_AB(v0, v1, v16, v17); + SUMSUB_AB(v2, v3, v18, v19); + +} + + + + +static void _satd_16x4_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2, + int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3) +{ + uint8x16_t r0, r1, r2, r3; + uint8x16_t t0, t1, t2, t3; + int16x8_t v16, v17, v20, v21; + int16x8_t v18, v19, v22, v23; + + r0 = *(int16x8_t *)(pix1 + 0 * stride_pix1); + r1 = *(int16x8_t *)(pix1 + 1 * stride_pix1); + r2 = *(int16x8_t *)(pix1 + 2 * stride_pix1); + r3 = *(int16x8_t *)(pix1 + 3 * stride_pix1); + + t0 = *(int16x8_t *)(pix2 + 0 * stride_pix2); + t1 = *(int16x8_t *)(pix2 + 1 * stride_pix2); + t2 = *(int16x8_t *)(pix2 + 2 * stride_pix2); + t3 = *(int16x8_t *)(pix2 + 3 * stride_pix2); + + + v16 = vsubq_u16((r0), (t0)); + v17 = vsubq_u16((r1), (t1)); + v18 = vsubq_u16((r2), (t2)); + v19 = vsubq_u16((r3), (t3)); + + r0 = *(int16x8_t *)(pix1 + 0 * stride_pix1 + 8); + r1 = *(int16x8_t *)(pix1 + 1 * stride_pix1 + 8); + r2 = *(int16x8_t *)(pix1 + 2 * stride_pix1 + 8); + r3 = *(int16x8_t *)(pix1 + 3 * stride_pix1 + 8); + + t0 = *(int16x8_t *)(pix2 + 0 * stride_pix2 + 8); + t1 = *(int16x8_t *)(pix2 + 1 * stride_pix2 + 8); + t2 = *(int16x8_t *)(pix2 + 2 * stride_pix2 + 8); + t3 = *(int16x8_t *)(pix2 + 3 * stride_pix2 + 8); + + + v20 = vsubq_u16(r0, t0); + v21 = 
vsubq_u16(r1, t1); + v22 = vsubq_u16(r2, t2); + v23 = vsubq_u16(r3, t3); + + SUMSUB_AB(v0, v1, v16, v17); + SUMSUB_AB(v2, v3, v18, v19); + + _satd_8x4v_8x8h_neon(v0, v1, v2, v3, v20, v21, v22, v23); + +} + + +int pixel_satd_4x4_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2) +{ + uint64x2_t t0, t1, r0, r1; + t00 = *(uint64_t *)(pix1 + 0 * stride_pix1); + t10 = *(uint64_t *)(pix1 + 1 * stride_pix1); + t01 = *(uint64_t *)(pix1 + 2 * stride_pix1); + t11 = *(uint64_t *)(pix1 + 3 * stride_pix1); + + r00 = *(uint64_t *)(pix2 + 0 * stride_pix1); + r10 = *(uint64_t *)(pix2 + 1 * stride_pix2); + r01 = *(uint64_t *)(pix2 + 2 * stride_pix2); + r11 = *(uint64_t *)(pix2 + 3 * stride_pix2); + + return _satd_4x4_neon(vsubq_u16(t0, r0), vsubq_u16(r1, t1)); +} + + + + + + +int pixel_satd_8x4_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2) +{ + uint16x8_t i0, i1, i2, i3, i4, i5, i6, i7; + + i0 = *(uint16x8_t *)(pix1 + 0 * stride_pix1); + i1 = *(uint16x8_t *)(pix2 + 0 * stride_pix2); + i2 = *(uint16x8_t *)(pix1 + 1 * stride_pix1); + i3 = *(uint16x8_t *)(pix2 + 1 * stride_pix2); + i4 = *(uint16x8_t *)(pix1 + 2 * stride_pix1); + i5 = *(uint16x8_t *)(pix2 + 2 * stride_pix2); + i6 = *(uint16x8_t *)(pix1 + 3 * stride_pix1); + i7 = *(uint16x8_t *)(pix2 + 3 * stride_pix2); + + int16x8_t v0 = vsubq_u16(i0, i1); + int16x8_t v1 = vsubq_u16(i2, i3); + int16x8_t v2 = vsubq_u16(i4, i5); + int16x8_t v3 = vsubq_u16(i6, i7); + + return _satd_4x8_8x4_end_neon(v0, v1, v2, v3); +} + + +int pixel_satd_16x16_neon(const uint16_t *pix1, intptr_t stride_pix1, const uint16_t *pix2, intptr_t stride_pix2) +{ + int32x4_t v30 = vdupq_n_u32(0), v31 = vdupq_n_u32(0); + int16x8_t v0, v1, v2, v3; + for (int offset = 0; offset <= 12; offset += 4) { + _satd_16x4_neon(pix1 + offset * stride_pix1, stride_pix1, pix2 + offset * stride_pix2, stride_pix2, v0, v1, v2, v3); + v30 = vpadalq_u16(v30, v0); + v30 = vpadalq_u16(v30, v1); + v31 = vpadalq_u16(v31, v2); + v31 = vpadalq_u16(v31, v3); + } + return vaddvq_s32(vaddq_s32(v30, v31)); + +} + +#else //HIGH_BIT_DEPTH + +static void _satd_16x4_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2, + int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3) +{ + uint8x16_t r0, r1, r2, r3; + uint8x16_t t0, t1, t2, t3; + int16x8_t v16, v17, v20, v21; + int16x8_t v18, v19, v22, v23; + + r0 = *(uint8x16_t *)(pix1 + 0 * stride_pix1); + r1 = *(uint8x16_t *)(pix1 + 1 * stride_pix1); + r2 = *(uint8x16_t *)(pix1 + 2 * stride_pix1); + r3 = *(uint8x16_t *)(pix1 + 3 * stride_pix1); + + t0 = *(uint8x16_t *)(pix2 + 0 * stride_pix2); + t1 = *(uint8x16_t *)(pix2 + 1 * stride_pix2); + t2 = *(uint8x16_t *)(pix2 + 2 * stride_pix2); + t3 = *(uint8x16_t *)(pix2 + 3 * stride_pix2); + + + + v16 = vsubl_u8(vget_low_u8(r0), vget_low_u8(t0)); + v20 = vsubl_high_u8(r0, t0); + v17 = vsubl_u8(vget_low_u8(r1), vget_low_u8(t1)); + v21 = vsubl_high_u8(r1, t1); + v18 = vsubl_u8(vget_low_u8(r2), vget_low_u8(t2)); + v22 = vsubl_high_u8(r2, t2); + v19 = vsubl_u8(vget_low_u8(r3), vget_low_u8(t3)); + v23 = vsubl_high_u8(r3, t3); + + SUMSUB_AB(v0, v1, v16, v17); + SUMSUB_AB(v2, v3, v18, v19); + + _satd_8x4v_8x8h_neon(v0, v1, v2, v3, v20, v21, v22, v23); + +} + + +static inline void _sub_8x8_fly(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2, + int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3, + int16x8_t &v20, int16x8_t &v21, int16x8_t &v22, int16x8_t &v23) +{ + 
uint8x8_t r0, r1, r2, r3; + uint8x8_t t0, t1, t2, t3; + int16x8_t v16, v17; + int16x8_t v18, v19; + + r0 = *(uint8x8_t *)(pix1 + 0 * stride_pix1); + r1 = *(uint8x8_t *)(pix1 + 1 * stride_pix1); + r2 = *(uint8x8_t *)(pix1 + 2 * stride_pix1); + r3 = *(uint8x8_t *)(pix1 + 3 * stride_pix1); + + t0 = *(uint8x8_t *)(pix2 + 0 * stride_pix2); + t1 = *(uint8x8_t *)(pix2 + 1 * stride_pix2); + t2 = *(uint8x8_t *)(pix2 + 2 * stride_pix2); + t3 = *(uint8x8_t *)(pix2 + 3 * stride_pix2); + + v16 = vsubl_u8(r0, t0); + v17 = vsubl_u8(r1, t1); + v18 = vsubl_u8(r2, t2); + v19 = vsubl_u8(r3, t3); + + r0 = *(uint8x8_t *)(pix1 + 4 * stride_pix1); + r1 = *(uint8x8_t *)(pix1 + 5 * stride_pix1); + r2 = *(uint8x8_t *)(pix1 + 6 * stride_pix1); + r3 = *(uint8x8_t *)(pix1 + 7 * stride_pix1); + + t0 = *(uint8x8_t *)(pix2 + 4 * stride_pix2); + t1 = *(uint8x8_t *)(pix2 + 5 * stride_pix2); + t2 = *(uint8x8_t *)(pix2 + 6 * stride_pix2); + t3 = *(uint8x8_t *)(pix2 + 7 * stride_pix2); + + v20 = vsubl_u8(r0, t0); + v21 = vsubl_u8(r1, t1); + v22 = vsubl_u8(r2, t2); + v23 = vsubl_u8(r3, t3); + + + SUMSUB_AB(v0, v1, v16, v17); + SUMSUB_AB(v2, v3, v18, v19); + +} + +int pixel_satd_4x4_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2) +{ + uint32x2_t t0, t1, r0, r1; + t00 = *(uint32_t *)(pix1 + 0 * stride_pix1); + t10 = *(uint32_t *)(pix1 + 1 * stride_pix1); + t01 = *(uint32_t *)(pix1 + 2 * stride_pix1); + t11 = *(uint32_t *)(pix1 + 3 * stride_pix1); + + r00 = *(uint32_t *)(pix2 + 0 * stride_pix1); + r10 = *(uint32_t *)(pix2 + 1 * stride_pix2); + r01 = *(uint32_t *)(pix2 + 2 * stride_pix2); + r11 = *(uint32_t *)(pix2 + 3 * stride_pix2); + + return _satd_4x4_neon(vsubl_u8(t0, r0), vsubl_u8(r1, t1)); +} + + +int pixel_satd_8x4_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2) +{ + uint8x8_t i0, i1, i2, i3, i4, i5, i6, i7; + + i0 = *(uint8x8_t *)(pix1 + 0 * stride_pix1); + i1 = *(uint8x8_t *)(pix2 + 0 * stride_pix2); + i2 = *(uint8x8_t *)(pix1 + 1 * stride_pix1); + i3 = *(uint8x8_t *)(pix2 + 1 * stride_pix2); + i4 = *(uint8x8_t *)(pix1 + 2 * stride_pix1); + i5 = *(uint8x8_t *)(pix2 + 2 * stride_pix2); + i6 = *(uint8x8_t *)(pix1 + 3 * stride_pix1); + i7 = *(uint8x8_t *)(pix2 + 3 * stride_pix2); + + int16x8_t v0 = vsubl_u8(i0, i1); + int16x8_t v1 = vsubl_u8(i2, i3); + int16x8_t v2 = vsubl_u8(i4, i5); + int16x8_t v3 = vsubl_u8(i6, i7); + + return _satd_4x8_8x4_end_neon(v0, v1, v2, v3); +} + +int pixel_satd_16x16_neon(const uint8_t *pix1, intptr_t stride_pix1, const uint8_t *pix2, intptr_t stride_pix2) +{ + int16x8_t v30, v31; + int16x8_t v0, v1, v2, v3; + + _satd_16x4_neon(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3); + v30 = vaddq_s16(v0, v1); + v31 = vaddq_s16(v2, v3); + + _satd_16x4_neon(pix1 + 4 * stride_pix1, stride_pix1, pix2 + 4 * stride_pix2, stride_pix2, v0, v1, v2, v3); + v0 = vaddq_s16(v0, v1); + v1 = vaddq_s16(v2, v3); + v30 = vaddq_s16(v30, v0); + v31 = vaddq_s16(v31, v1); + + _satd_16x4_neon(pix1 + 8 * stride_pix1, stride_pix1, pix2 + 8 * stride_pix2, stride_pix2, v0, v1, v2, v3); + v0 = vaddq_s16(v0, v1); + v1 = vaddq_s16(v2, v3); + v30 = vaddq_s16(v30, v0); + v31 = vaddq_s16(v31, v1); + + _satd_16x4_neon(pix1 + 12 * stride_pix1, stride_pix1, pix2 + 12 * stride_pix2, stride_pix2, v0, v1, v2, v3); + v0 = vaddq_s16(v0, v1); + v1 = vaddq_s16(v2, v3); + v30 = vaddq_s16(v30, v0); + v31 = vaddq_s16(v31, v1); + + int32x4_t sum0 = vpaddlq_u16(v30); + int32x4_t sum1 = vpaddlq_u16(v31); + sum0 = vaddq_s32(sum0, sum1); + return 
vaddvq_s32(sum0); + +} +#endif //HIGH_BIT_DEPTH + + +static inline void _sa8d_8x8_neon_end(int16x8_t &v0, int16x8_t &v1, int16x8_t v2, int16x8_t v3, + int16x8_t v20, int16x8_t v21, int16x8_t v22, int16x8_t v23) +{ + int16x8_t v16, v17, v18, v19; + int16x8_t v4, v5, v6, v7; + + SUMSUB_AB(v16, v18, v0, v2); + SUMSUB_AB(v17, v19, v1, v3); + + HADAMARD4_V(v20, v21, v22, v23, v0, v1, v2, v3); + + SUMSUB_AB(v0, v16, v16, v20); + SUMSUB_AB(v1, v17, v17, v21); + SUMSUB_AB(v2, v18, v18, v22); + SUMSUB_AB(v3, v19, v19, v23); + + transpose_8h(v20, v21, v16, v17); + transpose_8h(v4, v5, v0, v1); + transpose_8h(v22, v23, v18, v19); + transpose_8h(v6, v7, v2, v3); + +#if (X265_DEPTH <= 10) + + int16x8_t v24, v25; + + SUMSUB_AB(v2, v3, v20, v21); + SUMSUB_AB(v24, v25, v4, v5); + SUMSUB_AB(v0, v1, v22, v23); + SUMSUB_AB(v4, v5, v6, v7); + + transpose_4s(v20, v22, v2, v0); + transpose_4s(v21, v23, v3, v1); + transpose_4s(v16, v18, v24, v4); + transpose_4s(v17, v19, v25, v5); + + SUMSUB_AB(v0, v2, v20, v22); + SUMSUB_AB(v1, v3, v21, v23); + SUMSUB_AB(v4, v6, v16, v18); + SUMSUB_AB(v5, v7, v17, v19); + + transpose_2d(v16, v20, v0, v4); + transpose_2d(v17, v21, v1, v5); + transpose_2d(v18, v22, v2, v6); + transpose_2d(v19, v23, v3, v7); + + + v16 = vabsq_s16(v16); + v17 = vabsq_s16(v17); + v18 = vabsq_s16(v18); + v19 = vabsq_s16(v19); + v20 = vabsq_s16(v20); + v21 = vabsq_s16(v21); + v22 = vabsq_s16(v22); + v23 = vabsq_s16(v23); + + v16 = vmaxq_u16(v16, v20); + v17 = vmaxq_u16(v17, v21); + v18 = vmaxq_u16(v18, v22); + v19 = vmaxq_u16(v19, v23); + +#if HIGH_BIT_DEPTH + v0 = vpaddlq_u16(v16); + v1 = vpaddlq_u16(v17); + v0 = vpadalq_u16(v0, v18); + v1 = vpadalq_u16(v1, v19); + +#else //HIGH_BIT_DEPTH + + v0 = vaddq_u16(v16, v17); + v1 = vaddq_u16(v18, v19); + +#endif //HIGH_BIT_DEPTH + +#else // HIGH_BIT_DEPTH 12 bit only, switching math to int32, each int16x8 is up-convreted to 2 int32x4 (low and high) + + int32x4_t v2l, v2h, v3l, v3h, v24l, v24h, v25l, v25h, v0l, v0h, v1l, v1h; + int32x4_t v22l, v22h, v23l, v23h; + int32x4_t v4l, v4h, v5l, v5h; + int32x4_t v6l, v6h, v7l, v7h; + int32x4_t v16l, v16h, v17l, v17h; + int32x4_t v18l, v18h, v19l, v19h; + int32x4_t v20l, v20h, v21l, v21h; + + ISUMSUB_AB_FROM_INT16(v2l, v2h, v3l, v3h, v20, v21); + ISUMSUB_AB_FROM_INT16(v24l, v24h, v25l, v25h, v4, v5); + + v22l = vmovl_s16(vget_low_s16(v22)); + v22h = vmovl_high_s16(v22); + v23l = vmovl_s16(vget_low_s16(v23)); + v23h = vmovl_high_s16(v23); + + ISUMSUB_AB(v0l, v1l, v22l, v23l); + ISUMSUB_AB(v0h, v1h, v22h, v23h); + + v6l = vmovl_s16(vget_low_s16(v6)); + v6h = vmovl_high_s16(v6); + v7l = vmovl_s16(vget_low_s16(v7)); + v7h = vmovl_high_s16(v7); + + ISUMSUB_AB(v4l, v5l, v6l, v7l); + ISUMSUB_AB(v4h, v5h, v6h, v7h); + + transpose_2d(v20l, v22l, v2l, v0l); + transpose_2d(v21l, v23l, v3l, v1l); + transpose_2d(v16l, v18l, v24l, v4l); + transpose_2d(v17l, v19l, v25l, v5l); + + transpose_2d(v20h, v22h, v2h, v0h); + transpose_2d(v21h, v23h, v3h, v1h); + transpose_2d(v16h, v18h, v24h, v4h); + transpose_2d(v17h, v19h, v25h, v5h); + + ISUMSUB_AB(v0l, v2l, v20l, v22l); + ISUMSUB_AB(v1l, v3l, v21l, v23l); + ISUMSUB_AB(v4l, v6l, v16l, v18l); + ISUMSUB_AB(v5l, v7l, v17l, v19l); + + ISUMSUB_AB(v0h, v2h, v20h, v22h); + ISUMSUB_AB(v1h, v3h, v21h, v23h); + ISUMSUB_AB(v4h, v6h, v16h, v18h); + ISUMSUB_AB(v5h, v7h, v17h, v19h); + + v16l = v0l; + v16h = v4l; + v20l = v0h; + v20h = v4h; + + v17l = v1l; + v17h = v5l; + v21l = v1h; + v21h = v5h; + + v18l = v2l; + v18h = v6l; + v22l = v2h; + v22h = v6h; + + v19l = v3l; + v19h = v7l; + v23l = v3h; + 
v23h = v7h; + + v16l = vabsq_s32(v16l); + v17l = vabsq_s32(v17l); + v18l = vabsq_s32(v18l); + v19l = vabsq_s32(v19l); + v20l = vabsq_s32(v20l); + v21l = vabsq_s32(v21l); + v22l = vabsq_s32(v22l); + v23l = vabsq_s32(v23l); + + v16h = vabsq_s32(v16h); + v17h = vabsq_s32(v17h); + v18h = vabsq_s32(v18h); + v19h = vabsq_s32(v19h); + v20h = vabsq_s32(v20h); + v21h = vabsq_s32(v21h); + v22h = vabsq_s32(v22h); + v23h = vabsq_s32(v23h); + + v16l = vmaxq_u32(v16l, v20l); + v17l = vmaxq_u32(v17l, v21l); + v18l = vmaxq_u32(v18l, v22l); + v19l = vmaxq_u32(v19l, v23l); + + v16h = vmaxq_u32(v16h, v20h); + v17h = vmaxq_u32(v17h, v21h); + v18h = vmaxq_u32(v18h, v22h); + v19h = vmaxq_u32(v19h, v23h); + + v16l = vaddq_u32(v16l, v16h); + v17l = vaddq_u32(v17l, v17h); + v18l = vaddq_u32(v18l, v18h); + v19l = vaddq_u32(v19l, v19h); + + v0 = vaddq_u32(v16l, v17l); + v1 = vaddq_u32(v18l, v19l); + + +#endif + +} + + + +static inline void _satd_8x8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2, + int16x8_t &v0, int16x8_t &v1, int16x8_t &v2, int16x8_t &v3) +{ + + int16x8_t v20, v21, v22, v23; + _sub_8x8_fly(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23); + _satd_8x4v_8x8h_neon(v0, v1, v2, v3, v20, v21, v22, v23); + +} + + + +int pixel_satd_8x8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2) +{ + int16x8_t v30, v31; + int16x8_t v0, v1, v2, v3; + + _satd_8x8_neon(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3); +#if !(HIGH_BIT_DEPTH) + v30 = vaddq_u16(v0, v1); + v31 = vaddq_u16(v2, v3); + + uint16x8_t sum = vaddq_u16(v30, v31); + return vaddvq_s32(vpaddlq_u16(sum)); +#else + + v30 = vaddq_u16(v0, v1); + v31 = vaddq_u16(v2, v3); + + int32x4_t sum = vpaddlq_u16(v30); + sum = vpadalq_u16(sum, v31); + return vaddvq_s32(sum); +#endif +} + + +int pixel_sa8d_8x8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2) +{ + int16x8_t v0, v1, v2, v3; + int16x8_t v20, v21, v22, v23; + + _sub_8x8_fly(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23); + _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23); + +#if HIGH_BIT_DEPTH + int32x4_t s = vaddq_u32(v0, v1); + return (vaddvq_u32(s) + 1) >> 1; +#else + return (vaddlvq_s16(vaddq_u16(v0, v1)) + 1) >> 1; +#endif +} + + + + + +int pixel_sa8d_16x16_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2) +{ + int16x8_t v0, v1, v2, v3; + int16x8_t v20, v21, v22, v23; + int32x4_t v30, v31; + + _sub_8x8_fly(pix1, stride_pix1, pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23); + _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23); + +#if !(HIGH_BIT_DEPTH) + v30 = vpaddlq_u16(v0); + v31 = vpaddlq_u16(v1); +#else + v30 = vaddq_s32(v0, v1); +#endif + + _sub_8x8_fly(pix1 + 8, stride_pix1, pix2 + 8, stride_pix2, v0, v1, v2, v3, v20, v21, v22, v23); + _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23); + +#if !(HIGH_BIT_DEPTH) + v30 = vpadalq_u16(v30, v0); + v31 = vpadalq_u16(v31, v1); +#else + v31 = vaddq_s32(v0, v1); +#endif + + + _sub_8x8_fly(pix1 + 8 * stride_pix1, stride_pix1, pix2 + 8 * stride_pix2, stride_pix2, v0, v1, v2, v3, v20, v21, v22, + v23); + _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23); + +#if !(HIGH_BIT_DEPTH) + v30 = vpadalq_u16(v30, v0); + v31 = vpadalq_u16(v31, v1); +#else + v30 = vaddq_s32(v30, v0); + v31 = vaddq_s32(v31, v1); +#endif + + _sub_8x8_fly(pix1 + 8 * stride_pix1 + 8, stride_pix1, pix2 + 8 * stride_pix2 + 8, stride_pix2, v0, v1, v2, v3, v20, 
v21, + v22, v23); + _sa8d_8x8_neon_end(v0, v1, v2, v3, v20, v21, v22, v23); + +#if !(HIGH_BIT_DEPTH) + v30 = vpadalq_u16(v30, v0); + v31 = vpadalq_u16(v31, v1); +#else + v30 = vaddq_s32(v30, v0); + v31 = vaddq_s32(v31, v1); +#endif + + v30 = vaddq_u32(v30, v31); + + return (vaddvq_u32(v30) + 1) >> 1; +} + + + + + + + + +template<int size> +void blockfill_s_neon(int16_t *dst, intptr_t dstride, int16_t val) +{ + for (int y = 0; y < size; y++) + { + int x = 0; + int16x8_t v = vdupq_n_s16(val); + for (; (x + 8) <= size; x += 8) + { + *(int16x8_t *)&dsty * dstride + x = v; + } + for (; x < size; x++) + { + dsty * dstride + x = val; + } + } +} + +template<int lx, int ly> +int sad_pp_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2) +{ + int sum = 0; + + + for (int y = 0; y < ly; y++) + { +#if HIGH_BIT_DEPTH + int x = 0; + uint16x8_t vsum16_1 = vdupq_n_u16(0); + for (; (x + 8) <= lx; x += 8) + { + uint16x8_t p1 = *(uint16x8_t *)&pix1x; + uint16x8_t p2 = *(uint16x8_t *)&pix2x; + vsum16_1 = vabaq_s16(vsum16_1, p1, p2); + + } + if (lx & 4) + { + uint16x4_t p1 = *(uint16x4_t *)&pix1x; + uint16x4_t p2 = *(uint16x4_t *)&pix2x; + sum += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p2)); + x += 4; + } + if (lx >= 4) + { + sum += vaddlvq_s16(vsum16_1); + } + +#else + + int x = 0; + uint16x8_t vsum16_1 = vdupq_n_u16(0); + uint16x8_t vsum16_2 = vdupq_n_u16(0); + + for (; (x + 16) <= lx; x += 16) + { + uint8x16_t p1 = *(uint8x16_t *)&pix1x; + uint8x16_t p2 = *(uint8x16_t *)&pix2x; + vsum16_1 = vabal_u8(vsum16_1, vget_low_u8(p1), vget_low_u8(p2)); + vsum16_2 = vabal_high_u8(vsum16_2, p1, p2); + } + if (lx & 8) + { + uint8x8_t p1 = *(uint8x8_t *)&pix1x; + uint8x8_t p2 = *(uint8x8_t *)&pix2x; + vsum16_1 = vabal_u8(vsum16_1, p1, p2); + x += 8; + } + if (lx & 4) + { + uint32x2_t p1 = vdup_n_u32(0); + p10 = *(uint32_t *)&pix1x; + uint32x2_t p2 = vdup_n_u32(0); + p20 = *(uint32_t *)&pix2x; + vsum16_1 = vabal_u8(vsum16_1, p1, p2); + x += 4; + } + if (lx >= 16) + { + vsum16_1 = vaddq_u16(vsum16_1, vsum16_2); + } + if (lx >= 4) + { + sum += vaddvq_u16(vsum16_1); + } + +#endif + if (lx & 3) for (; x < lx; x++) + { + sum += abs(pix1x - pix2x); + } + + pix1 += stride_pix1; + pix2 += stride_pix2; + } + + return sum; +} + +template<int lx, int ly> +void sad_x3_neon(const pixel *pix1, const pixel *pix2, const pixel *pix3, const pixel *pix4, intptr_t frefstride, + int32_t *res) +{ + res0 = 0; + res1 = 0; + res2 = 0; + for (int y = 0; y < ly; y++) + { + int x = 0; + uint16x8_t vsum16_0 = vdupq_n_u16(0); + uint16x8_t vsum16_1 = vdupq_n_u16(0); + uint16x8_t vsum16_2 = vdupq_n_u16(0); +#if HIGH_BIT_DEPTH + for (; (x + 8) <= lx; x += 8) + { + uint16x8_t p1 = *(uint16x8_t *)&pix1x; + uint16x8_t p2 = *(uint16x8_t *)&pix2x; + uint16x8_t p3 = *(uint16x8_t *)&pix3x; + uint16x8_t p4 = *(uint16x8_t *)&pix4x; + vsum16_0 = vabaq_s16(vsum16_0, p1, p2); + vsum16_1 = vabaq_s16(vsum16_1, p1, p3); + vsum16_2 = vabaq_s16(vsum16_2, p1, p4); + + } + if (lx & 4) + { + uint16x4_t p1 = *(uint16x4_t *)&pix1x; + uint16x4_t p2 = *(uint16x4_t *)&pix2x; + uint16x4_t p3 = *(uint16x4_t *)&pix3x; + uint16x4_t p4 = *(uint16x4_t *)&pix4x; + res0 += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p2)); + res1 += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p3)); + res2 += vaddlv_s16(vaba_s16(vdup_n_s16(0), p1, p4)); + x += 4; + } + if (lx >= 4) + { + res0 += vaddlvq_s16(vsum16_0); + res1 += vaddlvq_s16(vsum16_1); + res2 += vaddlvq_s16(vsum16_2); + } +#else + + for (; (x + 16) <= lx; x += 16) + { + uint8x16_t p1 = *(uint8x16_t *)&pix1x; + 
uint8x16_t p2 = *(uint8x16_t *)&pix2x; + uint8x16_t p3 = *(uint8x16_t *)&pix3x; + uint8x16_t p4 = *(uint8x16_t *)&pix4x; + vsum16_0 = vabal_u8(vsum16_0, vget_low_u8(p1), vget_low_u8(p2)); + vsum16_0 = vabal_high_u8(vsum16_0, p1, p2); + vsum16_1 = vabal_u8(vsum16_1, vget_low_u8(p1), vget_low_u8(p3)); + vsum16_1 = vabal_high_u8(vsum16_1, p1, p3); + vsum16_2 = vabal_u8(vsum16_2, vget_low_u8(p1), vget_low_u8(p4)); + vsum16_2 = vabal_high_u8(vsum16_2, p1, p4); + } + if (lx & 8) + { + uint8x8_t p1 = *(uint8x8_t *)&pix1x; + uint8x8_t p2 = *(uint8x8_t *)&pix2x; + uint8x8_t p3 = *(uint8x8_t *)&pix3x; + uint8x8_t p4 = *(uint8x8_t *)&pix4x; + vsum16_0 = vabal_u8(vsum16_0, p1, p2); + vsum16_1 = vabal_u8(vsum16_1, p1, p3); + vsum16_2 = vabal_u8(vsum16_2, p1, p4); + x += 8; + } + if (lx & 4) + { + uint32x2_t p1 = vdup_n_u32(0); + p10 = *(uint32_t *)&pix1x; + uint32x2_t p2 = vdup_n_u32(0); + p20 = *(uint32_t *)&pix2x; + uint32x2_t p3 = vdup_n_u32(0); + p30 = *(uint32_t *)&pix3x; + uint32x2_t p4 = vdup_n_u32(0); + p40 = *(uint32_t *)&pix4x; + vsum16_0 = vabal_u8(vsum16_0, p1, p2); + vsum16_1 = vabal_u8(vsum16_1, p1, p3); + vsum16_2 = vabal_u8(vsum16_2, p1, p4); + x += 4; + } + if (lx >= 4) + { + res0 += vaddvq_u16(vsum16_0); + res1 += vaddvq_u16(vsum16_1); + res2 += vaddvq_u16(vsum16_2); + } + +#endif + if (lx & 3) for (; x < lx; x++) + { + res0 += abs(pix1x - pix2x); + res1 += abs(pix1x - pix3x); + res2 += abs(pix1x - pix4x); + } + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + } +} + +template<int lx, int ly> +void sad_x4_neon(const pixel *pix1, const pixel *pix2, const pixel *pix3, const pixel *pix4, const pixel *pix5, + intptr_t frefstride, int32_t *res) +{ + int32x4_t result = {0}; + for (int y = 0; y < ly; y++) + { + int x = 0; + uint16x8_t vsum16_0 = vdupq_n_u16(0); + uint16x8_t vsum16_1 = vdupq_n_u16(0); + uint16x8_t vsum16_2 = vdupq_n_u16(0); + uint16x8_t vsum16_3 = vdupq_n_u16(0); +#if HIGH_BIT_DEPTH + for (; (x + 16) <= lx; x += 16) + { + uint16x8x2_t p1 = vld1q_u16_x2(&pix1x); + uint16x8x2_t p2 = vld1q_u16_x2(&pix2x); + uint16x8x2_t p3 = vld1q_u16_x2(&pix3x); + uint16x8x2_t p4 = vld1q_u16_x2(&pix4x); + uint16x8x2_t p5 = vld1q_u16_x2(&pix5x); + vsum16_0 = vabaq_s16(vsum16_0, p1.val0, p2.val0); + vsum16_1 = vabaq_s16(vsum16_1, p1.val0, p3.val0); + vsum16_2 = vabaq_s16(vsum16_2, p1.val0, p4.val0); + vsum16_3 = vabaq_s16(vsum16_3, p1.val0, p5.val0); + vsum16_0 = vabaq_s16(vsum16_0, p1.val1, p2.val1); + vsum16_1 = vabaq_s16(vsum16_1, p1.val1, p3.val1); + vsum16_2 = vabaq_s16(vsum16_2, p1.val1, p4.val1); + vsum16_3 = vabaq_s16(vsum16_3, p1.val1, p5.val1); + } + if (lx & 8) + { + uint16x8_t p1 = *(uint16x8_t *)&pix1x; + uint16x8_t p2 = *(uint16x8_t *)&pix2x; + uint16x8_t p3 = *(uint16x8_t *)&pix3x; + uint16x8_t p4 = *(uint16x8_t *)&pix4x; + uint16x8_t p5 = *(uint16x8_t *)&pix5x; + vsum16_0 = vabaq_s16(vsum16_0, p1, p2); + vsum16_1 = vabaq_s16(vsum16_1, p1, p3); + vsum16_2 = vabaq_s16(vsum16_2, p1, p4); + vsum16_3 = vabaq_s16(vsum16_3, p1, p5); + x += 8; + } + if (lx & 4) + { + /* This is equivalent to getting the absolute difference of pix1x with each of + * pix2 - pix5, then summing across the vector (4 values each) and adding the + * result to result. 
*/ + uint16x8_t p1 = vreinterpretq_s16_u64( + vld1q_dup_u64((uint64_t *)&pix1x)); + uint16x8_t p2_3 = vcombine_s16(*(uint16x4_t *)&pix2x, *(uint16x4_t *)&pix3x); + uint16x8_t p4_5 = vcombine_s16(*(uint16x4_t *)&pix4x, *(uint16x4_t *)&pix5x); + + uint16x8_t a = vabdq_u16(p1, p2_3); + uint16x8_t b = vabdq_u16(p1, p4_5); + + result = vpadalq_s16(result, vpaddq_s16(a, b)); + x += 4; + } + if (lx >= 4) + { + /* This is equivalent to adding across each of the sum vectors and then adding + * to result. */ + uint16x8_t a = vpaddq_s16(vsum16_0, vsum16_1); + uint16x8_t b = vpaddq_s16(vsum16_2, vsum16_3); + uint16x8_t c = vpaddq_s16(a, b); + result = vpadalq_s16(result, c); + } + +#else + + for (; (x + 16) <= lx; x += 16) + { + uint8x16_t p1 = *(uint8x16_t *)&pix1x; + uint8x16_t p2 = *(uint8x16_t *)&pix2x; + uint8x16_t p3 = *(uint8x16_t *)&pix3x; + uint8x16_t p4 = *(uint8x16_t *)&pix4x; + uint8x16_t p5 = *(uint8x16_t *)&pix5x; + vsum16_0 = vabal_u8(vsum16_0, vget_low_u8(p1), vget_low_u8(p2)); + vsum16_0 = vabal_high_u8(vsum16_0, p1, p2); + vsum16_1 = vabal_u8(vsum16_1, vget_low_u8(p1), vget_low_u8(p3)); + vsum16_1 = vabal_high_u8(vsum16_1, p1, p3); + vsum16_2 = vabal_u8(vsum16_2, vget_low_u8(p1), vget_low_u8(p4)); + vsum16_2 = vabal_high_u8(vsum16_2, p1, p4); + vsum16_3 = vabal_u8(vsum16_3, vget_low_u8(p1), vget_low_u8(p5)); + vsum16_3 = vabal_high_u8(vsum16_3, p1, p5); + } + if (lx & 8) + { + uint8x8_t p1 = *(uint8x8_t *)&pix1x; + uint8x8_t p2 = *(uint8x8_t *)&pix2x; + uint8x8_t p3 = *(uint8x8_t *)&pix3x; + uint8x8_t p4 = *(uint8x8_t *)&pix4x; + uint8x8_t p5 = *(uint8x8_t *)&pix5x; + vsum16_0 = vabal_u8(vsum16_0, p1, p2); + vsum16_1 = vabal_u8(vsum16_1, p1, p3); + vsum16_2 = vabal_u8(vsum16_2, p1, p4); + vsum16_3 = vabal_u8(vsum16_3, p1, p5); + x += 8; + } + if (lx & 4) + { + uint8x16_t p1 = vreinterpretq_u32_u8( + vld1q_dup_u32((uint32_t *)&pix1x)); + + uint32x4_t p_x4; + p_x4 = vld1q_lane_u32((uint32_t *)&pix2x, p_x4, 0); + p_x4 = vld1q_lane_u32((uint32_t *)&pix3x, p_x4, 1); + p_x4 = vld1q_lane_u32((uint32_t *)&pix4x, p_x4, 2); + p_x4 = vld1q_lane_u32((uint32_t *)&pix5x, p_x4, 3); + + uint16x8_t sum = vabdl_u8(vget_low_u8(p1), vget_low_u8(p_x4)); + uint16x8_t sum2 = vabdl_high_u8(p1, p_x4); + + uint16x8_t a = vpaddq_u16(sum, sum2); + result = vpadalq_u16(result, a); + } + if (lx >= 4) + { + result0 += vaddvq_u16(vsum16_0); + result1 += vaddvq_u16(vsum16_1); + result2 += vaddvq_u16(vsum16_2); + result3 += vaddvq_u16(vsum16_3); + } + +#endif + if (lx & 3) for (; x < lx; x++) + { + result0 += abs(pix1x - pix2x); + result1 += abs(pix1x - pix3x); + result2 += abs(pix1x - pix4x); + result3 += abs(pix1x - pix5x); + } + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + pix5 += frefstride; + } + vst1q_s32(res, result); +} + + +template<int lx, int ly, class T1, class T2> +sse_t sse_neon(const T1 *pix1, intptr_t stride_pix1, const T2 *pix2, intptr_t stride_pix2) +{ + sse_t sum = 0; + + int32x4_t vsum1 = vdupq_n_s32(0); + int32x4_t vsum2 = vdupq_n_s32(0); + for (int y = 0; y < ly; y++) + { + int x = 0; + for (; (x + 8) <= lx; x += 8) + { + int16x8_t tmp; + if (sizeof(T1) == 2 && sizeof(T2) == 2) + { + tmp = vsubq_s16(*(int16x8_t *)&pix1x, *(int16x8_t *)&pix2x); + } + else if (sizeof(T1) == 1 && sizeof(T2) == 1) + { + tmp = vsubl_u8(*(uint8x8_t *)&pix1x, *(uint8x8_t *)&pix2x); + } + else + { + X265_CHECK(false, "unsupported sse"); + } + vsum1 = vmlal_s16(vsum1, vget_low_s16(tmp), vget_low_s16(tmp)); + vsum2 = vmlal_high_s16(vsum2, tmp, tmp); + } + for (; x < lx; x++) + 
{ + int tmp = pix1x - pix2x; + sum += (tmp * tmp); + } + + if (sizeof(T1) == 2 && sizeof(T2) == 2) + { + int32x4_t vsum = vaddq_u32(vsum1, vsum2);; + sum += vaddvq_u32(vsum); + vsum1 = vsum2 = vdupq_n_u16(0); + } + + pix1 += stride_pix1; + pix2 += stride_pix2; + } + int32x4_t vsum = vaddq_u32(vsum1, vsum2); + + return sum + vaddvq_u32(vsum); +} + + +template<int bx, int by> +void blockcopy_ps_neon(int16_t *a, intptr_t stridea, const pixel *b, intptr_t strideb) +{ + for (int y = 0; y < by; y++) + { + int x = 0; + for (; (x + 8) <= bx; x += 8) + { +#if HIGH_BIT_DEPTH + *(int16x8_t *)&ax = *(int16x8_t *)&bx; +#else + *(int16x8_t *)&ax = vmovl_u8(*(int8x8_t *)&bx); +#endif + } + for (; x < bx; x++) + { + ax = (int16_t)bx; + } + + a += stridea; + b += strideb; + } +} + + +template<int bx, int by> +void blockcopy_pp_neon(pixel *a, intptr_t stridea, const pixel *b, intptr_t strideb) +{ + for (int y = 0; y < by; y++) + { + int x = 0; +#if HIGH_BIT_DEPTH + for (; (x + 8) <= bx; x += 8) + { + *(int16x8_t *)&ax = *(int16x8_t *)&bx; + } + if (bx & 4) + { + *(uint64_t *)&ax = *(uint64_t *)&bx; + x += 4; + } +#else + for (; (x + 16) <= bx; x += 16) + { + *(uint8x16_t *)&ax = *(uint8x16_t *)&bx; + } + if (bx & 8) + { + *(uint8x8_t *)&ax = *(uint8x8_t *)&bx; + x += 8; + } + if (bx & 4) + { + *(uint32_t *)&ax = *(uint32_t *)&bx; + x += 4; + } +#endif + for (; x < bx; x++) + { + ax = bx; + } + + a += stridea; + b += strideb; + } +} + + +template<int bx, int by> +void pixel_sub_ps_neon(int16_t *a, intptr_t dstride, const pixel *b0, const pixel *b1, intptr_t sstride0, + intptr_t sstride1) +{ + for (int y = 0; y < by; y++) + { + int x = 0; + for (; (x + 8) <= bx; x += 8) + { +#if HIGH_BIT_DEPTH + *(int16x8_t *)&ax = vsubq_s16(*(int16x8_t *)&b0x, *(int16x8_t *)&b1x); +#else + *(int16x8_t *)&ax = vsubl_u8(*(uint8x8_t *)&b0x, *(uint8x8_t *)&b1x); +#endif + } + for (; x < bx; x++) + { + ax = (int16_t)(b0x - b1x); + } + + b0 += sstride0; + b1 += sstride1; + a += dstride; + } +} + +template<int bx, int by> +void pixel_add_ps_neon(pixel *a, intptr_t dstride, const pixel *b0, const int16_t *b1, intptr_t sstride0, + intptr_t sstride1) +{ + for (int y = 0; y < by; y++) + { + int x = 0; + for (; (x + 8) <= bx; x += 8) + { + int16x8_t t; + int16x8_t b1e = *(int16x8_t *)&b1x; + int16x8_t b0e; +#if HIGH_BIT_DEPTH + b0e = *(int16x8_t *)&b0x; + t = vaddq_s16(b0e, b1e); + t = vminq_s16(t, vdupq_n_s16((1 << X265_DEPTH) - 1)); + t = vmaxq_s16(t, vdupq_n_s16(0)); + *(int16x8_t *)&ax = t; +#else + b0e = vmovl_u8(*(uint8x8_t *)&b0x); + t = vaddq_s16(b0e, b1e); + *(uint8x8_t *)&ax = vqmovun_s16(t); +#endif + } + for (; x < bx; x++) + { + ax = (int16_t)x265_clip(b0x + b1x); + } + + b0 += sstride0; + b1 += sstride1; + a += dstride; + } +} + +template<int bx, int by> +void addAvg_neon(const int16_t *src0, const int16_t *src1, pixel *dst, intptr_t src0Stride, intptr_t src1Stride, + intptr_t dstStride) +{ + + const int shiftNum = IF_INTERNAL_PREC + 1 - X265_DEPTH; + const int offset = (1 << (shiftNum - 1)) + 2 * IF_INTERNAL_OFFS; + + const int32x4_t addon = vdupq_n_s32(offset); + for (int y = 0; y < by; y++) + { + int x = 0; + + for (; (x + 8) <= bx; x += 8) + { + int16x8_t in0 = *(int16x8_t *)&src0x; + int16x8_t in1 = *(int16x8_t *)&src1x; + int32x4_t t1 = vaddl_s16(vget_low_s16(in0), vget_low_s16(in1)); + int32x4_t t2 = vaddl_high_s16(in0, in1); + t1 = vaddq_s32(t1, addon); + t2 = vaddq_s32(t2, addon); + t1 = vshrq_n_s32(t1, shiftNum); + t2 = vshrq_n_s32(t2, shiftNum); + int16x8_t t = vuzp1q_s16(t1, t2); +#if HIGH_BIT_DEPTH + t = 
vminq_s16(t, vdupq_n_s16((1 << X265_DEPTH) - 1)); + t = vmaxq_s16(t, vdupq_n_s16(0)); + *(int16x8_t *)&dstx = t; +#else + *(uint8x8_t *)&dstx = vqmovun_s16(t); +#endif + } + for (; x < bx; x += 2) + { + dstx + 0 = x265_clip((src0x + 0 + src1x + 0 + offset) >> shiftNum); + dstx + 1 = x265_clip((src0x + 1 + src1x + 1 + offset) >> shiftNum); + } + + src0 += src0Stride; + src1 += src1Stride; + dst += dstStride; + } +} + +template<int lx, int ly> +void pixelavg_pp_neon(pixel *dst, intptr_t dstride, const pixel *src0, intptr_t sstride0, const pixel *src1, + intptr_t sstride1, int) +{ + for (int y = 0; y < ly; y++) + { + int x = 0; + for (; (x + 8) <= lx; x += 8) + { +#if HIGH_BIT_DEPTH + uint16x8_t in0 = *(uint16x8_t *)&src0x; + uint16x8_t in1 = *(uint16x8_t *)&src1x; + uint16x8_t t = vrhaddq_u16(in0, in1); + *(uint16x8_t *)&dstx = t; +#else + int16x8_t in0 = vmovl_u8(*(uint8x8_t *)&src0x); + int16x8_t in1 = vmovl_u8(*(uint8x8_t *)&src1x); + int16x8_t t = vrhaddq_s16(in0, in1); + *(uint8x8_t *)&dstx = vmovn_u16(t); +#endif + } + for (; x < lx; x++) + { + dstx = (src0x + src1x + 1) >> 1; + } + + src0 += sstride0; + src1 += sstride1; + dst += dstride; + } +} + + +template<int size> +void cpy1Dto2D_shl_neon(int16_t *dst, const int16_t *src, intptr_t dstStride, int shift) +{ + X265_CHECK((((intptr_t)dst | (dstStride * sizeof(*dst))) & 15) == 0 || size == 4, "dst alignment error\n"); + X265_CHECK(((intptr_t)src & 15) == 0, "src alignment error\n"); + X265_CHECK(shift >= 0, "invalid shift\n"); + + for (int i = 0; i < size; i++) + { + int j = 0; + for (; (j + 8) <= size; j += 8) + { + *(int16x8_t *)&dstj = vshlq_s16(*(int16x8_t *)&srcj, vdupq_n_s16(shift)); + } + for (; j < size; j++) + { + dstj = srcj << shift; + } + src += size; + dst += dstStride; + } +} + + +template<int size> +uint64_t pixel_var_neon(const uint8_t *pix, intptr_t i_stride) +{ + uint32_t sum = 0, sqr = 0; + + int32x4_t vsqr = vdupq_n_s32(0); + for (int y = 0; y < size; y++) + { + int x = 0; + int16x8_t vsum = vdupq_n_s16(0); + for (; (x + 8) <= size; x += 8) + { + int16x8_t in; + in = vmovl_u8(*(uint8x8_t *)&pixx); + vsum = vaddq_u16(vsum, in); + vsqr = vmlal_s16(vsqr, vget_low_s16(in), vget_low_s16(in)); + vsqr = vmlal_high_s16(vsqr, in, in); + } + for (; x < size; x++) + { + sum += pixx; + sqr += pixx * pixx; + } + sum += vaddvq_s16(vsum); + + pix += i_stride; + } + sqr += vaddvq_u32(vsqr); + return sum + ((uint64_t)sqr << 32); +} + +template<int blockSize> +void getResidual_neon(const pixel *fenc, const pixel *pred, int16_t *residual, intptr_t stride) +{ + for (int y = 0; y < blockSize; y++) + { + int x = 0; + for (; (x + 8) < blockSize; x += 8) + { + int16x8_t vfenc, vpred; +#if HIGH_BIT_DEPTH + vfenc = *(int16x8_t *)&fencx; + vpred = *(int16x8_t *)&predx; +#else + vfenc = vmovl_u8(*(uint8x8_t *)&fencx); + vpred = vmovl_u8(*(uint8x8_t *)&predx); +#endif + *(int16x8_t *)&residualx = vsubq_s16(vfenc, vpred); + } + for (; x < blockSize; x++) + { + residualx = static_cast<int16_t>(fencx) - static_cast<int16_t>(predx); + } + fenc += stride; + residual += stride; + pred += stride; + } +} + +template<int size> +int psyCost_pp_neon(const pixel *source, intptr_t sstride, const pixel *recon, intptr_t rstride) +{ + static pixel zeroBuf8 /* = { 0 } */; + + if (size) + { + int dim = 1 << (size + 2); + uint32_t totEnergy = 0; + for (int i = 0; i < dim; i += 8) + { + for (int j = 0; j < dim; j += 8) + { + /* AC energy, measured by sa8d (AC + DC) minus SAD (DC) */ + int sourceEnergy = pixel_sa8d_8x8_neon(source + i * sstride + j, sstride, 
zeroBuf, 0) - + (sad_pp_neon<8, 8>(source + i * sstride + j, sstride, zeroBuf, 0) >> 2); + int reconEnergy = pixel_sa8d_8x8_neon(recon + i * rstride + j, rstride, zeroBuf, 0) - + (sad_pp_neon<8, 8>(recon + i * rstride + j, rstride, zeroBuf, 0) >> 2); + + totEnergy += abs(sourceEnergy - reconEnergy); + } + } + return totEnergy; + } + else + { + /* 4x4 is too small for sa8d */ + int sourceEnergy = pixel_satd_4x4_neon(source, sstride, zeroBuf, 0) - (sad_pp_neon<4, 4>(source, sstride, zeroBuf, + 0) >> 2); + int reconEnergy = pixel_satd_4x4_neon(recon, rstride, zeroBuf, 0) - (sad_pp_neon<4, 4>(recon, rstride, zeroBuf, + 0) >> 2); + return abs(sourceEnergy - reconEnergy); + } +} + + +template<int w, int h> +// Calculate sa8d in blocks of 8x8 +int sa8d8(const pixel *pix1, intptr_t i_pix1, const pixel *pix2, intptr_t i_pix2) +{ + int cost = 0; + + for (int y = 0; y < h; y += 8) + for (int x = 0; x < w; x += 8) + { + cost += pixel_sa8d_8x8_neon(pix1 + i_pix1 * y + x, i_pix1, pix2 + i_pix2 * y + x, i_pix2); + } + + return cost; +} + +template<int w, int h> +// Calculate sa8d in blocks of 16x16 +int sa8d16(const pixel *pix1, intptr_t i_pix1, const pixel *pix2, intptr_t i_pix2) +{ + int cost = 0; + + for (int y = 0; y < h; y += 16) + for (int x = 0; x < w; x += 16) + { + cost += pixel_sa8d_16x16_neon(pix1 + i_pix1 * y + x, i_pix1, pix2 + i_pix2 * y + x, i_pix2); + } + + return cost; +} + +template<int size> +void cpy2Dto1D_shl_neon(int16_t *dst, const int16_t *src, intptr_t srcStride, int shift) +{ + X265_CHECK(((intptr_t)dst & 15) == 0, "dst alignment error\n"); + X265_CHECK((((intptr_t)src | (srcStride * sizeof(*src))) & 15) == 0 || size == 4, "src alignment error\n"); + X265_CHECK(shift >= 0, "invalid shift\n"); + + for (int i = 0; i < size; i++) + { + for (int j = 0; j < size; j++) + { + dstj = srcj << shift; + } + + src += srcStride; + dst += size; + } +} + + +template<int w, int h> +// calculate satd in blocks of 4x4 +int satd4_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2) +{ + int satd = 0; + + for (int row = 0; row < h; row += 4) + for (int col = 0; col < w; col += 4) + satd += pixel_satd_4x4_neon(pix1 + row * stride_pix1 + col, stride_pix1, + pix2 + row * stride_pix2 + col, stride_pix2); + + return satd; +} + +template<int w, int h> +// calculate satd in blocks of 8x4 +int satd8_neon(const pixel *pix1, intptr_t stride_pix1, const pixel *pix2, intptr_t stride_pix2) +{ + int satd = 0; + + if (((w | h) & 15) == 0) + { + for (int row = 0; row < h; row += 16) + for (int col = 0; col < w; col += 16) + satd += pixel_satd_16x16_neon(pix1 + row * stride_pix1 + col, stride_pix1, + pix2 + row * stride_pix2 + col, stride_pix2); + + } + else if (((w | h) & 7) == 0) + { + for (int row = 0; row < h; row += 8) + for (int col = 0; col < w; col += 8) + satd += pixel_satd_8x8_neon(pix1 + row * stride_pix1 + col, stride_pix1, + pix2 + row * stride_pix2 + col, stride_pix2); + + } + else + { + for (int row = 0; row < h; row += 4) + for (int col = 0; col < w; col += 8) + satd += pixel_satd_8x4_neon(pix1 + row * stride_pix1 + col, stride_pix1, + pix2 + row * stride_pix2 + col, stride_pix2); + } + + return satd; +} + + +template<int blockSize> +void transpose_neon(pixel *dst, const pixel *src, intptr_t stride) +{ + for (int k = 0; k < blockSize; k++) + for (int l = 0; l < blockSize; l++) + { + dstk * blockSize + l = srcl * stride + k; + } +} + + +template<> +void transpose_neon<8>(pixel *dst, const pixel *src, intptr_t stride) +{ + transpose8x8(dst, src, 8, stride); +} + 
+template<> +void transpose_neon<16>(pixel *dst, const pixel *src, intptr_t stride) +{ + transpose16x16(dst, src, 16, stride); +} + +template<> +void transpose_neon<32>(pixel *dst, const pixel *src, intptr_t stride) +{ + transpose32x32(dst, src, 32, stride); +} + + +template<> +void transpose_neon<64>(pixel *dst, const pixel *src, intptr_t stride) +{ + transpose32x32(dst, src, 64, stride); + transpose32x32(dst + 32 * 64 + 32, src + 32 * stride + 32, 64, stride); + transpose32x32(dst + 32 * 64, src + 32, 64, stride); + transpose32x32(dst + 32, src + 32 * stride, 64, stride); +} + + +template<int size> +sse_t pixel_ssd_s_neon(const int16_t *a, intptr_t dstride) +{ + sse_t sum = 0; + + + int32x4_t vsum = vdupq_n_s32(0); + + for (int y = 0; y < size; y++) + { + int x = 0; + + for (; (x + 8) <= size; x += 8) + { + int16x8_t in = *(int16x8_t *)&ax; + vsum = vmlal_s16(vsum, vget_low_s16(in), vget_low_s16(in)); + vsum = vmlal_high_s16(vsum, (in), (in)); + } + for (; x < size; x++) + { + sum += ax * ax; + } + + a += dstride; + } + return sum + vaddvq_s32(vsum); +} + + +}; + + + + +namespace X265_NS +{ + + +void setupPixelPrimitives_neon(EncoderPrimitives &p) +{ +#define LUMA_PU(W, H) \ + p.puLUMA_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.sad = sad_pp_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.sad_x3 = sad_x3_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.sad_x4 = sad_x4_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.pixelavg_ppNONALIGNED = pixelavg_pp_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.pixelavg_ppALIGNED = pixelavg_pp_neon<W, H>; + +#if !(HIGH_BIT_DEPTH) +#define LUMA_PU_S(W, H) \ + p.puLUMA_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; +#else // !(HIGH_BIT_DEPTH) +#define LUMA_PU_S(W, H) \ + p.puLUMA_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.sad_x3 = sad_x3_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.sad_x4 = sad_x4_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.pixelavg_ppNONALIGNED = pixelavg_pp_neon<W, H>; \ + p.puLUMA_ ## W ## x ## H.pixelavg_ppALIGNED = pixelavg_pp_neon<W, H>; +#endif // !(HIGH_BIT_DEPTH) + +#define LUMA_CU(W, H) \ + p.cuBLOCK_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>; \ + p.cuBLOCK_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \ + p.cuBLOCK_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>; \ + p.cuBLOCK_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.cuBLOCK_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \ + p.cuBLOCK_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.cuBLOCK_ ## W ## x ## H.cpy2Dto1D_shl = cpy2Dto1D_shl_neon<W>; \ + p.cuBLOCK_ ## W ## x ## H.cpy1Dto2D_shlNONALIGNED = cpy1Dto2D_shl_neon<W>; \ + p.cuBLOCK_ ## W ## x ## H.cpy1Dto2D_shlALIGNED = cpy1Dto2D_shl_neon<W>; \ + p.cuBLOCK_ ## W ## x ## H.psy_cost_pp = psyCost_pp_neon<BLOCK_ ## W ## x ## H>; \ + p.cuBLOCK_ ## W ## x ## H.transpose = transpose_neon<W>; + + + LUMA_PU_S(4, 4); + LUMA_PU_S(8, 8); + LUMA_PU(16, 16); + LUMA_PU(32, 32); + LUMA_PU(64, 64); + LUMA_PU_S(4, 8); + LUMA_PU_S(8, 4); + LUMA_PU(16, 8); + LUMA_PU_S(8, 16); + LUMA_PU(16, 12); + LUMA_PU(12, 16); + LUMA_PU(16, 4); + LUMA_PU_S(4, 16); + 
LUMA_PU(32, 16); + LUMA_PU(16, 32); + LUMA_PU(32, 24); + LUMA_PU(24, 32); + LUMA_PU(32, 8); + LUMA_PU_S(8, 32); + LUMA_PU(64, 32); + LUMA_PU(32, 64); + LUMA_PU(64, 48); + LUMA_PU(48, 64); + LUMA_PU(64, 16); + LUMA_PU(16, 64); + +#if defined(__APPLE__) + p.puLUMA_4x4.sad = sad_pp_neon<4, 4>; + p.puLUMA_4x8.sad = sad_pp_neon<4, 8>; + p.puLUMA_4x16.sad = sad_pp_neon<4, 16>; +#endif // defined(__APPLE__) + p.puLUMA_8x4.sad = sad_pp_neon<8, 4>; + p.puLUMA_8x8.sad = sad_pp_neon<8, 8>; + p.puLUMA_8x16.sad = sad_pp_neon<8, 16>; + p.puLUMA_8x32.sad = sad_pp_neon<8, 32>; + +#if !(HIGH_BIT_DEPTH) + p.puLUMA_4x4.sad_x3 = sad_x3_neon<4, 4>; + p.puLUMA_4x4.sad_x4 = sad_x4_neon<4, 4>; + p.puLUMA_4x8.sad_x3 = sad_x3_neon<4, 8>; + p.puLUMA_4x8.sad_x4 = sad_x4_neon<4, 8>; + p.puLUMA_4x16.sad_x3 = sad_x3_neon<4, 16>; + p.puLUMA_4x16.sad_x4 = sad_x4_neon<4, 16>; +#endif // !(HIGH_BIT_DEPTH) + + p.puLUMA_4x4.satd = pixel_satd_4x4_neon; + p.puLUMA_8x4.satd = pixel_satd_8x4_neon; + + p.puLUMA_8x8.satd = satd8_neon<8, 8>; + p.puLUMA_16x16.satd = satd8_neon<16, 16>; + p.puLUMA_16x8.satd = satd8_neon<16, 8>; + p.puLUMA_8x16.satd = satd8_neon<8, 16>; + p.puLUMA_16x12.satd = satd8_neon<16, 12>; + p.puLUMA_16x4.satd = satd8_neon<16, 4>; + p.puLUMA_32x32.satd = satd8_neon<32, 32>; + p.puLUMA_32x16.satd = satd8_neon<32, 16>; + p.puLUMA_16x32.satd = satd8_neon<16, 32>; + p.puLUMA_32x24.satd = satd8_neon<32, 24>; + p.puLUMA_24x32.satd = satd8_neon<24, 32>; + p.puLUMA_32x8.satd = satd8_neon<32, 8>; + p.puLUMA_8x32.satd = satd8_neon<8, 32>; + p.puLUMA_64x64.satd = satd8_neon<64, 64>; + p.puLUMA_64x32.satd = satd8_neon<64, 32>; + p.puLUMA_32x64.satd = satd8_neon<32, 64>; + p.puLUMA_64x48.satd = satd8_neon<64, 48>; + p.puLUMA_48x64.satd = satd8_neon<48, 64>; + p.puLUMA_64x16.satd = satd8_neon<64, 16>; + p.puLUMA_16x64.satd = satd8_neon<16, 64>; + +#if HIGH_BIT_DEPTH + p.puLUMA_4x8.satd = satd4_neon<4, 8>; + p.puLUMA_4x16.satd = satd4_neon<4, 16>; +#endif // HIGH_BIT_DEPTH + +#if !defined(__APPLE__) || HIGH_BIT_DEPTH + p.puLUMA_12x16.satd = satd4_neon<12, 16>; +#endif // !defined(__APPLE__) + + + LUMA_CU(4, 4); + LUMA_CU(8, 8); + LUMA_CU(16, 16); + LUMA_CU(32, 32); + LUMA_CU(64, 64); + +#if !(HIGH_BIT_DEPTH) + p.cuBLOCK_8x8.var = pixel_var_neon<8>; + p.cuBLOCK_16x16.var = pixel_var_neon<16>; +#if defined(__APPLE__) + p.cuBLOCK_32x32.var = pixel_var_neon<32>; + p.cuBLOCK_64x64.var = pixel_var_neon<64>; +#endif // defined(__APPLE__) +#endif // !(HIGH_BIT_DEPTH) + + p.cuBLOCK_16x16.blockfill_sNONALIGNED = blockfill_s_neon<16>; + p.cuBLOCK_16x16.blockfill_sALIGNED = blockfill_s_neon<16>; + p.cuBLOCK_32x32.blockfill_sNONALIGNED = blockfill_s_neon<32>; + p.cuBLOCK_32x32.blockfill_sALIGNED = blockfill_s_neon<32>; + p.cuBLOCK_64x64.blockfill_sNONALIGNED = blockfill_s_neon<64>; + p.cuBLOCK_64x64.blockfill_sALIGNED = blockfill_s_neon<64>; + + + p.cuBLOCK_4x4.calcresidualNONALIGNED = getResidual_neon<4>; + p.cuBLOCK_4x4.calcresidualALIGNED = getResidual_neon<4>; + p.cuBLOCK_8x8.calcresidualNONALIGNED = getResidual_neon<8>; + p.cuBLOCK_8x8.calcresidualALIGNED = getResidual_neon<8>; + p.cuBLOCK_16x16.calcresidualNONALIGNED = getResidual_neon<16>; + p.cuBLOCK_16x16.calcresidualALIGNED = getResidual_neon<16>; + +#if defined(__APPLE__) + p.cuBLOCK_32x32.calcresidualNONALIGNED = getResidual_neon<32>; + p.cuBLOCK_32x32.calcresidualALIGNED = getResidual_neon<32>; +#endif // defined(__APPLE__) + + p.cuBLOCK_4x4.sa8d = pixel_satd_4x4_neon; + p.cuBLOCK_8x8.sa8d = pixel_sa8d_8x8_neon; + p.cuBLOCK_16x16.sa8d = pixel_sa8d_16x16_neon; + 
p.cuBLOCK_32x32.sa8d = sa8d16<32, 32>; + p.cuBLOCK_64x64.sa8d = sa8d16<64, 64>; + + +#define CHROMA_PU_420(W, H) \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; \ + p.chromaX265_CSP_I420.puCHROMA_420_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + + + CHROMA_PU_420(4, 4); + CHROMA_PU_420(8, 8); + CHROMA_PU_420(16, 16); + CHROMA_PU_420(32, 32); + CHROMA_PU_420(4, 2); + CHROMA_PU_420(8, 4); + CHROMA_PU_420(4, 8); + CHROMA_PU_420(8, 6); + CHROMA_PU_420(6, 8); + CHROMA_PU_420(8, 2); + CHROMA_PU_420(2, 8); + CHROMA_PU_420(16, 8); + CHROMA_PU_420(8, 16); + CHROMA_PU_420(16, 12); + CHROMA_PU_420(12, 16); + CHROMA_PU_420(16, 4); + CHROMA_PU_420(4, 16); + CHROMA_PU_420(32, 16); + CHROMA_PU_420(16, 32); + CHROMA_PU_420(32, 24); + CHROMA_PU_420(24, 32); + CHROMA_PU_420(32, 8); + CHROMA_PU_420(8, 32); + + + + p.chromaX265_CSP_I420.puCHROMA_420_2x2.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd = pixel_satd_4x4_neon; + p.chromaX265_CSP_I420.puCHROMA_420_8x8.satd = satd8_neon<8, 8>; + p.chromaX265_CSP_I420.puCHROMA_420_16x16.satd = satd8_neon<16, 16>; + p.chromaX265_CSP_I420.puCHROMA_420_32x32.satd = satd8_neon<32, 32>; + + p.chromaX265_CSP_I420.puCHROMA_420_4x2.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_2x4.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_8x4.satd = pixel_satd_8x4_neon; + p.chromaX265_CSP_I420.puCHROMA_420_16x8.satd = satd8_neon<16, 8>; + p.chromaX265_CSP_I420.puCHROMA_420_8x16.satd = satd8_neon<8, 16>; + p.chromaX265_CSP_I420.puCHROMA_420_32x16.satd = satd8_neon<32, 16>; + p.chromaX265_CSP_I420.puCHROMA_420_16x32.satd = satd8_neon<16, 32>; + + p.chromaX265_CSP_I420.puCHROMA_420_8x6.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_6x8.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_8x2.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_2x8.satd = NULL; + p.chromaX265_CSP_I420.puCHROMA_420_16x12.satd = satd4_neon<16, 12>; + p.chromaX265_CSP_I420.puCHROMA_420_16x4.satd = satd4_neon<16, 4>; + p.chromaX265_CSP_I420.puCHROMA_420_32x24.satd = satd8_neon<32, 24>; + p.chromaX265_CSP_I420.puCHROMA_420_24x32.satd = satd8_neon<24, 32>; + p.chromaX265_CSP_I420.puCHROMA_420_32x8.satd = satd8_neon<32, 8>; + p.chromaX265_CSP_I420.puCHROMA_420_8x32.satd = satd8_neon<8, 32>; + +#if HIGH_BIT_DEPTH + p.chromaX265_CSP_I420.puCHROMA_420_4x8.satd = satd4_neon<4, 8>; + p.chromaX265_CSP_I420.puCHROMA_420_4x16.satd = satd4_neon<4, 16>; +#endif // HIGH_BIT_DEPTH + +#if !defined(__APPLE__) || HIGH_BIT_DEPTH + p.chromaX265_CSP_I420.puCHROMA_420_12x16.satd = satd4_neon<12, 16>; +#endif // !defined(__APPLE__) + + +#define CHROMA_CU_420(W, H) \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.sse_pp = sse_neon<W, H, pixel, pixel>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>; + +#define CHROMA_CU_S_420(W, H) \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.sub_ps = 
pixel_sub_ps_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \ + p.chromaX265_CSP_I420.cuBLOCK_420_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>; + + + CHROMA_CU_S_420(4, 4) + CHROMA_CU_420(8, 8) + CHROMA_CU_420(16, 16) + CHROMA_CU_420(32, 32) + + + p.chromaX265_CSP_I420.cuBLOCK_8x8.sa8d = p.chromaX265_CSP_I420.puCHROMA_420_4x4.satd; + p.chromaX265_CSP_I420.cuBLOCK_16x16.sa8d = sa8d8<8, 8>; + p.chromaX265_CSP_I420.cuBLOCK_32x32.sa8d = sa8d16<16, 16>; + p.chromaX265_CSP_I420.cuBLOCK_64x64.sa8d = sa8d16<32, 32>; + + +#define CHROMA_PU_422(W, H) \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.addAvgNONALIGNED = addAvg_neon<W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.addAvgALIGNED = addAvg_neon<W, H>; \ + p.chromaX265_CSP_I422.puCHROMA_422_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + + + CHROMA_PU_422(4, 8); + CHROMA_PU_422(8, 16); + CHROMA_PU_422(16, 32); + CHROMA_PU_422(32, 64); + CHROMA_PU_422(4, 4); + CHROMA_PU_422(2, 8); + CHROMA_PU_422(8, 8); + CHROMA_PU_422(4, 16); + CHROMA_PU_422(8, 12); + CHROMA_PU_422(6, 16); + CHROMA_PU_422(8, 4); + CHROMA_PU_422(2, 16); + CHROMA_PU_422(16, 16); + CHROMA_PU_422(8, 32); + CHROMA_PU_422(16, 24); + CHROMA_PU_422(12, 32); + CHROMA_PU_422(16, 8); + CHROMA_PU_422(4, 32); + CHROMA_PU_422(32, 32); + CHROMA_PU_422(16, 64); + CHROMA_PU_422(32, 48); + CHROMA_PU_422(24, 64); + CHROMA_PU_422(32, 16); + CHROMA_PU_422(8, 64); + + + p.chromaX265_CSP_I422.puCHROMA_422_2x4.satd = NULL; + p.chromaX265_CSP_I422.puCHROMA_422_8x16.satd = satd8_neon<8, 16>; + p.chromaX265_CSP_I422.puCHROMA_422_16x32.satd = satd8_neon<16, 32>; + p.chromaX265_CSP_I422.puCHROMA_422_32x64.satd = satd8_neon<32, 64>; + p.chromaX265_CSP_I422.puCHROMA_422_4x4.satd = pixel_satd_4x4_neon; + p.chromaX265_CSP_I422.puCHROMA_422_2x8.satd = NULL; + p.chromaX265_CSP_I422.puCHROMA_422_8x8.satd = satd8_neon<8, 8>; + p.chromaX265_CSP_I422.puCHROMA_422_16x16.satd = satd8_neon<16, 16>; + p.chromaX265_CSP_I422.puCHROMA_422_8x32.satd = satd8_neon<8, 32>; + p.chromaX265_CSP_I422.puCHROMA_422_32x32.satd = satd8_neon<32, 32>; + p.chromaX265_CSP_I422.puCHROMA_422_16x64.satd = satd8_neon<16, 64>; + p.chromaX265_CSP_I422.puCHROMA_422_6x16.satd = NULL; + p.chromaX265_CSP_I422.puCHROMA_422_8x4.satd = satd4_neon<8, 4>; + p.chromaX265_CSP_I422.puCHROMA_422_2x16.satd = NULL; + p.chromaX265_CSP_I422.puCHROMA_422_16x8.satd = satd8_neon<16, 8>; + p.chromaX265_CSP_I422.puCHROMA_422_32x16.satd = satd8_neon<32, 16>; + + p.chromaX265_CSP_I422.puCHROMA_422_8x12.satd = satd4_neon<8, 12>; + p.chromaX265_CSP_I422.puCHROMA_422_8x64.satd = satd8_neon<8, 64>; + p.chromaX265_CSP_I422.puCHROMA_422_12x32.satd = satd4_neon<12, 32>; + p.chromaX265_CSP_I422.puCHROMA_422_16x24.satd = satd8_neon<16, 24>; + p.chromaX265_CSP_I422.puCHROMA_422_24x64.satd = satd8_neon<24, 64>; + p.chromaX265_CSP_I422.puCHROMA_422_32x48.satd = satd8_neon<32, 48>; + +#if HIGH_BIT_DEPTH + p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd = satd4_neon<4, 8>; + p.chromaX265_CSP_I422.puCHROMA_422_4x16.satd = satd4_neon<4, 16>; + p.chromaX265_CSP_I422.puCHROMA_422_4x32.satd = satd4_neon<4, 32>; +#endif // HIGH_BIT_DEPTH + + +#define CHROMA_CU_422(W, H) \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.sse_pp = sse_neon<W, H, pixel, pixel>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## 
H.sub_ps = pixel_sub_ps_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>; + +#define CHROMA_CU_S_422(W, H) \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_pp = blockcopy_pp_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.copy_ps = blockcopy_ps_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.sub_ps = pixel_sub_ps_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psNONALIGNED = pixel_add_ps_neon<W, H>; \ + p.chromaX265_CSP_I422.cuBLOCK_422_ ## W ## x ## H.add_psALIGNED = pixel_add_ps_neon<W, H>; + + + CHROMA_CU_S_422(4, 8) + CHROMA_CU_422(8, 16) + CHROMA_CU_422(16, 32) + CHROMA_CU_422(32, 64) + + p.chromaX265_CSP_I422.cuBLOCK_8x8.sa8d = p.chromaX265_CSP_I422.puCHROMA_422_4x8.satd; + p.chromaX265_CSP_I422.cuBLOCK_16x16.sa8d = sa8d8<8, 16>; + p.chromaX265_CSP_I422.cuBLOCK_32x32.sa8d = sa8d16<16, 32>; + p.chromaX265_CSP_I422.cuBLOCK_64x64.sa8d = sa8d16<32, 64>; + + +} + + +} + + +#endif +
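The functions above all compute the block-comparison metrics x265 uses for mode decision (SATD, SA8D, SAD, SSE, variance), vectorized with NEON intrinsics. As a reading aid only, here is a minimal scalar sketch of the 4x4 SATD that pixel_satd_4x4_neon accelerates: form the residual block, apply an unnormalized 4-point Hadamard transform along rows and then columns, and sum the absolute coefficients. The name satd_4x4_ref and the final >>1 normalization follow the common x264/x265 convention and are assumptions of this sketch, not part of the patch.

// Scalar reference sketch for 4x4 SATD (assumed semantics; illustrative only).
#include <cstdint>
#include <cstdlib>

static int satd_4x4_ref(const uint8_t *pix1, intptr_t stride1,
                        const uint8_t *pix2, intptr_t stride2)
{
    int d[4][4], t[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = pix1[i * stride1 + j] - pix2[i * stride2 + j];

    // Horizontal 4-point Hadamard on each row of the residual.
    for (int i = 0; i < 4; i++)
    {
        int a0 = d[i][0] + d[i][1], a1 = d[i][0] - d[i][1];
        int a2 = d[i][2] + d[i][3], a3 = d[i][2] - d[i][3];
        t[i][0] = a0 + a2; t[i][1] = a1 + a3;
        t[i][2] = a0 - a2; t[i][3] = a1 - a3;
    }

    // Vertical pass, then sum of absolute transformed coefficients.
    int sum = 0;
    for (int j = 0; j < 4; j++)
    {
        int a0 = t[0][j] + t[1][j], a1 = t[0][j] - t[1][j];
        int a2 = t[2][j] + t[3][j], a3 = t[2][j] - t[3][j];
        sum += abs(a0 + a2) + abs(a1 + a3) + abs(a0 - a2) + abs(a1 - a3);
    }
    return sum >> 1; // conventional x264/x265 halving (assumed here)
}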
View file
x265_3.6.tar.gz/source/common/aarch64/pixel-prim.h
Added
@@ -0,0 +1,23 @@
+#ifndef PIXEL_PRIM_NEON_H__
+#define PIXEL_PRIM_NEON_H__
+
+#include "common.h"
+#include "slicetype.h" // LOWRES_COST_MASK
+#include "primitives.h"
+#include "x265.h"
+
+
+
+namespace X265_NS
+{
+
+
+
+void setupPixelPrimitives_neon(EncoderPrimitives &p);
+
+
+}
+
+
+#endif
+
View file
x265_3.6.tar.gz/source/common/aarch64/pixel-util-common.S
Added
@@ -0,0 +1,84 @@
+/*****************************************************************************
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
+ *
+ * Authors: David Chen <david.chen@myais.com.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+// This file contains the macros written using NEON instruction set
+// that are also used by the SVE2 functions
+
+.arch armv8-a
+
+#ifdef __APPLE__
+.section __RODATA,__rodata
+#else
+.section .rodata
+#endif
+
+.align 4
+
+.macro pixel_var_start
+    movi            v0.16b, #0
+    movi            v1.16b, #0
+    movi            v2.16b, #0
+    movi            v3.16b, #0
+.endm
+
+.macro pixel_var_1 v
+    uaddw           v0.8h, v0.8h, \v\().8b
+    umull           v30.8h, \v\().8b, \v\().8b
+    uaddw2          v1.8h, v1.8h, \v\().16b
+    umull2          v31.8h, \v\().16b, \v\().16b
+    uadalp          v2.4s, v30.8h
+    uadalp          v3.4s, v31.8h
+.endm
+
+.macro pixel_var_end
+    uaddlv          s0, v0.8h
+    uaddlv          s1, v1.8h
+    add             v2.4s, v2.4s, v3.4s
+    fadd            s0, s0, s1
+    uaddlv          d2, v2.4s
+    fmov            w0, s0
+    fmov            x2, d2
+    orr             x0, x0, x2, lsl #32
+.endm
+
+.macro ssimDist_start
+    movi            v0.16b, #0
+    movi            v1.16b, #0
+.endm
+
+.macro ssimDist_end
+    uaddlv          d0, v0.4s
+    uaddlv          d1, v1.4s
+    str             d0, [x6]
+    str             d1, [x4]
+.endm
+
+.macro normFact_start
+    movi            v0.16b, #0
+.endm
+
+.macro normFact_end
+    uaddlv          d0, v0.4s
+    str             d0, [x3]
+.endm
+
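The pixel_var_* macros above keep two accumulators per block, the sum of the pixels and the sum of their squares, and the closing `orr x0, x0, x2, lsl #32` packs both into the single 64-bit return value (sum in the low 32 bits, sum of squares in the high 32 bits), matching `return sum + ((uint64_t)sqr << 32)` in pixel_var_neon from pixel-prim.cpp above. A minimal scalar sketch of that contract, with an illustrative name:

// Scalar sketch of the variance primitive's return-value packing (illustrative).
#include <cstdint>

static uint64_t pixel_var_ref(const uint8_t *pix, intptr_t stride, int size)
{
    uint32_t sum = 0, sqr = 0;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
            sum += pix[x];
            sqr += pix[x] * pix[x];
        }
        pix += stride;
    }
    // Low 32 bits: pixel sum; high 32 bits: sum of squared pixels,
    // as produced by pixel_var_end / pixel_var_neon.
    return sum + ((uint64_t)sqr << 32);
}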
View file
x265_3.6.tar.gz/source/common/aarch64/pixel-util-sve.S
Added
@@ -0,0 +1,373 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "pixel-util-common.S" + +.arch armv8-a+sve + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +function PFX(pixel_sub_ps_8x16_sve) + lsl x1, x1, #1 + ptrue p0.h, vl8 +.rept 8 + ld1b {z0.h}, p0/z, x2 + ld1b {z1.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + ld1b {z2.h}, p0/z, x2 + ld1b {z3.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + sub z4.h, z0.h, z1.h + sub z5.h, z2.h, z3.h + st1 {v4.8h}, x0, x1 + st1 {v5.8h}, x0, x1 +.endr + ret +endfunc + +//******* satd ******* +.macro satd_4x4_sve + ld1b {z0.h}, p0/z, x0 + ld1b {z2.h}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + ld1b {z1.h}, p0/z, x0 + ld1b {z3.h}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + ld1b {z4.h}, p0/z, x0 + ld1b {z6.h}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + ld1b {z5.h}, p0/z, x0 + ld1b {z7.h}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + + sub z0.h, z0.h, z2.h + sub z1.h, z1.h, z3.h + sub z2.h, z4.h, z6.h + sub z3.h, z5.h, z7.h + + add z4.h, z0.h, z2.h + add z5.h, z1.h, z3.h + sub z6.h, z0.h, z2.h + sub z7.h, z1.h, z3.h + + add z0.h, z4.h, z5.h + sub z1.h, z4.h, z5.h + + add z2.h, z6.h, z7.h + sub z3.h, z6.h, z7.h + + trn1 z4.h, z0.h, z2.h + trn2 z5.h, z0.h, z2.h + + trn1 z6.h, z1.h, z3.h + trn2 z7.h, z1.h, z3.h + + add z0.h, z4.h, z5.h + sub z1.h, z4.h, z5.h + + add z2.h, z6.h, z7.h + sub z3.h, z6.h, z7.h + + trn1 z4.s, z0.s, z1.s + trn2 z5.s, z0.s, z1.s + + trn1 z6.s, z2.s, z3.s + trn2 z7.s, z2.s, z3.s + + abs z4.h, p0/m, z4.h + abs z5.h, p0/m, z5.h + abs z6.h, p0/m, z6.h + abs z7.h, p0/m, z7.h + + smax z4.h, p0/m, z4.h, z5.h + smax z6.h, p0/m, z6.h, z7.h + + add z0.h, z4.h, z6.h + + uaddlp v0.2s, v0.4h + uaddlp v0.1d, v0.2s +.endm + +// int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +function PFX(pixel_satd_4x4_sve) + ptrue p0.h, vl4 + satd_4x4_sve + fmov x0, d0 + ret +endfunc + +function PFX(pixel_satd_8x4_sve) + ptrue p0.h, vl4 + mov x4, x0 + mov x5, x2 + satd_4x4_sve + add x0, x4, #4 + add x2, x5, #4 + umov x6, v0.d0 + satd_4x4_sve + umov x0, v0.d0 + add x0, x0, x6 + ret +endfunc + +function PFX(pixel_satd_8x12_sve) + ptrue p0.h, vl4 + mov x4, x0 + mov x5, x2 + mov x7, #0 + satd_4x4_sve + umov x6, v0.d0 + add x7, x7, x6 + add x0, x4, #4 + add x2, x5, #4 + satd_4x4_sve + umov x6, v0.d0 + add x7, x7, x6 +.rept 2 + sub x0, x0, #4 + sub x2, x2, #4 + 
mov x4, x0 + mov x5, x2 + satd_4x4_sve + umov x6, v0.d0 + add x7, x7, x6 + add x0, x4, #4 + add x2, x5, #4 + satd_4x4_sve + umov x6, v0.d0 + add x7, x7, x6 +.endr + mov x0, x7 + ret +endfunc + +.macro LOAD_DIFF_16x4_sve v0 v1 v2 v3 v4 v5 v6 v7 + mov x11, #8 // in order to consider CPUs whose vector size is greater than 128 bits + ld1b {z0.h}, p0/z, x0 + ld1b {z1.h}, p0/z, x0, x11 + ld1b {z2.h}, p0/z, x2 + ld1b {z3.h}, p0/z, x2, x11 + add x0, x0, x1 + add x2, x2, x3 + ld1b {z4.h}, p0/z, x0 + ld1b {z5.h}, p0/z, x0, x11 + ld1b {z6.h}, p0/z, x2 + ld1b {z7.h}, p0/z, x2, x11 + add x0, x0, x1 + add x2, x2, x3 + ld1b {z29.h}, p0/z, x0 + ld1b {z9.h}, p0/z, x0, x11 + ld1b {z10.h}, p0/z, x2 + ld1b {z11.h}, p0/z, x2, x11 + add x0, x0, x1 + add x2, x2, x3 + ld1b {z12.h}, p0/z, x0 + ld1b {z13.h}, p0/z, x0, x11 + ld1b {z14.h}, p0/z, x2 + ld1b {z15.h}, p0/z, x2, x11 + add x0, x0, x1 + add x2, x2, x3 + + sub \v0\().h, z0.h, z2.h + sub \v4\().h, z1.h, z3.h + sub \v1\().h, z4.h, z6.h + sub \v5\().h, z5.h, z7.h + sub \v2\().h, z29.h, z10.h + sub \v6\().h, z9.h, z11.h + sub \v3\().h, z12.h, z14.h + sub \v7\().h, z13.h, z15.h +.endm + +// one vertical hadamard pass and two horizontal +function PFX(satd_8x4v_8x8h_sve), export=0 + HADAMARD4_V z16.h, z18.h, z17.h, z19.h, z0.h, z2.h, z1.h, z3.h + HADAMARD4_V z20.h, z21.h, z22.h, z23.h, z0.h, z1.h, z2.h, z3.h + trn4 z0.h, z1.h, z2.h, z3.h, z16.h, z17.h, z18.h, z19.h + trn4 z4.h, z5.h, z6.h, z7.h, z20.h, z21.h, z22.h, z23.h + SUMSUB_ABCD z16.h, z17.h, z18.h, z19.h, z0.h, z1.h, z2.h, z3.h + SUMSUB_ABCD z20.h, z21.h, z22.h, z23.h, z4.h, z5.h, z6.h, z7.h + trn4 z0.s, z2.s, z1.s, z3.s, z16.s, z18.s, z17.s, z19.s + trn4 z4.s, z6.s, z5.s, z7.s, z20.s, z22.s, z21.s, z23.s + ABS8_SVE z0.h, z1.h, z2.h, z3.h, z4.h, z5.h, z6.h, z7.h, p0 + smax z0.h, p0/m, z0.h, z2.h + smax z1.h, p0/m, z1.h, z3.h + smax z4.h, p0/m, z4.h, z6.h + smax z5.h, p0/m, z5.h, z7.h + ret +endfunc + +function PFX(satd_16x4_sve), export=0 + LOAD_DIFF_16x4_sve z16, z17, z18, z19, z20, z21, z22, z23 + b PFX(satd_8x4v_8x8h_sve) +endfunc + +.macro pixel_satd_32x8_sve + mov x4, x0 + mov x5, x2 +.rept 2 + bl PFX(satd_16x4_sve) + add z30.h, z30.h, z0.h + add z31.h, z31.h, z1.h + add z30.h, z30.h, z4.h + add z31.h, z31.h, z5.h +.endr + add x0, x4, #16 + add x2, x5, #16 +.rept 2 + bl PFX(satd_16x4_sve) + add z30.h, z30.h, z0.h + add z31.h, z31.h, z1.h + add z30.h, z30.h, z4.h + add z31.h, z31.h, z5.h +.endr +.endm + +.macro satd_32x16_sve + movi v30.2d, #0 + movi v31.2d, #0 + pixel_satd_32x8_sve + sub x0, x0, #16 + sub x2, x2, #16 + pixel_satd_32x8_sve + add z0.h, z30.h, z31.h + uaddlv s0, v0.8h + mov w6, v0.s0 +.endm + +function PFX(pixel_satd_32x16_sve) + ptrue p0.h, vl8 + mov x10, x30 + satd_32x16_sve + mov x0, x6 + ret x10 +endfunc + +function PFX(pixel_satd_32x32_sve) + ptrue p0.h, vl8 + mov x10, x30 + mov x7, #0 + satd_32x16_sve + sub x0, x0, #16 + sub x2, x2, #16 + add x7, x7, x6 + satd_32x16_sve + add x0, x7, x6 + ret x10 +endfunc + +.macro satd_64x16_sve + mov x8, x0 + mov x9, x2 + satd_32x16_sve + add x7, x7, x6 + add x0, x8, #32 + add x2, x9, #32 + satd_32x16_sve + add x7, x7, x6 +.endm + +function PFX(pixel_satd_64x48_sve) + ptrue p0.h, vl8 + mov x10, x30 + mov x7, #0 +.rept 2 + satd_64x16_sve + sub x0, x0, #48 + sub x2, x2, #48 +.endr + satd_64x16_sve + mov x0, x7 + ret x10 +endfunc + +/********* ssim ***********/ +// uint32_t quant_c(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff) +// No need to fully use sve instructions for this 
function +function PFX(quant_sve) + mov w9, #1 + lsl w9, w9, w4 + mov z0.s, w9 + neg w9, w4 + mov z1.s, w9 + add w9, w9, #8 + mov z2.s, w9 + mov z3.s, w5 + + lsr w6, w6, #2 + eor z4.d, z4.d, z4.d + eor w10, w10, w10 + eor z17.d, z17.d, z17.d + +.loop_quant_sve: + ld1 {v18.4h}, x0, #8 + ld1 {v7.4s}, x1, #16 + sxtl v6.4s, v18.4h + + cmlt v5.4s, v6.4s, #0 + + abs v6.4s, v6.4s + + + mul v6.4s, v6.4s, v7.4s + + add v7.4s, v6.4s, v3.4s + sshl v7.4s, v7.4s, v1.4s + + mls v6.4s, v7.4s, v0.s0 + sshl v16.4s, v6.4s, v2.4s + st1 {v16.4s}, x2, #16 + + // numsig + cmeq v16.4s, v7.4s, v17.4s + add v4.4s, v4.4s, v16.4s + add w10, w10, #4 + + // level *= sign + eor z16.d, z7.d, z5.d + sub v16.4s, v16.4s, v5.4s + sqxtn v5.4h, v16.4s + st1 {v5.4h}, x3, #8 + + subs w6, w6, #1 + b.ne .loop_quant_sve + + addv s4, v4.4s + mov w9, v4.s0 + add w0, w10, w9 + ret +endfunc
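PFX(quant_sve) above implements the quantization loop whose C signature is quoted in the comment preceding it. Read back into scalar form: each coefficient is multiplied by its per-position quantizer, rounded with `add` and shifted down by `qBits`, the rounding residue is written to deltaU (shifted by qBits - 8, as the `mls`/`sshl` pair does), the sign is restored, and the count of non-zero levels is returned. A hedged scalar sketch follows; the name quant_ref is illustrative, and the 32-bit arithmetic mirrors the vector code rather than any wider reference implementation.

// Scalar sketch of the quantization loop vectorized by quant_sve (assumed semantics).
#include <cstdint>
#include <cstdlib>

static uint32_t quant_ref(const int16_t *coef, const int32_t *quantCoeff,
                          int32_t *deltaU, int16_t *qCoef,
                          int qBits, int add, int numCoeff)
{
    uint32_t numSig = 0;
    for (int i = 0; i < numCoeff; i++)
    {
        int sign = (coef[i] < 0) ? -1 : 1;
        int32_t tmp = abs(coef[i]) * quantCoeff[i];      // 32-bit, as in the vector code
        int level = (tmp + add) >> qBits;                // rounded quantized magnitude
        // Rounding residue, scaled down by qBits - 8 (assumes qBits >= 8).
        deltaU[i] = (tmp - (level << qBits)) >> (qBits - 8);
        numSig += (level != 0);
        qCoef[i] = (int16_t)(level * sign);              // the SVE code narrows with saturation
    }
    return numSig;
}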
View file
x265_3.6.tar.gz/source/common/aarch64/pixel-util-sve2.S
Added
@@ -0,0 +1,1686 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "pixel-util-common.S" + +.arch armv8-a+sve2 + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +// uint64_t pixel_var(const pixel* pix, intptr_t i_stride) +function PFX(pixel_var_8x8_sve2) + ptrue p0.h, vl8 + ld1b {z0.h}, p0/z, x0 + add x0, x0, x1 + mul z31.h, z0.h, z0.h + uaddlp v1.4s, v31.8h +.rept 7 + ld1b {z4.h}, p0/z, x0 + add x0, x0, x1 + add z0.h, z0.h, z4.h + mul z31.h, z4.h, z4.h + uadalp z1.s, p0/m, z31.h +.endr + uaddlv s0, v0.8h + uaddlv d1, v1.4s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +endfunc + +function PFX(pixel_var_16x16_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_var_16x16 + pixel_var_start + mov w12, #16 +.loop_var_16_sve2: + sub w12, w12, #1 + ld1 {v4.16b}, x0, x1 + pixel_var_1 v4 + cbnz w12, .loop_var_16_sve2 + pixel_var_end + ret +.vl_gt_16_pixel_var_16x16: + ptrue p0.h, vl16 + mov z0.d, #0 +.rept 16 + ld1b {z4.h}, p0/z, x0 + add x0, x0, x1 + add z0.h, z0.h, z4.h + mul z30.h, z4.h, z4.h + uadalp z1.s, p0/m, z30.h +.endr + uaddv d0, p0, z0.h + uaddv d1, p0, z1.s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +endfunc + +function PFX(pixel_var_32x32_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_var_32x32 + pixel_var_start + mov w12, #32 +.loop_var_32_sve2: + sub w12, w12, #1 + ld1 {v4.16b-v5.16b}, x0, x1 + pixel_var_1 v4 + pixel_var_1 v5 + cbnz w12, .loop_var_32_sve2 + pixel_var_end + ret +.vl_gt_16_pixel_var_32x32: + cmp x9, #48 + bgt .vl_gt_48_pixel_var_32x32 + ptrue p0.b, vl32 + mov z0.d, #0 + mov z1.d, #0 +.rept 32 + ld1b {z4.b}, p0/z, x0 + add x0, x0, x1 + uaddwb z0.h, z0.h, z4.b + uaddwt z0.h, z0.h, z4.b + umullb z28.h, z4.b, z4.b + umullt z29.h, z4.b, z4.b + uadalp z1.s, p0/m, z28.h + uadalp z1.s, p0/m, z29.h +.endr + uaddv d0, p0, z0.h + uaddv d1, p0, z1.s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +.vl_gt_48_pixel_var_32x32: + ptrue p0.h, vl32 + mov z0.d, #0 + mov z1.d, #0 +.rept 32 + ld1b {z4.h}, p0/z, x0 + add x0, x0, x1 + add z0.h, z0.h, z4.h + mul z28.h, z4.h, z4.h + uadalp z1.s, p0/m, z28.h +.endr + uaddv d0, p0, z0.h + uaddv d1, p0, z1.s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +endfunc + +function PFX(pixel_var_64x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_var_64x64 + pixel_var_start + mov w12, #64 +.loop_var_64_sve2: + sub w12, w12, #1 
+ ld1 {v4.16b-v7.16b}, x0, x1 + pixel_var_1 v4 + pixel_var_1 v5 + pixel_var_1 v6 + pixel_var_1 v7 + cbnz w12, .loop_var_64_sve2 + pixel_var_end + ret +.vl_gt_16_pixel_var_64x64: + cmp x9, #48 + bgt .vl_gt_48_pixel_var_64x64 + ptrue p0.b, vl32 + mov z0.d, #0 + mov z2.d, #0 +.rept 64 + ld1b {z4.b}, p0/z, x0 + ld1b {z5.b}, p0/z, x0, #1, mul vl + add x0, x0, x1 + uaddwb z0.h, z0.h, z4.b + uaddwt z0.h, z0.h, z4.b + uaddwb z0.h, z0.h, z5.b + uaddwt z0.h, z0.h, z5.b + umullb z24.h, z4.b, z4.b + umullt z25.h, z4.b, z4.b + umullb z26.h, z5.b, z5.b + umullt z27.h, z5.b, z5.b + uadalp z2.s, p0/m, z24.h + uadalp z2.s, p0/m, z25.h + uadalp z2.s, p0/m, z26.h + uadalp z2.s, p0/m, z27.h +.endr + uaddv d0, p0, z0.h + uaddv d1, p0, z2.s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +.vl_gt_48_pixel_var_64x64: + cmp x9, #112 + bgt .vl_gt_112_pixel_var_64x64 + ptrue p0.b, vl64 + mov z0.d, #0 + mov z1.d, #0 +.rept 64 + ld1b {z4.b}, p0/z, x0 + add x0, x0, x1 + uaddwb z0.h, z0.h, z4.b + uaddwt z0.h, z0.h, z4.b + umullb z24.h, z4.b, z4.b + umullt z25.h, z4.b, z4.b + uadalp z2.s, p0/m, z24.h + uadalp z2.s, p0/m, z25.h +.endr + uaddv d0, p0, z0.h + uaddv d1, p0, z2.s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +.vl_gt_112_pixel_var_64x64: + ptrue p0.h, vl64 + mov z0.d, #0 + mov z1.d, #0 +.rept 64 + ld1b {z4.h}, p0/z, x0 + add x0, x0, x1 + add z0.h, z0.h, z4.h + mul z24.h, z4.h, z4.h + uadalp z1.s, p0/m, z24.h +.endr + uaddv d0, p0, z0.h + uaddv d1, p0, z1.s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 + ret +endfunc + +function PFX(getResidual16_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_getResidual16 + lsl x4, x3, #1 +.rept 8 + ld1 {v0.16b}, x0, x3 + ld1 {v1.16b}, x1, x3 + ld1 {v2.16b}, x0, x3 + ld1 {v3.16b}, x1, x3 + usubl v4.8h, v0.8b, v1.8b + usubl2 v5.8h, v0.16b, v1.16b + usubl v6.8h, v2.8b, v3.8b + usubl2 v7.8h, v2.16b, v3.16b + st1 {v4.8h-v5.8h}, x2, x4 + st1 {v6.8h-v7.8h}, x2, x4 +.endr + ret +.vl_gt_16_getResidual16: + ptrue p0.h, vl16 +.rept 16 + ld1b {z0.h}, p0/z, x0 + ld1b {z2.h}, p0/z, x1 + add x0, x0, x3 + add x1, x1, x3 + sub z4.h, z0.h, z2.h + st1h {z4.h}, p0, x2 + add x2, x2, x3, lsl #1 +.endr + ret +endfunc + +function PFX(getResidual32_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_getResidual32 + lsl x4, x3, #1 + mov w12, #4 +.loop_residual_32: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x0, x3 + ld1 {v2.16b-v3.16b}, x1, x3 + ld1 {v4.16b-v5.16b}, x0, x3 + ld1 {v6.16b-v7.16b}, x1, x3 + usubl v16.8h, v0.8b, v2.8b + usubl2 v17.8h, v0.16b, v2.16b + usubl v18.8h, v1.8b, v3.8b + usubl2 v19.8h, v1.16b, v3.16b + usubl v20.8h, v4.8b, v6.8b + usubl2 v21.8h, v4.16b, v6.16b + usubl v22.8h, v5.8b, v7.8b + usubl2 v23.8h, v5.16b, v7.16b + st1 {v16.8h-v19.8h}, x2, x4 + st1 {v20.8h-v23.8h}, x2, x4 +.endr + cbnz w12, .loop_residual_32 + ret +.vl_gt_16_getResidual32: + cmp x9, #48 + bgt .vl_gt_48_getResidual32 + ptrue p0.b, vl32 +.rept 32 + ld1b {z0.b}, p0/z, x0 + ld1b {z2.b}, p0/z, x1 + add x0, x0, x3 + add x1, x1, x3 + usublb z4.h, z0.b, z2.b + usublt z5.h, z0.b, z2.b + st2h {z4.h, z5.h}, p0, x2 + add x2, x2, x3, lsl #1 +.endr + ret +.vl_gt_48_getResidual32: + ptrue p0.h, vl32 +.rept 32 + ld1b {z0.h}, p0/z, x0 + ld1b {z4.h}, p0/z, x1 + add x0, x0, x3 + add x1, x1, x3 + sub z8.h, z0.h, z4.h + st1h {z8.h}, p0, x2 + add x2, x2, x3, lsl #1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_32x32_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sub_ps_32x32 + lsl x1, x1, #1 + mov w12, #4 +.loop_sub_ps_32_sve2: + sub w12, w12, #1 +.rept 4 + ld1 
{v0.16b-v1.16b}, x2, x4 + ld1 {v2.16b-v3.16b}, x3, x5 + ld1 {v4.16b-v5.16b}, x2, x4 + ld1 {v6.16b-v7.16b}, x3, x5 + usubl v16.8h, v0.8b, v2.8b + usubl2 v17.8h, v0.16b, v2.16b + usubl v18.8h, v1.8b, v3.8b + usubl2 v19.8h, v1.16b, v3.16b + usubl v20.8h, v4.8b, v6.8b + usubl2 v21.8h, v4.16b, v6.16b + usubl v22.8h, v5.8b, v7.8b + usubl2 v23.8h, v5.16b, v7.16b + st1 {v16.8h-v19.8h}, x0, x1 + st1 {v20.8h-v23.8h}, x0, x1 +.endr + cbnz w12, .loop_sub_ps_32_sve2 + ret +.vl_gt_16_pixel_sub_ps_32x32: + cmp x9, #48 + bgt .vl_gt_48_pixel_sub_ps_32x32 + ptrue p0.b, vl32 + mov w12, #8 +.vl_gt_16_loop_sub_ps_32_sve2: + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + usublb z16.h, z0.b, z2.b + usublt z17.h, z0.b, z2.b + st2h {z16.h, z17.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_16_loop_sub_ps_32_sve2 + ret +.vl_gt_48_pixel_sub_ps_32x32: + ptrue p0.h, vl32 + mov w12, #8 +.vl_gt_48_loop_sub_ps_32_sve2: + sub w12, w12, #1 +.rept 4 + ld1b {z0.h}, p0/z, x2 + ld1b {z4.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + sub z8.h, z0.h, z4.h + st1h {z8.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_48_loop_sub_ps_32_sve2 + ret +endfunc + +function PFX(pixel_sub_ps_64x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sub_ps_64x64 + lsl x1, x1, #1 + sub x1, x1, #64 + mov w12, #16 +.loop_sub_ps_64_sve2: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v3.16b}, x2, x4 + ld1 {v4.16b-v7.16b}, x3, x5 + usubl v16.8h, v0.8b, v4.8b + usubl2 v17.8h, v0.16b, v4.16b + usubl v18.8h, v1.8b, v5.8b + usubl2 v19.8h, v1.16b, v5.16b + usubl v20.8h, v2.8b, v6.8b + usubl2 v21.8h, v2.16b, v6.16b + usubl v22.8h, v3.8b, v7.8b + usubl2 v23.8h, v3.16b, v7.16b + st1 {v16.8h-v19.8h}, x0, #64 + st1 {v20.8h-v23.8h}, x0, x1 +.endr + cbnz w12, .loop_sub_ps_64_sve2 + ret +.vl_gt_16_pixel_sub_ps_64x64: + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sub_ps_64x64 + ptrue p0.b, vl32 + mov w12, #16 +.vl_gt_16_loop_sub_ps_64_sve2: + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + ld1b {z1.b}, p0/z, x2, #1, mul vl + ld1b {z4.b}, p0/z, x3 + ld1b {z5.b}, p0/z, x3, #1, mul vl + add x2, x2, x4 + add x3, x3, x5 + usublb z16.h, z0.b, z4.b + usublt z17.h, z0.b, z4.b + usublb z18.h, z1.b, z5.b + usublt z19.h, z1.b, z5.b + st2h {z16.h, z17.h}, p0, x0 + st2h {z18.h, z19.h}, p0, x0, #2, mul vl + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_16_loop_sub_ps_64_sve2 + ret +.vl_gt_48_pixel_sub_ps_64x64: + cmp x9, #112 + bgt .vl_gt_112_pixel_sub_ps_64x64 + ptrue p0.b, vl64 + mov w12, #16 +.vl_gt_48_loop_sub_ps_64_sve2: + sub w12, w12, #1 +.rept 4 + ld1b {z0.b}, p0/z, x2 + ld1b {z4.b}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + usublb z16.h, z0.b, z4.b + usublt z17.h, z0.b, z4.b + st2h {z16.h, z17.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_48_loop_sub_ps_64_sve2 + ret +.vl_gt_112_pixel_sub_ps_64x64: + ptrue p0.h, vl64 + mov w12, #16 +.vl_gt_112_loop_sub_ps_64_sve2: + sub w12, w12, #1 +.rept 4 + ld1b {z0.h}, p0/z, x2 + ld1b {z8.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + sub z16.h, z0.h, z8.h + st1h {z16.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_112_loop_sub_ps_64_sve2 + ret +endfunc + +function PFX(pixel_sub_ps_32x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sub_ps_32x64 + lsl x1, x1, #1 + mov w12, #8 +.loop_sub_ps_32x64_sve2: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x2, x4 + ld1 {v2.16b-v3.16b}, x3, x5 + ld1 {v4.16b-v5.16b}, x2, x4 + ld1 {v6.16b-v7.16b}, x3, x5 + usubl v16.8h, v0.8b, v2.8b + 
usubl2 v17.8h, v0.16b, v2.16b + usubl v18.8h, v1.8b, v3.8b + usubl2 v19.8h, v1.16b, v3.16b + usubl v20.8h, v4.8b, v6.8b + usubl2 v21.8h, v4.16b, v6.16b + usubl v22.8h, v5.8b, v7.8b + usubl2 v23.8h, v5.16b, v7.16b + st1 {v16.8h-v19.8h}, x0, x1 + st1 {v20.8h-v23.8h}, x0, x1 +.endr + cbnz w12, .loop_sub_ps_32x64_sve2 + ret +.vl_gt_16_pixel_sub_ps_32x64: + cmp x9, #48 + bgt .vl_gt_48_pixel_sub_ps_32x64 + ptrue p0.b, vl32 + mov w12, #8 +.vl_gt_16_loop_sub_ps_32x64_sve2: + sub w12, w12, #1 +.rept 8 + ld1b {z0.b}, p0/z, x2 + ld1b {z2.b}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + usublb z16.h, z0.b, z2.b + usublt z17.h, z0.b, z2.b + st2h {z16.h, z17.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_16_loop_sub_ps_32x64_sve2 + ret +.vl_gt_48_pixel_sub_ps_32x64: + ptrue p0.h, vl32 + mov w12, #8 +.vl_gt_48_loop_sub_ps_32x64_sve2: + sub w12, w12, #1 +.rept 8 + ld1b {z0.h}, p0/z, x2 + ld1b {z4.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5 + sub z8.h, z0.h, z4.h + st1h {z8.h}, p0, x0 + add x0, x0, x1, lsl #1 +.endr + cbnz w12, .vl_gt_48_loop_sub_ps_32x64_sve2 + ret +endfunc + +function PFX(pixel_add_ps_4x4_sve2) + ptrue p0.h, vl8 + ptrue p1.h, vl4 +.rept 4 + ld1b {z0.h}, p0/z, x2 + ld1h {z2.h}, p1/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z4.h, z0.h, z2.h + sqxtunb z4.b, z4.h + st1b {z4.h}, p1, x0 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_add_ps_8x8_sve2) + ptrue p0.h, vl8 +.rept 8 + ld1b {z0.h}, p0/z, x2 + ld1h {z2.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z4.h, z0.h, z2.h + sqxtunb z4.b, z4.h + st1b {z4.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc + +.macro pixel_add_ps_16xN_sve2 h +function PFX(pixel_add_ps_16x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_add_ps_16x\h + ptrue p0.b, vl16 +.rept \h + ld1b {z0.h}, p0/z, x2 + ld1b {z1.h}, p0/z, x2, #1, mul vl + ld1h {z2.h}, p0/z, x3 + ld1h {z3.h}, p0/z, x3, #1, mul vl + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z2.h + add z25.h, z1.h, z3.h + sqxtunb z6.b, z24.h + sqxtunb z7.b, z25.h + st1b {z6.h}, p0, x0 + st1b {z7.h}, p0, x0, #1, mul vl + add x0, x0, x1 +.endr + ret +.vl_gt_16_pixel_add_ps_16x\h\(): + ptrue p0.b, vl32 +.rept \h + ld1b {z0.h}, p0/z, x2 + ld1h {z2.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z2.h + sqxtunb z6.b, z24.h + st1b {z6.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc +.endm + +pixel_add_ps_16xN_sve2 16 +pixel_add_ps_16xN_sve2 32 + +.macro pixel_add_ps_32xN_sve2 h + function PFX(pixel_add_ps_32x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_add_ps_32x\h + lsl x5, x5, #1 + mov w12, #\h / 4 +.loop_add_ps__sve2_32x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x2, x4 + ld1 {v16.8h-v19.8h}, x3, x5 + uxtl v4.8h, v0.8b + uxtl2 v5.8h, v0.16b + uxtl v6.8h, v1.8b + uxtl2 v7.8h, v1.16b + add v24.8h, v4.8h, v16.8h + add v25.8h, v5.8h, v17.8h + add v26.8h, v6.8h, v18.8h + add v27.8h, v7.8h, v19.8h + sqxtun v4.8b, v24.8h + sqxtun2 v4.16b, v25.8h + sqxtun v5.8b, v26.8h + sqxtun2 v5.16b, v27.8h + st1 {v4.16b-v5.16b}, x0, x1 +.endr + cbnz w12, .loop_add_ps__sve2_32x\h + ret +.vl_gt_16_pixel_add_ps_32x\h\(): + cmp x9, #48 + bgt .vl_gt_48_pixel_add_ps_32x\h + ptrue p0.b, vl32 +.rept \h + ld1b {z0.h}, p0/z, x2 + ld1b {z1.h}, p0/z, x2, #1, mul vl + ld1h {z4.h}, p0/z, x3 + ld1h {z5.h}, p0/z, x3, #1, mul vl + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z4.h + add z25.h, z1.h, z5.h + sqxtunb z6.b, z24.h + sqxtunb z7.b, z25.h + st1b {z6.h}, p0, x0 + st1b {z7.h}, p0, x0, #1, mul vl 
+ add x0, x0, x1 +.endr + ret +.vl_gt_48_pixel_add_ps_32x\h\(): + ptrue p0.b, vl64 +.rept \h + ld1b {z0.h}, p0/z, x2 + ld1h {z4.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z4.h + sqxtunb z6.b, z24.h + st1b {z6.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc +.endm + +pixel_add_ps_32xN_sve2 32 +pixel_add_ps_32xN_sve2 64 + +function PFX(pixel_add_ps_64x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_add_ps_64x64 + ptrue p0.b, vl16 +.rept 64 + ld1b {z0.h}, p0/z, x2 + ld1b {z1.h}, p0/z, x2, #1, mul vl + ld1b {z2.h}, p0/z, x2, #2, mul vl + ld1b {z3.h}, p0/z, x2, #3, mul vl + ld1b {z4.h}, p0/z, x2, #4 ,mul vl + ld1b {z5.h}, p0/z, x2, #5, mul vl + ld1b {z6.h}, p0/z, x2, #6, mul vl + ld1b {z7.h}, p0/z, x2, #7, mul vl + ld1h {z8.h}, p0/z, x3 + ld1h {z9.h}, p0/z, x3, #1, mul vl + ld1h {z10.h}, p0/z, x3, #2, mul vl + ld1h {z11.h}, p0/z, x3, #3, mul vl + ld1h {z12.h}, p0/z, x3, #4, mul vl + ld1h {z13.h}, p0/z, x3, #5, mul vl + ld1h {z14.h}, p0/z, x3, #6, mul vl + ld1h {z15.h}, p0/z, x3, #7, mul vl + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z8.h + add z25.h, z1.h, z9.h + add z26.h, z2.h, z10.h + add z27.h, z3.h, z11.h + add z28.h, z4.h, z12.h + add z29.h, z5.h, z13.h + add z30.h, z6.h, z14.h + add z31.h, z7.h, z15.h + sqxtunb z6.b, z24.h + sqxtunb z7.b, z25.h + sqxtunb z8.b, z26.h + sqxtunb z9.b, z27.h + sqxtunb z10.b, z28.h + sqxtunb z11.b, z29.h + sqxtunb z12.b, z30.h + sqxtunb z13.b, z31.h + st1b {z6.h}, p0, x0 + st1b {z7.h}, p0, x0, #1, mul vl + st1b {z8.h}, p0, x0, #2, mul vl + st1b {z9.h}, p0, x0, #3, mul vl + st1b {z10.h}, p0, x0, #4, mul vl + st1b {z11.h}, p0, x0, #5, mul vl + st1b {z12.h}, p0, x0, #6, mul vl + st1b {z13.h}, p0, x0, #7, mul vl + add x0, x0, x1 +.endr + ret +.vl_gt_16_pixel_add_ps_64x64: + cmp x9, #48 + bgt .vl_gt_48_pixel_add_ps_64x64 + ptrue p0.b, vl32 +.rept 64 + ld1b {z0.h}, p0/z, x2 + ld1b {z1.h}, p0/z, x2, #1, mul vl + ld1b {z2.h}, p0/z, x2, #2, mul vl + ld1b {z3.h}, p0/z, x2, #3, mul vl + ld1h {z8.h}, p0/z, x3 + ld1h {z9.h}, p0/z, x3, #1, mul vl + ld1h {z10.h}, p0/z, x3, #2, mul vl + ld1h {z11.h}, p0/z, x3, #3, mul vl + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z8.h + add z25.h, z1.h, z9.h + add z26.h, z2.h, z10.h + add z27.h, z3.h, z11.h + sqxtunb z6.b, z24.h + sqxtunb z7.b, z25.h + sqxtunb z8.b, z26.h + sqxtunb z9.b, z27.h + st1b {z6.h}, p0, x0 + st1b {z7.h}, p0, x0, #1, mul vl + st1b {z8.h}, p0, x0, #2, mul vl + st1b {z9.h}, p0, x0, #3, mul vl + add x0, x0, x1 +.endr + ret +.vl_gt_48_pixel_add_ps_64x64: + cmp x9, #112 + bgt .vl_gt_112_pixel_add_ps_64x64 + ptrue p0.b, vl64 +.rept 64 + ld1b {z0.h}, p0/z, x2 + ld1b {z1.h}, p0/z, x2, #1, mul vl + ld1h {z8.h}, p0/z, x3 + ld1h {z9.h}, p0/z, x3, #1, mul vl + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z8.h + add z25.h, z1.h, z9.h + sqxtunb z6.b, z24.h + sqxtunb z7.b, z25.h + st1b {z6.h}, p0, x0 + st1b {z7.h}, p0, x0, #1, mul vl + add x0, x0, x1 +.endr + ret +.vl_gt_112_pixel_add_ps_64x64: + ptrue p0.b, vl128 +.rept 64 + ld1b {z0.h}, p0/z, x2 + ld1h {z8.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z24.h, z0.h, z8.h + sqxtunb z6.b, z24.h + st1b {z6.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc + +// Chroma add_ps +function PFX(pixel_add_ps_4x8_sve2) + ptrue p0.h,vl4 +.rept 8 + ld1b {z0.h}, p0/z, x2 + ld1h {z2.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z4.h, z0.h, z2.h + sqxtunb z4.b, z4.h + st1b {z4.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_add_ps_8x16_sve2) + ptrue 
p0.h,vl8 +.rept 16 + ld1b {z0.h}, p0/z, x2 + ld1h {z2.h}, p0/z, x3 + add x2, x2, x4 + add x3, x3, x5, lsl #1 + add z4.h, z0.h, z2.h + sqxtunb z4.b, z4.h + st1b {z4.h}, p0, x0 + add x0, x0, x1 +.endr + ret +endfunc + +// void scale1D_128to64(pixel *dst, const pixel *src) +function PFX(scale1D_128to64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_scale1D_128to64 + ptrue p0.b, vl16 +.rept 2 + ld2b {z0.b, z1.b}, p0/z, x1 + ld2b {z2.b, z3.b}, p0/z, x1, #2, mul vl + ld2b {z4.b, z5.b}, p0/z, x1, #4, mul vl + ld2b {z6.b, z7.b}, p0/z, x1, #6, mul vl + add x1, x1, #128 + urhadd z0.b, p0/m, z0.b, z1.b + urhadd z2.b, p0/m, z2.b, z3.b + urhadd z4.b, p0/m, z4.b, z5.b + urhadd z6.b, p0/m, z6.b, z7.b + st1b {z0.b}, p0, x0 + st1b {z2.b}, p0, x0, #1, mul vl + st1b {z4.b}, p0, x0, #2, mul vl + st1b {z6.b}, p0, x0, #3, mul vl + add x0, x0, #64 +.endr + ret +.vl_gt_16_scale1D_128to64: + cmp x9, #48 + bgt .vl_gt_48_scale1D_128to64 + ptrue p0.b, vl32 +.rept 2 + ld2b {z0.b, z1.b}, p0/z, x1 + ld2b {z2.b, z3.b}, p0/z, x1, #2, mul vl + add x1, x1, #128 + urhadd z0.b, p0/m, z0.b, z1.b + urhadd z2.b, p0/m, z2.b, z3.b + st1b {z0.b}, p0, x0 + st1b {z2.b}, p0, x0, #1, mul vl + add x0, x0, #64 +.endr + ret +.vl_gt_48_scale1D_128to64: + ptrue p0.b, vl64 +.rept 2 + ld2b {z0.b, z1.b}, p0/z, x1 + add x1, x1, #128 + urhadd z0.b, p0/m, z0.b, z1.b + st1b {z0.b}, p0, x0 + add x0, x0, #64 +.endr + ret +endfunc + +/***** dequant_scaling*****/ +// void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift) +function PFX(dequant_scaling_sve2) + ptrue p0.h, vl8 + add x5, x5, #4 // shift + 4 + lsr x3, x3, #3 // num / 8 + cmp x5, x4 + blt .dequant_skip_sve2 + + mov x12, #1 + sub x6, x5, x4 // shift - per + sub x6, x6, #1 // shift - per - 1 + lsl x6, x12, x6 // 1 << shift - per - 1 (add) + mov z0.s, w6 + sub x7, x4, x5 // per - shift + mov z3.s, w7 + +.dequant_loop1_sve2: + ld1h {z19.h}, p0/z, x0 + ld1w {z2.s}, p0/z, x1 + add x1, x1, #16 + ld1w {z20.s}, p0/z, x1 + add x0, x0, #16 + add x1, x1, #16 + + sub x3, x3, #1 + sunpklo z1.s, z19.h + sunpkhi z19.s, z19.h + + mul z1.s, z1.s, z2.s // quantCoef * deQuantCoef + mul z19.s, z19.s, z20.s + add z1.s, z1.s, z0.s // quantCoef * deQuantCoef + add + add z19.s, z19.s, z0.s + + // No equivalent instructions in SVE2 for sshl + // as sqshl has double latency + sshl v1.4s, v1.4s, v3.4s + sshl v19.4s, v19.4s, v3.4s + + sqxtnb z16.h, z1.s + sqxtnb z17.h, z19.s + st1h {z16.s}, p0, x2 + st1h {z17.s}, p0, x2, #1, mul vl + add x2, x2, #16 + cbnz x3, .dequant_loop1_sve2 + ret + +.dequant_skip_sve2: + sub x6, x4, x5 // per - shift + mov z0.h, w6 + +.dequant_loop2_sve2: + ld1h {z19.h}, p0/z, x0 + ld1w {z2.s}, p0/z, x1 + add x1, x1, #16 + ld1w {z20.s}, p0/z, x1 + add x0, x0, #16 + add x1, x1, #16 + + + sub x3, x3, #1 + sunpklo z1.s, z19.h + sunpkhi z19.s, z19.h + + mul z1.s, z1.s, z2.s // quantCoef * deQuantCoef + mul z19.s, z19.s, z20.s + + // Keeping NEON instructions here in order to have + // one sqshl later + sqxtn v16.4h, v1.4s // x265_clip3 + sqxtn2 v16.8h, v19.4s + + sqshl z16.h, p0/m, z16.h, z0.h // coefQ << per - shift + st1h {z16.h}, p0, x2 + add x2, x2, #16 + cbnz x3, .dequant_loop2_sve2 + ret +endfunc + +// void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift) +function PFX(dequant_normal_sve2) + lsr w2, w2, #4 // num / 16 + neg w4, w4 + mov z0.h, w3 + mov z1.s, w4 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_dequant_normal_sve2 +.dqn_loop1_sve2: + ld1 {v2.8h, v3.8h}, x0, #32 + smull v16.4s, 
v2.4h, v0.4h + smull2 v17.4s, v2.8h, v0.8h + smull v18.4s, v3.4h, v0.4h + smull2 v19.4s, v3.8h, v0.8h + + srshl v16.4s, v16.4s, v1.4s + srshl v17.4s, v17.4s, v1.4s + srshl v18.4s, v18.4s, v1.4s + srshl v19.4s, v19.4s, v1.4s + + sqxtn v2.4h, v16.4s + sqxtn2 v2.8h, v17.4s + sqxtn v3.4h, v18.4s + sqxtn2 v3.8h, v19.4s + + sub w2, w2, #1 + st1 {v2.8h, v3.8h}, x1, #32 + cbnz w2, .dqn_loop1_sve2 + ret +.vl_gt_16_dequant_normal_sve2: + ptrue p0.h, vl16 +.gt_16_dqn_loop1_sve2: + ld1h {z2.h}, p0/z, x0 + add x0, x0, #32 + smullb z16.s, z2.h, z0.h + smullt z17.s, z2.h, z0.h + + srshl z16.s, p0/m, z16.s, z1.s + srshl z17.s, p0/m, z17.s, z1.s + + sqxtnb z2.h, z16.s + sqxtnt z2.h, z17.s + + sub w2, w2, #1 + st1h {z2.h}, p0, x1 + add x1, x1, #32 + cbnz w2, .gt_16_dqn_loop1_sve2 + ret + +endfunc + +// void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24) +function PFX(ssim_4x4x2_core_sve2) + ptrue p0.b, vl16 + movi v30.2d, #0 + movi v31.2d, #0 + + ld1b {z0.h}, p0/z, x0 + add x0, x0, x1 + ld1b {z1.h}, p0/z, x0 + add x0, x0, x1 + ld1b {z2.h}, p0/z, x0 + add x0, x0, x1 + ld1b {z3.h}, p0/z, x0 + add x0, x0, x1 + + ld1b {z4.h}, p0/z, x2 + add x2, x2, x3 + ld1b {z5.h}, p0/z, x2 + add x2, x2, x3 + ld1b {z6.h}, p0/z, x2 + add x2, x2, x3 + ld1b {z7.h}, p0/z, x2 + add x2, x2, x3 + + mul z16.h, z0.h, z0.h + mul z17.h, z1.h, z1.h + mul z18.h, z2.h, z2.h + uaddlp v30.4s, v16.8h + + mul z19.h, z3.h, z3.h + mul z20.h, z4.h, z4.h + mul z21.h, z5.h, z5.h + uadalp v30.4s, v17.8h + + mul z22.h, z6.h, z6.h + mul z23.h, z7.h, z7.h + mul z24.h, z0.h, z4.h + uadalp v30.4s, v18.8h + + mul z25.h, z1.h, z5.h + mul z26.h, z2.h, z6.h + mul z27.h, z3.h, z7.h + uadalp v30.4s, v19.8h + + add z28.h, z0.h, z1.h + add z29.h, z4.h, z5.h + uadalp v30.4s, v20.8h + uaddlp v31.4s, v24.8h + + add z28.h, z28.h, z2.h + add z29.h, z29.h, z6.h + uadalp v30.4s, v21.8h + uadalp v31.4s, v25.8h + + add z28.h, z28.h, z3.h + add z29.h, z29.h, z7.h + uadalp v30.4s, v22.8h + uadalp v31.4s, v26.8h + + // Better use NEON instructions here + uaddlp v28.4s, v28.8h + uaddlp v29.4s, v29.8h + uadalp v30.4s, v23.8h + uadalp v31.4s, v27.8h + + addp v28.4s, v28.4s, v28.4s + addp v29.4s, v29.4s, v29.4s + addp v30.4s, v30.4s, v30.4s + addp v31.4s, v31.4s, v31.4s + + st4 {v28.2s, v29.2s, v30.2s, v31.2s}, x4 + ret +endfunc + +// void ssimDist_c(const pixel* fenc, uint32_t fStride, const pixel* recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k) +.macro ssimDist_start_sve2 + mov z0.d, #0 + mov z1.d, #0 +.endm + +.macro ssimDist_1_sve2 z0 z1 z2 z3 + sub z16.s, \z0\().s, \z2\().s + sub z17.s, \z1\().s, \z3\().s + mul z18.s, \z0\().s, \z0\().s + mul z19.s, \z1\().s, \z1\().s + mul z20.s, z16.s, z16.s + mul z21.s, z17.s, z17.s + add z0.s, z0.s, z18.s + add z0.s, z0.s, z19.s + add z1.s, z1.s, z20.s + add z1.s, z1.s, z21.s +.endm + +.macro ssimDist_end_sve2 + uaddv d0, p0, z0.s + uaddv d1, p0, z1.s + str d0, x6 + str d1, x4 +.endm + +function PFX(ssimDist4_sve2) + ssimDist_start + ptrue p0.s, vl4 +.rept 4 + ld1b {z4.s}, p0/z, x0 + add x0, x0, x1 + ld1b {z5.s}, p0/z, x2 + add x2, x2, x3 + sub z2.s, z4.s, z5.s + mul z3.s, z4.s, z4.s + mul z2.s, z2.s, z2.s + add z0.s, z0.s, z3.s + add z1.s, z1.s, z2.s +.endr + ssimDist_end + ret +endfunc + +function PFX(ssimDist8_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_ssimDist8 + ssimDist_start + ptrue p0.s, vl4 +.rept 8 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + add x0, x0, x1 + ld1b {z6.s}, p0/z, x2 + ld1b {z7.s}, p0/z, x2, #1, mul vl + add x2, 
x2, x3 + ssimDist_1_sve2 z4, z5, z6, z7 +.endr + ssimDist_end + ret +.vl_gt_16_ssimDist8: + ssimDist_start_sve2 + ptrue p0.s, vl8 +.rept 8 + ld1b {z4.s}, p0/z, x0 + add x0, x0, x1 + ld1b {z6.s}, p0/z, x2 + add x2, x2, x3 + sub z20.s, z4.s, z6.s + mul z16.s, z4.s, z4.s + mul z18.s, z20.s, z20.s + add z0.s, z0.s, z16.s + add z1.s, z1.s, z18.s +.endr + ssimDist_end_sve2 + ret +endfunc + +function PFX(ssimDist16_sve2) + mov w12, #16 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_ssimDist16 + ssimDist_start + ptrue p0.s, vl4 +.loop_ssimDist16_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + add x0, x0, x1 + ld1b {z8.s}, p0/z, x2 + ld1b {z9.s}, p0/z, x2, #1, mul vl + ld1b {z10.s}, p0/z, x2, #2, mul vl + ld1b {z11.s}, p0/z, x2, #3, mul vl + add x2, x2, x3 + ssimDist_1_sve2 z4, z5, z8, z9 + ssimDist_1_sve2 z6, z7, z10, z11 + cbnz w12, .loop_ssimDist16_sve2 + ssimDist_end + ret +.vl_gt_16_ssimDist16: + cmp x9, #48 + bgt .vl_gt_48_ssimDist16 + ssimDist_start_sve2 + ptrue p0.s, vl8 +.vl_gt_16_loop_ssimDist16_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + add x0, x0, x1 + ld1b {z8.s}, p0/z, x2 + ld1b {z9.s}, p0/z, x2, #1, mul vl + add x2, x2, x3 + ssimDist_1_sve2 z4, z5, z8, z9 + cbnz w12, .vl_gt_16_loop_ssimDist16_sve2 + ssimDist_end_sve2 + ret +.vl_gt_48_ssimDist16: + ssimDist_start_sve2 + ptrue p0.s, vl16 +.vl_gt_48_loop_ssimDist16_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + add x0, x0, x1 + ld1b {z8.s}, p0/z, x2 + add x2, x2, x3 + sub z20.s, z4.s, z8.s + mul z16.s, z4.s, z4.s + mul z18.s, z20.s, z20.s + add z0.s, z0.s, z16.s + add z1.s, z1.s, z18.s + cbnz w12, .vl_gt_48_loop_ssimDist16_sve2 + ssimDist_end_sve2 + ret +endfunc + +function PFX(ssimDist32_sve2) + mov w12, #32 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_ssimDist32 + ssimDist_start + ptrue p0.s, vl4 +.loop_ssimDist32_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + ld1b {z3.s}, p0/z, x0, #1, mul vl + ld1b {z4.s}, p0/z, x0, #2, mul vl + ld1b {z5.s}, p0/z, x0, #3, mul vl + ld1b {z6.s}, p0/z, x0, #4, mul vl + ld1b {z7.s}, p0/z, x0, #5, mul vl + ld1b {z8.s}, p0/z, x0, #6, mul vl + ld1b {z9.s}, p0/z, x0, #7, mul vl + add x0, x0, x1 + ld1b {z10.s}, p0/z, x2 + ld1b {z11.s}, p0/z, x2, #1, mul vl + ld1b {z12.s}, p0/z, x2, #2, mul vl + ld1b {z13.s}, p0/z, x2, #3, mul vl + ld1b {z14.s}, p0/z, x2, #4, mul vl + ld1b {z15.s}, p0/z, x2, #5, mul vl + ld1b {z30.s}, p0/z, x2, #6, mul vl + ld1b {z31.s}, p0/z, x2, #7, mul vl + add x2, x2, x3 + ssimDist_1_sve2 z2, z3, z10, z11 + ssimDist_1_sve2 z4, z5, z12, z13 + ssimDist_1_sve2 z6, z7, z14, z15 + ssimDist_1_sve2 z8, z9, z30, z31 + cbnz w12, .loop_ssimDist32_sve2 + ssimDist_end + ret +.vl_gt_16_ssimDist32: + cmp x9, #48 + bgt .vl_gt_48_ssimDist32 + ssimDist_start_sve2 + ptrue p0.s, vl8 +.vl_gt_16_loop_ssimDist32_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + ld1b {z3.s}, p0/z, x0, #1, mul vl + ld1b {z4.s}, p0/z, x0, #2, mul vl + ld1b {z5.s}, p0/z, x0, #3, mul vl + add x0, x0, x1 + ld1b {z10.s}, p0/z, x2 + ld1b {z11.s}, p0/z, x2, #1, mul vl + ld1b {z12.s}, p0/z, x2, #2, mul vl + ld1b {z13.s}, p0/z, x2, #3, mul vl + add x2, x2, x3 + ssimDist_1_sve2 z2, z3, z10, z11 + ssimDist_1_sve2 z4, z5, z12, z13 + cbnz w12, .vl_gt_16_loop_ssimDist32_sve2 + ssimDist_end_sve2 + ret +.vl_gt_48_ssimDist32: + cmp x9, #112 + bgt .vl_gt_112_ssimDist32 + ssimDist_start_sve2 + ptrue p0.s, vl16 +.vl_gt_48_loop_ssimDist32_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + 
ld1b {z3.s}, p0/z, x0, #1, mul vl + add x0, x0, x1 + ld1b {z10.s}, p0/z, x2 + ld1b {z11.s}, p0/z, x2, #1, mul vl + add x2, x2, x3 + ssimDist_1_sve2 z2, z3, z10, z11 + cbnz w12, .vl_gt_48_loop_ssimDist32_sve2 + ssimDist_end_sve2 + ret +.vl_gt_112_ssimDist32: + ssimDist_start_sve2 + ptrue p0.s, vl32 +.vl_gt_112_loop_ssimDist32_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + add x0, x0, x1 + ld1b {z10.s}, p0/z, x2 + add x2, x2, x3 + sub z20.s, z2.s, z10.s + mul z16.s, z2.s, z2.s + mul z18.s, z20.s, z20.s + add z0.s, z0.s, z16.s + add z1.s, z1.s, z18.s + cbnz w12, .vl_gt_112_loop_ssimDist32_sve2 + ssimDist_end_sve2 + ret +endfunc + +function PFX(ssimDist64_sve2) + mov w12, #64 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_ssimDist64 + ssimDist_start + ptrue p0.s, vl4 +.loop_ssimDist64_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + ld1b {z3.s}, p0/z, x0, #1, mul vl + ld1b {z4.s}, p0/z, x0, #2, mul vl + ld1b {z5.s}, p0/z, x0, #3, mul vl + ld1b {z6.s}, p0/z, x0, #4, mul vl + ld1b {z7.s}, p0/z, x0, #5, mul vl + ld1b {z8.s}, p0/z, x0, #6, mul vl + ld1b {z9.s}, p0/z, x0, #7, mul vl + ld1b {z23.s}, p0/z, x2 + ld1b {z24.s}, p0/z, x2, #1, mul vl + ld1b {z25.s}, p0/z, x2, #2, mul vl + ld1b {z26.s}, p0/z, x2, #3, mul vl + ld1b {z27.s}, p0/z, x2, #4, mul vl + ld1b {z28.s}, p0/z, x2, #5, mul vl + ld1b {z29.s}, p0/z, x2, #6, mul vl + ld1b {z30.s}, p0/z, x2, #7, mul vl + ssimDist_1_sve2 z2, z3, z23, z24 + ssimDist_1_sve2 z4, z5, z25, z26 + ssimDist_1_sve2 z6, z7, z27, z28 + ssimDist_1_sve2 z8, z9, z29, z30 + mov x4, x0 + mov x5, x2 + add x4, x4, #32 + add x5, x5, #32 + ld1b {z2.s}, p0/z, x4 + ld1b {z3.s}, p0/z, x4, #1, mul vl + ld1b {z4.s}, p0/z, x4, #2, mul vl + ld1b {z5.s}, p0/z, x4, #3, mul vl + ld1b {z6.s}, p0/z, x4, #4, mul vl + ld1b {z7.s}, p0/z, x4, #5, mul vl + ld1b {z8.s}, p0/z, x4, #6, mul vl + ld1b {z9.s}, p0/z, x4, #7, mul vl + ld1b {z23.s}, p0/z, x5 + ld1b {z24.s}, p0/z, x5, #1, mul vl + ld1b {z25.s}, p0/z, x5, #2, mul vl + ld1b {z26.s}, p0/z, x5, #3, mul vl + ld1b {z27.s}, p0/z, x5, #4, mul vl + ld1b {z28.s}, p0/z, x5, #5, mul vl + ld1b {z29.s}, p0/z, x5, #6, mul vl + ld1b {z30.s}, p0/z, x5, #7, mul vl + ssimDist_1_sve2 z2, z3, z23, z24 + ssimDist_1_sve2 z4, z5, z25, z26 + ssimDist_1_sve2 z6, z7, z27, z28 + ssimDist_1_sve2 z8, z9, z29, z30 + add x0, x0, x1 + add x2, x2, x3 + cbnz w12, .loop_ssimDist64_sve2 + ssimDist_end + ret +.vl_gt_16_ssimDist64: + cmp x9, #48 + bgt .vl_gt_48_ssimDist64 + ssimDist_start_sve2 + ptrue p0.s, vl8 +.vl_gt_16_loop_ssimDist64_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + ld1b {z3.s}, p0/z, x0, #1, mul vl + ld1b {z4.s}, p0/z, x0, #2, mul vl + ld1b {z5.s}, p0/z, x0, #3, mul vl + ld1b {z6.s}, p0/z, x0, #4, mul vl + ld1b {z7.s}, p0/z, x0, #5, mul vl + ld1b {z8.s}, p0/z, x0, #6, mul vl + ld1b {z9.s}, p0/z, x0, #7, mul vl + ld1b {z23.s}, p0/z, x2 + ld1b {z24.s}, p0/z, x2, #1, mul vl + ld1b {z25.s}, p0/z, x2, #2, mul vl + ld1b {z26.s}, p0/z, x2, #3, mul vl + ld1b {z27.s}, p0/z, x2, #4, mul vl + ld1b {z28.s}, p0/z, x2, #5, mul vl + ld1b {z29.s}, p0/z, x2, #6, mul vl + ld1b {z30.s}, p0/z, x2, #7, mul vl + ssimDist_1_sve2 z2, z3, z23, z24 + ssimDist_1_sve2 z4, z5, z25, z26 + ssimDist_1_sve2 z6, z7, z27, z28 + ssimDist_1_sve2 z8, z9, z29, z30 + add x0, x0, x1 + add x2, x2, x3 + cbnz w12, .vl_gt_16_loop_ssimDist64_sve2 + ssimDist_end_sve2 + ret +.vl_gt_48_ssimDist64: + cmp x9, #112 + bgt .vl_gt_112_ssimDist64 + ssimDist_start_sve2 + ptrue p0.s, vl16 +.vl_gt_48_loop_ssimDist64_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + ld1b {z3.s}, p0/z, x0, #1, mul vl + ld1b 
{z4.s}, p0/z, x0, #2, mul vl + ld1b {z5.s}, p0/z, x0, #3, mul vl + ld1b {z23.s}, p0/z, x2 + ld1b {z24.s}, p0/z, x2, #1, mul vl + ld1b {z25.s}, p0/z, x2, #2, mul vl + ld1b {z26.s}, p0/z, x2, #3, mul vl + ssimDist_1_sve2 z2, z3, z23, z24 + ssimDist_1_sve2 z4, z5, z25, z26 + add x0, x0, x1 + add x2, x2, x3 + cbnz w12, .vl_gt_48_loop_ssimDist64_sve2 + ssimDist_end_sve2 + ret +.vl_gt_112_ssimDist64: + ssimDist_start_sve2 + ptrue p0.s, vl32 +.vl_gt_112_loop_ssimDist64_sve2: + sub w12, w12, #1 + ld1b {z2.s}, p0/z, x0 + ld1b {z3.s}, p0/z, x0, #1, mul vl + ld1b {z23.s}, p0/z, x2 + ld1b {z24.s}, p0/z, x2, #1, mul vl + ssimDist_1_sve2 z2, z3, z23, z24 + add x0, x0, x1 + add x2, x2, x3 + cbnz w12, .vl_gt_112_loop_ssimDist64_sve2 + ssimDist_end_sve2 + ret +endfunc + +// void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k) +.macro normFact_start_sve2 + mov z0.d, #0 +.endm + +.macro normFact_1_sve2 z0, z1 + mul z16.s, \z0\().s, \z0\().s + mul z17.s, \z1\().s, \z1\().s + add z0.s, z0.s, z16.s + add z0.s, z0.s, z17.s +.endm + +.macro normFact_end_sve2 + uaddv d0, p0, z0.s + str d0, x3 +.endm + +function PFX(normFact8_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_normFact8 + normFact_start + ptrue p0.s, vl4 +.rept 8 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + add x0, x0, x1 + normFact_1_sve2 z4, z5 +.endr + normFact_end + ret +.vl_gt_16_normFact8: + normFact_start_sve2 + ptrue p0.s, vl8 +.rept 8 + ld1b {z4.s}, p0/z, x0 + add x0, x0, x1 + mul z16.s, z4.s, z4.s + add z0.s, z0.s, z16.s +.endr + normFact_end_sve2 + ret +endfunc + +function PFX(normFact16_sve2) + mov w12, #16 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_normFact16 + normFact_start + ptrue p0.s, vl4 +.loop_normFact16_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + add x0, x0, x1 + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + cbnz w12, .loop_normFact16_sve2 + normFact_end + ret +.vl_gt_16_normFact16: + cmp x9, #48 + bgt .vl_gt_48_normFact16 + normFact_start_sve2 + ptrue p0.s, vl8 +.vl_gt_16_loop_normFact16_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + add x0, x0, x1 + normFact_1_sve2 z4, z5 + cbnz w12, .vl_gt_16_loop_normFact16_sve2 + normFact_end_sve2 + ret +.vl_gt_48_normFact16: + normFact_start_sve2 + ptrue p0.s, vl16 +.vl_gt_48_loop_normFact16_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + add x0, x0, x1 + mul z16.s, z4.s, z4.s + add z0.s, z0.s, z16.s + cbnz w12, .vl_gt_48_loop_normFact16_sve2 + normFact_end_sve2 + ret +endfunc + +function PFX(normFact32_sve2) + mov w12, #32 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_normFact32 + normFact_start + ptrue p0.s, vl4 +.loop_normFact32_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + ld1b {z8.s}, p0/z, x0, #4, mul vl + ld1b {z9.s}, p0/z, x0, #5, mul vl + ld1b {z10.s}, p0/z, x0, #6, mul vl + ld1b {z11.s}, p0/z, x0, #7, mul vl + add x0, x0, x1 + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + normFact_1_sve2 z8, z9 + normFact_1_sve2 z10, z11 + cbnz w12, .loop_normFact32_sve2 + normFact_end + ret +.vl_gt_16_normFact32: + cmp x9, #48 + bgt .vl_gt_48_normFact32 + normFact_start_sve2 + ptrue p0.s, vl8 +.vl_gt_16_loop_normFact32_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + add 
x0, x0, x1 + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + cbnz w12, .vl_gt_16_loop_normFact32_sve2 + normFact_end_sve2 + ret +.vl_gt_48_normFact32: + cmp x9, #112 + bgt .vl_gt_112_normFact32 + normFact_start_sve2 + ptrue p0.s, vl16 +.vl_gt_48_loop_normFact32_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + add x0, x0, x1 + normFact_1_sve2 z4, z5 + cbnz w12, .vl_gt_48_loop_normFact32_sve2 + normFact_end_sve2 + ret +.vl_gt_112_normFact32: + normFact_start_sve2 + ptrue p0.s, vl32 +.vl_gt_112_loop_normFact32_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + add x0, x0, x1 + mul z16.s, z4.s, z4.s + add z0.s, z0.s, z16.s + cbnz w12, .vl_gt_112_loop_normFact32_sve2 + normFact_end_sve2 + ret +endfunc + +function PFX(normFact64_sve2) + mov w12, #64 + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_normFact64 + normFact_start + ptrue p0.s, vl4 +.loop_normFact64_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + ld1b {z8.s}, p0/z, x0, #4, mul vl + ld1b {z9.s}, p0/z, x0, #5, mul vl + ld1b {z10.s}, p0/z, x0, #6, mul vl + ld1b {z11.s}, p0/z, x0, #7, mul vl + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + normFact_1_sve2 z8, z9 + normFact_1_sve2 z10, z11 + mov x2, x0 + add x2, x2, #32 + ld1b {z4.s}, p0/z, x2 + ld1b {z5.s}, p0/z, x2, #1, mul vl + ld1b {z6.s}, p0/z, x2, #2, mul vl + ld1b {z7.s}, p0/z, x2, #3, mul vl + ld1b {z8.s}, p0/z, x2, #4, mul vl + ld1b {z9.s}, p0/z, x2, #5, mul vl + ld1b {z10.s}, p0/z, x2, #6, mul vl + ld1b {z11.s}, p0/z, x2, #7, mul vl + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + normFact_1_sve2 z8, z9 + normFact_1_sve2 z10, z11 + add x0, x0, x1 + cbnz w12, .loop_normFact64_sve2 + normFact_end + ret +.vl_gt_16_normFact64: + cmp x9, #48 + bgt .vl_gt_48_normFact64 + normFact_start_sve2 + ptrue p0.s, vl8 +.vl_gt_16_loop_normFact64_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + ld1b {z8.s}, p0/z, x0, #4, mul vl + ld1b {z9.s}, p0/z, x0, #5, mul vl + ld1b {z10.s}, p0/z, x0, #6, mul vl + ld1b {z11.s}, p0/z, x0, #7, mul vl + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + normFact_1_sve2 z8, z9 + normFact_1_sve2 z10, z11 + add x0, x0, x1 + cbnz w12, .vl_gt_16_loop_normFact64_sve2 + normFact_end_sve2 + ret +.vl_gt_48_normFact64: + cmp x9, #112 + bgt .vl_gt_112_normFact64 + normFact_start_sve2 + ptrue p0.s, vl16 +.vl_gt_48_loop_normFact64_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + ld1b {z6.s}, p0/z, x0, #2, mul vl + ld1b {z7.s}, p0/z, x0, #3, mul vl + normFact_1_sve2 z4, z5 + normFact_1_sve2 z6, z7 + add x0, x0, x1 + cbnz w12, .vl_gt_48_loop_normFact64_sve2 + normFact_end_sve2 + ret +.vl_gt_112_normFact64: + normFact_start_sve2 + ptrue p0.s, vl32 +.vl_gt_112_loop_normFact64_sve2: + sub w12, w12, #1 + ld1b {z4.s}, p0/z, x0 + ld1b {z5.s}, p0/z, x0, #1, mul vl + normFact_1_sve2 z4, z5 + add x0, x0, x1 + cbnz w12, .vl_gt_112_loop_normFact64_sve2 + normFact_end_sve2 + ret +endfunc
View file
x265_3.5.tar.gz/source/common/aarch64/pixel-util.S -> x265_3.6.tar.gz/source/common/aarch64/pixel-util.S
Changed
@@ -1,8 +1,9 @@ /***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc + * Copyright (C) 2020-2021 MulticoreWare, Inc * * Authors: Yimeng Su <yimeng.su@huawei.com> * Hongbin Liu <liuhongbin1@huawei.com> + * Sebastian Pop <spop@amazon.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -23,13 +24,652 @@ *****************************************************************************/ #include "asm.S" +#include "pixel-util-common.S" +#ifdef __APPLE__ +.section __RODATA,__rodata +#else .section .rodata +#endif .align 4 .text +// uint64_t pixel_var(const pixel* pix, intptr_t i_stride) +function PFX(pixel_var_8x8_neon) + ld1 {v4.8b}, x0, x1 // pixx + uxtl v0.8h, v4.8b // sum = pixx + umull v1.8h, v4.8b, v4.8b + uaddlp v1.4s, v1.8h // sqr = pixx * pixx + +.rept 7 + ld1 {v4.8b}, x0, x1 // pixx + umull v31.8h, v4.8b, v4.8b + uaddw v0.8h, v0.8h, v4.8b // sum += pixx + uadalp v1.4s, v31.8h // sqr += pixx * pixx +.endr + uaddlv s0, v0.8h + uaddlv d1, v1.4s + fmov w0, s0 + fmov x1, d1 + orr x0, x0, x1, lsl #32 // return sum + ((uint64_t)sqr << 32); + ret +endfunc + +function PFX(pixel_var_16x16_neon) + pixel_var_start + mov w12, #16 +.loop_var_16: + sub w12, w12, #1 + ld1 {v4.16b}, x0, x1 + pixel_var_1 v4 + cbnz w12, .loop_var_16 + pixel_var_end + ret +endfunc + +function PFX(pixel_var_32x32_neon) + pixel_var_start + mov w12, #32 +.loop_var_32: + sub w12, w12, #1 + ld1 {v4.16b-v5.16b}, x0, x1 + pixel_var_1 v4 + pixel_var_1 v5 + cbnz w12, .loop_var_32 + pixel_var_end + ret +endfunc + +function PFX(pixel_var_64x64_neon) + pixel_var_start + mov w12, #64 +.loop_var_64: + sub w12, w12, #1 + ld1 {v4.16b-v7.16b}, x0, x1 + pixel_var_1 v4 + pixel_var_1 v5 + pixel_var_1 v6 + pixel_var_1 v7 + cbnz w12, .loop_var_64 + pixel_var_end + ret +endfunc + +// void getResidual4_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride) +function PFX(getResidual4_neon) + lsl x4, x3, #1 +.rept 2 + ld1 {v0.8b}, x0, x3 + ld1 {v1.8b}, x1, x3 + ld1 {v2.8b}, x0, x3 + ld1 {v3.8b}, x1, x3 + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + st1 {v4.8b}, x2, x4 + st1 {v5.8b}, x2, x4 +.endr + ret +endfunc + +function PFX(getResidual8_neon) + lsl x4, x3, #1 +.rept 4 + ld1 {v0.8b}, x0, x3 + ld1 {v1.8b}, x1, x3 + ld1 {v2.8b}, x0, x3 + ld1 {v3.8b}, x1, x3 + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + st1 {v4.16b}, x2, x4 + st1 {v5.16b}, x2, x4 +.endr + ret +endfunc + +function PFX(getResidual16_neon) + lsl x4, x3, #1 +.rept 8 + ld1 {v0.16b}, x0, x3 + ld1 {v1.16b}, x1, x3 + ld1 {v2.16b}, x0, x3 + ld1 {v3.16b}, x1, x3 + usubl v4.8h, v0.8b, v1.8b + usubl2 v5.8h, v0.16b, v1.16b + usubl v6.8h, v2.8b, v3.8b + usubl2 v7.8h, v2.16b, v3.16b + st1 {v4.8h-v5.8h}, x2, x4 + st1 {v6.8h-v7.8h}, x2, x4 +.endr + ret +endfunc + +function PFX(getResidual32_neon) + lsl x4, x3, #1 + mov w12, #4 +.loop_residual_32: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x0, x3 + ld1 {v2.16b-v3.16b}, x1, x3 + ld1 {v4.16b-v5.16b}, x0, x3 + ld1 {v6.16b-v7.16b}, x1, x3 + usubl v16.8h, v0.8b, v2.8b + usubl2 v17.8h, v0.16b, v2.16b + usubl v18.8h, v1.8b, v3.8b + usubl2 v19.8h, v1.16b, v3.16b + usubl v20.8h, v4.8b, v6.8b + usubl2 v21.8h, v4.16b, v6.16b + usubl v22.8h, v5.8b, v7.8b + usubl2 v23.8h, v5.16b, v7.16b + st1 {v16.8h-v19.8h}, x2, x4 + st1 {v20.8h-v23.8h}, x2, x4 +.endr + cbnz w12, .loop_residual_32 + ret +endfunc + +// void pixel_sub_ps_neon(int16_t* a, intptr_t 
dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1) +function PFX(pixel_sub_ps_4x4_neon) + lsl x1, x1, #1 +.rept 2 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x3, x5 + ld1 {v2.8b}, x2, x4 + ld1 {v3.8b}, x3, x5 + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + st1 {v4.4h}, x0, x1 + st1 {v5.4h}, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_8x8_neon) + lsl x1, x1, #1 +.rept 4 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x3, x5 + ld1 {v2.8b}, x2, x4 + ld1 {v3.8b}, x3, x5 + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + st1 {v4.8h}, x0, x1 + st1 {v5.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_16x16_neon) + lsl x1, x1, #1 +.rept 8 + ld1 {v0.16b}, x2, x4 + ld1 {v1.16b}, x3, x5 + ld1 {v2.16b}, x2, x4 + ld1 {v3.16b}, x3, x5 + usubl v4.8h, v0.8b, v1.8b + usubl2 v5.8h, v0.16b, v1.16b + usubl v6.8h, v2.8b, v3.8b + usubl2 v7.8h, v2.16b, v3.16b + st1 {v4.8h-v5.8h}, x0, x1 + st1 {v6.8h-v7.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_32x32_neon) + lsl x1, x1, #1 + mov w12, #4 +.loop_sub_ps_32: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x2, x4 + ld1 {v2.16b-v3.16b}, x3, x5 + ld1 {v4.16b-v5.16b}, x2, x4 + ld1 {v6.16b-v7.16b}, x3, x5 + usubl v16.8h, v0.8b, v2.8b + usubl2 v17.8h, v0.16b, v2.16b + usubl v18.8h, v1.8b, v3.8b + usubl2 v19.8h, v1.16b, v3.16b + usubl v20.8h, v4.8b, v6.8b + usubl2 v21.8h, v4.16b, v6.16b + usubl v22.8h, v5.8b, v7.8b + usubl2 v23.8h, v5.16b, v7.16b + st1 {v16.8h-v19.8h}, x0, x1 + st1 {v20.8h-v23.8h}, x0, x1 +.endr + cbnz w12, .loop_sub_ps_32 + ret +endfunc + +function PFX(pixel_sub_ps_64x64_neon) + lsl x1, x1, #1 + sub x1, x1, #64 + mov w12, #16 +.loop_sub_ps_64: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v3.16b}, x2, x4 + ld1 {v4.16b-v7.16b}, x3, x5 + usubl v16.8h, v0.8b, v4.8b + usubl2 v17.8h, v0.16b, v4.16b + usubl v18.8h, v1.8b, v5.8b + usubl2 v19.8h, v1.16b, v5.16b + usubl v20.8h, v2.8b, v6.8b + usubl2 v21.8h, v2.16b, v6.16b + usubl v22.8h, v3.8b, v7.8b + usubl2 v23.8h, v3.16b, v7.16b + st1 {v16.8h-v19.8h}, x0, #64 + st1 {v20.8h-v23.8h}, x0, x1 +.endr + cbnz w12, .loop_sub_ps_64 + ret +endfunc + +// chroma sub_ps +function PFX(pixel_sub_ps_4x8_neon) + lsl x1, x1, #1 +.rept 4 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x3, x5 + ld1 {v2.8b}, x2, x4 + ld1 {v3.8b}, x3, x5 + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + st1 {v4.4h}, x0, x1 + st1 {v5.4h}, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_8x16_neon) + lsl x1, x1, #1 +.rept 8 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x3, x5 + ld1 {v2.8b}, x2, x4 + ld1 {v3.8b}, x3, x5 + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + st1 {v4.8h}, x0, x1 + st1 {v5.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_16x32_neon) + lsl x1, x1, #1 +.rept 16 + ld1 {v0.16b}, x2, x4 + ld1 {v1.16b}, x3, x5 + ld1 {v2.16b}, x2, x4 + ld1 {v3.16b}, x3, x5 + usubl v4.8h, v0.8b, v1.8b + usubl2 v5.8h, v0.16b, v1.16b + usubl v6.8h, v2.8b, v3.8b + usubl2 v7.8h, v2.16b, v3.16b + st1 {v4.8h-v5.8h}, x0, x1 + st1 {v6.8h-v7.8h}, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_sub_ps_32x64_neon) + lsl x1, x1, #1 + mov w12, #8 +.loop_sub_ps_32x64: + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x2, x4 + ld1 {v2.16b-v3.16b}, x3, x5 + ld1 {v4.16b-v5.16b}, x2, x4 + ld1 {v6.16b-v7.16b}, x3, x5 + usubl v16.8h, v0.8b, v2.8b + usubl2 v17.8h, v0.16b, v2.16b + usubl v18.8h, v1.8b, v3.8b + usubl2 v19.8h, v1.16b, v3.16b + usubl v20.8h, v4.8b, v6.8b + usubl2 v21.8h, v4.16b, v6.16b + usubl v22.8h, v5.8b, v7.8b + usubl2 v23.8h, v5.16b, v7.16b + st1 
{v16.8h-v19.8h}, x0, x1 + st1 {v20.8h-v23.8h}, x0, x1 +.endr + cbnz w12, .loop_sub_ps_32x64 + ret +endfunc + +// void x265_pixel_add_ps_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +function PFX(pixel_add_ps_4x4_neon) + lsl x5, x5, #1 +.rept 2 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x2, x4 + ld1 {v2.4h}, x3, x5 + ld1 {v3.4h}, x3, x5 + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + add v4.8h, v0.8h, v2.8h + add v5.8h, v1.8h, v3.8h + sqxtun v4.8b, v4.8h + sqxtun v5.8b, v5.8h + st1 {v4.s}0, x0, x1 + st1 {v5.s}0, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_add_ps_8x8_neon) + lsl x5, x5, #1 +.rept 4 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x2, x4 + ld1 {v2.8h}, x3, x5 + ld1 {v3.8h}, x3, x5 + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + add v4.8h, v0.8h, v2.8h + add v5.8h, v1.8h, v3.8h + sqxtun v4.8b, v4.8h + sqxtun v5.8b, v5.8h + st1 {v4.8b}, x0, x1 + st1 {v5.8b}, x0, x1 +.endr + ret +endfunc + +.macro pixel_add_ps_16xN_neon h +function PFX(pixel_add_ps_16x\h\()_neon) + lsl x5, x5, #1 + mov w12, #\h / 8 +.loop_add_ps_16x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b}, x2, x4 + ld1 {v1.16b}, x2, x4 + ld1 {v16.8h-v17.8h}, x3, x5 + ld1 {v18.8h-v19.8h}, x3, x5 + uxtl v4.8h, v0.8b + uxtl2 v5.8h, v0.16b + uxtl v6.8h, v1.8b + uxtl2 v7.8h, v1.16b + add v24.8h, v4.8h, v16.8h + add v25.8h, v5.8h, v17.8h + add v26.8h, v6.8h, v18.8h + add v27.8h, v7.8h, v19.8h + sqxtun v4.8b, v24.8h + sqxtun2 v4.16b, v25.8h + sqxtun v5.8b, v26.8h + sqxtun2 v5.16b, v27.8h + st1 {v4.16b}, x0, x1 + st1 {v5.16b}, x0, x1 +.endr + cbnz w12, .loop_add_ps_16x\h + ret +endfunc +.endm + +pixel_add_ps_16xN_neon 16 +pixel_add_ps_16xN_neon 32 + +.macro pixel_add_ps_32xN_neon h + function PFX(pixel_add_ps_32x\h\()_neon) + lsl x5, x5, #1 + mov w12, #\h / 4 +.loop_add_ps_32x\h\(): + sub w12, w12, #1 +.rept 4 + ld1 {v0.16b-v1.16b}, x2, x4 + ld1 {v16.8h-v19.8h}, x3, x5 + uxtl v4.8h, v0.8b + uxtl2 v5.8h, v0.16b + uxtl v6.8h, v1.8b + uxtl2 v7.8h, v1.16b + add v24.8h, v4.8h, v16.8h + add v25.8h, v5.8h, v17.8h + add v26.8h, v6.8h, v18.8h + add v27.8h, v7.8h, v19.8h + sqxtun v4.8b, v24.8h + sqxtun2 v4.16b, v25.8h + sqxtun v5.8b, v26.8h + sqxtun2 v5.16b, v27.8h + st1 {v4.16b-v5.16b}, x0, x1 +.endr + cbnz w12, .loop_add_ps_32x\h + ret +endfunc +.endm + +pixel_add_ps_32xN_neon 32 +pixel_add_ps_32xN_neon 64 + +function PFX(pixel_add_ps_64x64_neon) + lsl x5, x5, #1 + sub x5, x5, #64 + mov w12, #32 +.loop_add_ps_64x64: + sub w12, w12, #1 +.rept 2 + ld1 {v0.16b-v3.16b}, x2, x4 + ld1 {v16.8h-v19.8h}, x3, #64 + ld1 {v20.8h-v23.8h}, x3, x5 + uxtl v4.8h, v0.8b + uxtl2 v5.8h, v0.16b + uxtl v6.8h, v1.8b + uxtl2 v7.8h, v1.16b + uxtl v24.8h, v2.8b + uxtl2 v25.8h, v2.16b + uxtl v26.8h, v3.8b + uxtl2 v27.8h, v3.16b + add v0.8h, v4.8h, v16.8h + add v1.8h, v5.8h, v17.8h + add v2.8h, v6.8h, v18.8h + add v3.8h, v7.8h, v19.8h + add v4.8h, v24.8h, v20.8h + add v5.8h, v25.8h, v21.8h + add v6.8h, v26.8h, v22.8h + add v7.8h, v27.8h, v23.8h + sqxtun v0.8b, v0.8h + sqxtun2 v0.16b, v1.8h + sqxtun v1.8b, v2.8h + sqxtun2 v1.16b, v3.8h + sqxtun v2.8b, v4.8h + sqxtun2 v2.16b, v5.8h + sqxtun v3.8b, v6.8h + sqxtun2 v3.16b, v7.8h + st1 {v0.16b-v3.16b}, x0, x1 +.endr + cbnz w12, .loop_add_ps_64x64 + ret +endfunc + +// Chroma add_ps +function PFX(pixel_add_ps_4x8_neon) + lsl x5, x5, #1 +.rept 4 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x2, x4 + ld1 {v2.4h}, x3, x5 + ld1 {v3.4h}, x3, x5 + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + add v4.8h, v0.8h, v2.8h + add v5.8h, v1.8h, v3.8h + sqxtun v4.8b, v4.8h + sqxtun v5.8b, v5.8h + st1 {v4.s}0, 
x0, x1 + st1 {v5.s}0, x0, x1 +.endr + ret +endfunc + +function PFX(pixel_add_ps_8x16_neon) + lsl x5, x5, #1 +.rept 8 + ld1 {v0.8b}, x2, x4 + ld1 {v1.8b}, x2, x4 + ld1 {v2.8h}, x3, x5 + ld1 {v3.8h}, x3, x5 + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + add v4.8h, v0.8h, v2.8h + add v5.8h, v1.8h, v3.8h + sqxtun v4.8b, v4.8h + sqxtun v5.8b, v5.8h + st1 {v4.8b}, x0, x1 + st1 {v5.8b}, x0, x1 +.endr + ret +endfunc + +// void scale1D_128to64(pixel *dst, const pixel *src) +function PFX(scale1D_128to64_neon) +.rept 2 + ld2 {v0.16b, v1.16b}, x1, #32 + ld2 {v2.16b, v3.16b}, x1, #32 + ld2 {v4.16b, v5.16b}, x1, #32 + ld2 {v6.16b, v7.16b}, x1, #32 + urhadd v0.16b, v0.16b, v1.16b + urhadd v1.16b, v2.16b, v3.16b + urhadd v2.16b, v4.16b, v5.16b + urhadd v3.16b, v6.16b, v7.16b + st1 {v0.16b-v3.16b}, x0, #64 +.endr + ret +endfunc + +.macro scale2D_1 v0, v1 + uaddlp \v0\().8h, \v0\().16b + uaddlp \v1\().8h, \v1\().16b + add \v0\().8h, \v0\().8h, \v1\().8h +.endm + +// void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride) +function PFX(scale2D_64to32_neon) + mov w12, #32 +.loop_scale2D: + ld1 {v0.16b-v3.16b}, x1, x2 + sub w12, w12, #1 + ld1 {v4.16b-v7.16b}, x1, x2 + scale2D_1 v0, v4 + scale2D_1 v1, v5 + scale2D_1 v2, v6 + scale2D_1 v3, v7 + uqrshrn v0.8b, v0.8h, #2 + uqrshrn2 v0.16b, v1.8h, #2 + uqrshrn v1.8b, v2.8h, #2 + uqrshrn2 v1.16b, v3.8h, #2 + st1 {v0.16b-v1.16b}, x0, #32 + cbnz w12, .loop_scale2D + ret +endfunc + +// void planecopy_cp_c(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift) +function PFX(pixel_planecopy_cp_neon) + dup v2.16b, w6 + sub x5, x5, #1 +.loop_h: + mov x6, x0 + mov x12, x2 + mov x7, #0 +.loop_w: + ldr q0, x6, #16 + ushl v0.16b, v0.16b, v2.16b + str q0, x12, #16 + add x7, x7, #16 + cmp x7, x4 + blt .loop_w + + add x0, x0, x1 + add x2, x2, x3 + sub x5, x5, #1 + cbnz x5, .loop_h + +// handle last row + mov x5, x4 + lsr x5, x5, #3 +.loopW8: + ldr d0, x0, #8 + ushl v0.8b, v0.8b, v2.8b + str d0, x2, #8 + sub x4, x4, #8 + sub x5, x5, #1 + cbnz x5, .loopW8 + + mov x5, #8 + sub x5, x5, x4 + sub x0, x0, x5 + sub x2, x2, x5 + ldr d0, x0 + ushl v0.8b, v0.8b, v2.8b + str d0, x2 + ret +endfunc + +//******* satd ******* +.macro satd_4x4_neon + ld1 {v0.s}0, x0, x1 + ld1 {v0.s}1, x0, x1 + ld1 {v1.s}0, x2, x3 + ld1 {v1.s}1, x2, x3 + ld1 {v2.s}0, x0, x1 + ld1 {v2.s}1, x0, x1 + ld1 {v3.s}0, x2, x3 + ld1 {v3.s}1, x2, x3 + + usubl v4.8h, v0.8b, v1.8b + usubl v5.8h, v2.8b, v3.8b + + add v6.8h, v4.8h, v5.8h + sub v7.8h, v4.8h, v5.8h + + mov v4.d0, v6.d1 + add v0.4h, v6.4h, v4.4h + sub v2.4h, v6.4h, v4.4h + + mov v5.d0, v7.d1 + add v1.4h, v7.4h, v5.4h + sub v3.4h, v7.4h, v5.4h + + trn1 v4.4h, v0.4h, v1.4h + trn2 v5.4h, v0.4h, v1.4h + + trn1 v6.4h, v2.4h, v3.4h + trn2 v7.4h, v2.4h, v3.4h + + add v0.4h, v4.4h, v5.4h + sub v1.4h, v4.4h, v5.4h + + add v2.4h, v6.4h, v7.4h + sub v3.4h, v6.4h, v7.4h + + trn1 v4.2s, v0.2s, v1.2s + trn2 v5.2s, v0.2s, v1.2s + + trn1 v6.2s, v2.2s, v3.2s + trn2 v7.2s, v2.2s, v3.2s + + abs v4.4h, v4.4h + abs v5.4h, v5.4h + abs v6.4h, v6.4h + abs v7.4h, v7.4h + + smax v1.4h, v4.4h, v5.4h + smax v2.4h, v6.4h, v7.4h + + add v0.4h, v1.4h, v2.4h + uaddlp v0.2s, v0.4h + uaddlp v0.1d, v0.2s +.endm + +// int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +function PFX(pixel_satd_4x4_neon) + satd_4x4_neon + fmov x0, d0 + ret +endfunc + .macro x265_satd_4x8_8x4_end_neon add v0.8h, v4.8h, v6.8h add v1.8h, v5.8h, v7.8h @@ -59,7 +699,7 @@ .endm .macro pixel_satd_4x8_neon - ld1r {v1.2s}, x2, 
x3 + ld1r {v1.2s}, x2, x3 ld1r {v0.2s}, x0, x1 ld1r {v3.2s}, x2, x3 ld1r {v2.2s}, x0, x1 @@ -82,129 +722,995 @@ sub v5.8h, v0.8h, v1.8h ld1 {v6.s}1, x0, x1 usubl v3.8h, v6.8b, v7.8b - add v6.8h, v2.8h, v3.8h - sub v7.8h, v2.8h, v3.8h + add v6.8h, v2.8h, v3.8h + sub v7.8h, v2.8h, v3.8h x265_satd_4x8_8x4_end_neon .endm -// template<int w, int h> -// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_4x8_neon - pixel_satd_4x8_neon - mov w0, v0.s0 - ret +// template<int w, int h> +// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +function PFX(pixel_satd_4x8_neon) + pixel_satd_4x8_neon + mov w0, v0.s0 + ret +endfunc + +function PFX(pixel_satd_4x16_neon) + mov w4, #0 + pixel_satd_4x8_neon + mov w5, v0.s0 + add w4, w4, w5 + pixel_satd_4x8_neon + mov w5, v0.s0 + add w0, w5, w4 + ret +endfunc + +function PFX(pixel_satd_4x32_neon) + mov w4, #0 +.rept 4 + pixel_satd_4x8_neon + mov w5, v0.s0 + add w4, w4, w5 +.endr + mov w0, w4 + ret +endfunc + +function PFX(pixel_satd_12x16_neon) + mov x4, x0 + mov x5, x2 + mov w7, #0 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 + + add x0, x4, #4 + add x2, x5, #4 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 + + add x0, x4, #8 + add x2, x5, #8 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w0, w7, w6 + ret +endfunc + +function PFX(pixel_satd_12x32_neon) + mov x4, x0 + mov x5, x2 + mov w7, #0 +.rept 4 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 +.endr + + add x0, x4, #4 + add x2, x5, #4 +.rept 4 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 +.endr + + add x0, x4, #8 + add x2, x5, #8 +.rept 4 + pixel_satd_4x8_neon + mov w6, v0.s0 + add w7, w7, w6 +.endr + + mov w0, w7 + ret +endfunc + +function PFX(pixel_satd_8x4_neon) + mov x4, x0 + mov x5, x2 + satd_4x4_neon + add x0, x4, #4 + add x2, x5, #4 + umov x6, v0.d0 + satd_4x4_neon + umov x0, v0.d0 + add x0, x0, x6 + ret +endfunc + +.macro LOAD_DIFF_8x4 v0 v1 v2 v3 + ld1 {v0.8b}, x0, x1 + ld1 {v1.8b}, x2, x3 + ld1 {v2.8b}, x0, x1 + ld1 {v3.8b}, x2, x3 + ld1 {v4.8b}, x0, x1 + ld1 {v5.8b}, x2, x3 + ld1 {v6.8b}, x0, x1 + ld1 {v7.8b}, x2, x3 + usubl \v0, v0.8b, v1.8b + usubl \v1, v2.8b, v3.8b + usubl \v2, v4.8b, v5.8b + usubl \v3, v6.8b, v7.8b +.endm + +.macro LOAD_DIFF_16x4 v0 v1 v2 v3 v4 v5 v6 v7 + ld1 {v0.16b}, x0, x1 + ld1 {v1.16b}, x2, x3 + ld1 {v2.16b}, x0, x1 + ld1 {v3.16b}, x2, x3 + ld1 {v4.16b}, x0, x1 + ld1 {v5.16b}, x2, x3 + ld1 {v6.16b}, x0, x1 + ld1 {v7.16b}, x2, x3 + usubl \v0, v0.8b, v1.8b + usubl \v1, v2.8b, v3.8b + usubl \v2, v4.8b, v5.8b + usubl \v3, v6.8b, v7.8b + usubl2 \v4, v0.16b, v1.16b + usubl2 \v5, v2.16b, v3.16b + usubl2 \v6, v4.16b, v5.16b + usubl2 \v7, v6.16b, v7.16b +.endm + +function PFX(satd_16x4_neon), export=0 + LOAD_DIFF_16x4 v16.8h, v17.8h, v18.8h, v19.8h, v20.8h, v21.8h, v22.8h, v23.8h + b PFX(satd_8x4v_8x8h_neon) +endfunc + +function PFX(satd_8x8_neon), export=0 + LOAD_DIFF_8x4 v16.8h, v17.8h, v18.8h, v19.8h + LOAD_DIFF_8x4 v20.8h, v21.8h, v22.8h, v23.8h + b PFX(satd_8x4v_8x8h_neon) +endfunc + +// one vertical hadamard pass and two horizontal +function PFX(satd_8x4v_8x8h_neon), export=0 + HADAMARD4_V v16.8h, v18.8h, v17.8h, v19.8h, v0.8h, v2.8h, v1.8h, v3.8h + HADAMARD4_V v20.8h, v21.8h, v22.8h, v23.8h, v0.8h, v1.8h, v2.8h, v3.8h + trn4 v0.8h, v1.8h, v2.8h, v3.8h, v16.8h, v17.8h, 
v18.8h, v19.8h + trn4 v4.8h, v5.8h, v6.8h, v7.8h, v20.8h, v21.8h, v22.8h, v23.8h + SUMSUB_ABCD v16.8h, v17.8h, v18.8h, v19.8h, v0.8h, v1.8h, v2.8h, v3.8h + SUMSUB_ABCD v20.8h, v21.8h, v22.8h, v23.8h, v4.8h, v5.8h, v6.8h, v7.8h + trn4 v0.4s, v2.4s, v1.4s, v3.4s, v16.4s, v18.4s, v17.4s, v19.4s + trn4 v4.4s, v6.4s, v5.4s, v7.4s, v20.4s, v22.4s, v21.4s, v23.4s + ABS8 v0.8h, v1.8h, v2.8h, v3.8h, v4.8h, v5.8h, v6.8h, v7.8h + smax v0.8h, v0.8h, v2.8h + smax v1.8h, v1.8h, v3.8h + smax v2.8h, v4.8h, v6.8h + smax v3.8h, v5.8h, v7.8h + ret +endfunc + +function PFX(pixel_satd_8x8_neon) + mov x10, x30 + bl PFX(satd_8x8_neon) + add v0.8h, v0.8h, v1.8h + add v1.8h, v2.8h, v3.8h + add v0.8h, v0.8h, v1.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_8x12_neon) + mov x4, x0 + mov x5, x2 + mov x7, #0 + satd_4x4_neon + umov x6, v0.d0 + add x7, x7, x6 + add x0, x4, #4 + add x2, x5, #4 + satd_4x4_neon + umov x6, v0.d0 + add x7, x7, x6 +.rept 2 + sub x0, x0, #4 + sub x2, x2, #4 + mov x4, x0 + mov x5, x2 + satd_4x4_neon + umov x6, v0.d0 + add x7, x7, x6 + add x0, x4, #4 + add x2, x5, #4 + satd_4x4_neon + umov x6, v0.d0 + add x7, x7, x6 +.endr + mov x0, x7 + ret +endfunc + +function PFX(pixel_satd_8x16_neon) + mov x10, x30 + bl PFX(satd_8x8_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h + bl PFX(satd_8x8_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_8x32_neon) + mov x10, x30 + bl PFX(satd_8x8_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 3 + bl PFX(satd_8x8_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_8x64_neon) + mov x10, x30 + bl PFX(satd_8x8_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 7 + bl PFX(satd_8x8_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_16x4_neon) + mov x10, x30 + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_16x8_neon) + mov x10, x30 + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_16x12_neon) + mov x10, x30 + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 2 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_16x16_neon) + mov x10, x30 + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 3 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov 
w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_16x24_neon) + mov x10, x30 + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 5 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +.macro pixel_satd_16x32_neon + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 7 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr +.endm + +function PFX(pixel_satd_16x32_neon) + mov x10, x30 + pixel_satd_16x32_neon + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 endfunc -// template<int w, int h> -// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_4x16_neon - eor w4, w4, w4 - pixel_satd_4x8_neon - mov w5, v0.s0 - add w4, w4, w5 - pixel_satd_4x8_neon - mov w5, v0.s0 - add w0, w5, w4 - ret +function PFX(pixel_satd_16x64_neon) + mov x10, x30 + bl PFX(satd_16x4_neon) + add v30.8h, v0.8h, v1.8h + add v31.8h, v2.8h, v3.8h +.rept 15 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 endfunc -// template<int w, int h> -// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_4x32_neon - eor w4, w4, w4 +function PFX(pixel_satd_24x32_neon) + mov x10, x30 + mov x7, #0 + mov x4, x0 + mov x5, x2 +.rept 3 + movi v30.8h, #0 + movi v31.8h, #0 .rept 4 - pixel_satd_4x8_neon - mov w5, v0.s0 - add w4, w4, w5 + bl PFX(satd_8x8_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h .endr - mov w0, w4 - ret + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w6, v0.s0 + add x7, x7, x6 + add x4, x4, #8 + add x5, x5, #8 + mov x0, x4 + mov x2, x5 +.endr + mov x0, x7 + ret x10 endfunc -// template<int w, int h> -// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_12x16_neon +function PFX(pixel_satd_24x64_neon) + mov x10, x30 + mov x7, #0 mov x4, x0 mov x5, x2 - eor w7, w7, w7 - pixel_satd_4x8_neon +.rept 3 + movi v30.8h, #0 + movi v31.8h, #0 +.rept 4 + bl PFX(satd_8x8_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h mov w6, v0.s0 - add w7, w7, w6 - pixel_satd_4x8_neon + add x7, x7, x6 + add x4, x4, #8 + add x5, x5, #8 + mov x0, x4 + mov x2, x5 +.endr + sub x4, x4, #24 + sub x5, x5, #24 + add x0, x4, x1, lsl #5 + add x2, x5, x3, lsl #5 + mov x4, x0 + mov x5, x2 +.rept 3 + movi v30.8h, #0 + movi v31.8h, #0 +.rept 4 + bl PFX(satd_8x8_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h mov w6, v0.s0 - add w7, w7, w6 + add x7, x7, x6 + add x4, x4, #8 + add x5, x5, #8 + mov x0, x4 + mov x2, x5 +.endr + mov x0, x7 + ret x10 +endfunc - add x0, x4, #4 - add x2, x5, #4 - pixel_satd_4x8_neon - mov w6, v0.s0 - add w7, w7, w6 - pixel_satd_4x8_neon - mov w6, v0.s0 - add w7, w7, w6 +.macro pixel_satd_32x8 + mov x4, x0 + 
mov x5, x2 +.rept 2 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr + add x0, x4, #16 + add x2, x5, #16 +.rept 2 + bl PFX(satd_16x4_neon) + add v30.8h, v30.8h, v0.8h + add v31.8h, v31.8h, v1.8h + add v30.8h, v30.8h, v2.8h + add v31.8h, v31.8h, v3.8h +.endr +.endm - add x0, x4, #8 - add x2, x5, #8 - pixel_satd_4x8_neon - mov w6, v0.s0 - add w7, w7, w6 - pixel_satd_4x8_neon +.macro satd_32x16_neon + movi v30.8h, #0 + movi v31.8h, #0 + pixel_satd_32x8 + sub x0, x0, #16 + sub x2, x2, #16 + pixel_satd_32x8 + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h mov w6, v0.s0 - add w0, w7, w6 - ret -endfunc +.endm -// template<int w, int h> -// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_12x32_neon +.macro satd_64x16_neon + mov x8, x0 + mov x9, x2 + satd_32x16_neon + add x7, x7, x6 + add x0, x8, #32 + add x2, x9, #32 + satd_32x16_neon + add x7, x7, x6 +.endm + +function PFX(pixel_satd_32x8_neon) + mov x10, x30 + mov x7, #0 mov x4, x0 mov x5, x2 - eor w7, w7, w7 -.rept 4 - pixel_satd_4x8_neon - mov w6, v0.s0 - add w7, w7, w6 + movi v30.8h, #0 + movi v31.8h, #0 + pixel_satd_32x8 + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + ret x10 +endfunc + +function PFX(pixel_satd_32x16_neon) + mov x10, x30 + satd_32x16_neon + mov x0, x6 + ret x10 +endfunc + +function PFX(pixel_satd_32x24_neon) + mov x10, x30 + satd_32x16_neon + movi v30.8h, #0 + movi v31.8h, #0 + sub x0, x0, #16 + sub x2, x2, #16 + pixel_satd_32x8 + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + add x0, x0, x6 + ret x10 +endfunc + +function PFX(pixel_satd_32x32_neon) + mov x10, x30 + mov x7, #0 + satd_32x16_neon + sub x0, x0, #16 + sub x2, x2, #16 + add x7, x7, x6 + satd_32x16_neon + add x0, x7, x6 + ret x10 +endfunc + +function PFX(pixel_satd_32x48_neon) + mov x10, x30 + mov x7, #0 +.rept 2 + satd_32x16_neon + sub x0, x0, #16 + sub x2, x2, #16 + add x7, x7, x6 .endr + satd_32x16_neon + add x0, x7, x6 + ret x10 +endfunc - add x0, x4, #4 - add x2, x5, #4 -.rept 4 - pixel_satd_4x8_neon - mov w6, v0.s0 - add w7, w7, w6 +function PFX(pixel_satd_32x64_neon) + mov x10, x30 + mov x7, #0 +.rept 3 + satd_32x16_neon + sub x0, x0, #16 + sub x2, x2, #16 + add x7, x7, x6 .endr + satd_32x16_neon + add x0, x7, x6 + ret x10 +endfunc - add x0, x4, #8 - add x2, x5, #8 -.rept 4 - pixel_satd_4x8_neon - mov w6, v0.s0 - add w7, w7, w6 +function PFX(pixel_satd_64x16_neon) + mov x10, x30 + mov x7, #0 + satd_64x16_neon + mov x0, x7 + ret x10 +endfunc + +function PFX(pixel_satd_64x32_neon) + mov x10, x30 + mov x7, #0 + satd_64x16_neon + sub x0, x0, #48 + sub x2, x2, #48 + satd_64x16_neon + mov x0, x7 + ret x10 +endfunc + +function PFX(pixel_satd_64x48_neon) + mov x10, x30 + mov x7, #0 +.rept 2 + satd_64x16_neon + sub x0, x0, #48 + sub x2, x2, #48 .endr + satd_64x16_neon + mov x0, x7 + ret x10 +endfunc - mov w0, w7 +function PFX(pixel_satd_64x64_neon) + mov x10, x30 + mov x7, #0 +.rept 3 + satd_64x16_neon + sub x0, x0, #48 + sub x2, x2, #48 +.endr + satd_64x16_neon + mov x0, x7 + ret x10 +endfunc + +function PFX(pixel_satd_48x64_neon) + mov x10, x30 + mov x7, #0 + mov x8, x0 + mov x9, x2 +.rept 3 + satd_32x16_neon + sub x0, x0, #16 + sub x2, x2, #16 + add x7, x7, x6 +.endr + satd_32x16_neon + add x7, x7, x6 + + add x0, x8, #32 + add x2, x9, #32 + pixel_satd_16x32_neon + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w6, v0.s0 + add x7, x7, x6 + + movi v30.8h, #0 + movi 
v31.8h, #0 + pixel_satd_16x32_neon + add v0.8h, v30.8h, v31.8h + uaddlv s0, v0.8h + mov w6, v0.s0 + add x0, x7, x6 + ret x10 +endfunc + +function PFX(sa8d_8x8_neon), export=0 + LOAD_DIFF_8x4 v16.8h, v17.8h, v18.8h, v19.8h + LOAD_DIFF_8x4 v20.8h, v21.8h, v22.8h, v23.8h + HADAMARD4_V v16.8h, v18.8h, v17.8h, v19.8h, v0.8h, v2.8h, v1.8h, v3.8h + HADAMARD4_V v20.8h, v21.8h, v22.8h, v23.8h, v0.8h, v1.8h, v2.8h, v3.8h + SUMSUB_ABCD v0.8h, v16.8h, v1.8h, v17.8h, v16.8h, v20.8h, v17.8h, v21.8h + SUMSUB_ABCD v2.8h, v18.8h, v3.8h, v19.8h, v18.8h, v22.8h, v19.8h, v23.8h + trn4 v4.8h, v5.8h, v6.8h, v7.8h, v0.8h, v1.8h, v2.8h, v3.8h + trn4 v20.8h, v21.8h, v22.8h, v23.8h, v16.8h, v17.8h, v18.8h, v19.8h + SUMSUB_ABCD v2.8h, v3.8h, v24.8h, v25.8h, v20.8h, v21.8h, v4.8h, v5.8h + SUMSUB_ABCD v0.8h, v1.8h, v4.8h, v5.8h, v22.8h, v23.8h, v6.8h, v7.8h + trn4 v20.4s, v22.4s, v21.4s, v23.4s, v2.4s, v0.4s, v3.4s, v1.4s + trn4 v16.4s, v18.4s, v17.4s, v19.4s, v24.4s, v4.4s, v25.4s, v5.4s + SUMSUB_ABCD v0.8h, v2.8h, v1.8h, v3.8h, v20.8h, v22.8h, v21.8h, v23.8h + SUMSUB_ABCD v4.8h, v6.8h, v5.8h, v7.8h, v16.8h, v18.8h, v17.8h, v19.8h + trn4 v16.2d, v20.2d, v17.2d, v21.2d, v0.2d, v4.2d, v1.2d, v5.2d + trn4 v18.2d, v22.2d, v19.2d, v23.2d, v2.2d, v6.2d, v3.2d, v7.2d + ABS8 v16.8h, v17.8h, v18.8h, v19.8h, v20.8h, v21.8h, v22.8h, v23.8h + smax v16.8h, v16.8h, v20.8h + smax v17.8h, v17.8h, v21.8h + smax v18.8h, v18.8h, v22.8h + smax v19.8h, v19.8h, v23.8h + add v0.8h, v16.8h, v17.8h + add v1.8h, v18.8h, v19.8h ret endfunc -// template<int w, int h> -// int satd4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_8x8_neon - eor w4, w4, w4 - mov x6, x0 - mov x7, x2 - pixel_satd_4x8_neon - mov w5, v0.s0 - add w4, w4, w5 - add x0, x6, #4 - add x2, x7, #4 - pixel_satd_4x8_neon +function PFX(pixel_sa8d_8x8_neon) + mov x10, x30 + bl PFX(sa8d_8x8_neon) + add v0.8h, v0.8h, v1.8h + uaddlv s0, v0.8h + mov w0, v0.s0 + add w0, w0, #1 + lsr w0, w0, #1 + ret x10 +endfunc + +function PFX(pixel_sa8d_8x16_neon) + mov x10, x30 + bl PFX(sa8d_8x8_neon) + add v0.8h, v0.8h, v1.8h + uaddlv s0, v0.8h mov w5, v0.s0 + add w5, w5, #1 + lsr w5, w5, #1 + bl PFX(sa8d_8x8_neon) + add v0.8h, v0.8h, v1.8h + uaddlv s0, v0.8h + mov w4, v0.s0 + add w4, w4, #1 + lsr w4, w4, #1 + add w0, w4, w5 + ret x10 +endfunc + +.macro sa8d_16x16 reg + bl PFX(sa8d_8x8_neon) + uaddlp v30.4s, v0.8h + uaddlp v31.4s, v1.8h + bl PFX(sa8d_8x8_neon) + uadalp v30.4s, v0.8h + uadalp v31.4s, v1.8h + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + bl PFX(sa8d_8x8_neon) + uadalp v30.4s, v0.8h + uadalp v31.4s, v1.8h + bl PFX(sa8d_8x8_neon) + uadalp v30.4s, v0.8h + uadalp v31.4s, v1.8h + add v0.4s, v30.4s, v31.4s + addv s0, v0.4s + mov \reg, v0.s0 + add \reg, \reg, #1 + lsr \reg, \reg, #1 +.endm + +function PFX(pixel_sa8d_16x16_neon) + mov x10, x30 + sa8d_16x16 w0 + ret x10 +endfunc + +function PFX(pixel_sa8d_16x32_neon) + mov x10, x30 + sa8d_16x16 w4 + sub x0, x0, #8 + sub x2, x2, #8 + sa8d_16x16 w5 add w0, w4, w5 + ret x10 +endfunc + +function PFX(pixel_sa8d_32x32_neon) + mov x10, x30 + sa8d_16x16 w4 + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + sa8d_16x16 w5 + sub x0, x0, #24 + sub x2, x2, #24 + sa8d_16x16 w6 + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + sa8d_16x16 w7 + add w4, w4, w5 + add w6, w6, w7 + add w0, w4, w6 + ret x10 +endfunc + +function PFX(pixel_sa8d_32x64_neon) + mov x10, x30 + mov w11, #4 + mov w9, #0 
+.loop_sa8d_32: + sub w11, w11, #1 + sa8d_16x16 w4 + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + sa8d_16x16 w5 + add w4, w4, w5 + add w9, w9, w4 + sub x0, x0, #24 + sub x2, x2, #24 + cbnz w11, .loop_sa8d_32 + mov w0, w9 + ret x10 +endfunc + +function PFX(pixel_sa8d_64x64_neon) + mov x10, x30 + mov w11, #4 + mov w9, #0 +.loop_sa8d_64: + sub w11, w11, #1 + sa8d_16x16 w4 + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + sa8d_16x16 w5 + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + sa8d_16x16 w6 + sub x0, x0, x1, lsl #4 + sub x2, x2, x3, lsl #4 + add x0, x0, #8 + add x2, x2, #8 + sa8d_16x16 w7 + add w4, w4, w5 + add w6, w6, w7 + add w8, w4, w6 + add w9, w9, w8 + + sub x0, x0, #56 + sub x2, x2, #56 + cbnz w11, .loop_sa8d_64 + mov w0, w9 + ret x10 +endfunc + +/***** dequant_scaling*****/ +// void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift) +function PFX(dequant_scaling_neon) + add x5, x5, #4 // shift + 4 + lsr x3, x3, #3 // num / 8 + cmp x5, x4 + blt .dequant_skip + + mov x12, #1 + sub x6, x5, x4 // shift - per + sub x6, x6, #1 // shift - per - 1 + lsl x6, x12, x6 // 1 << shift - per - 1 (add) + dup v0.4s, w6 + sub x7, x4, x5 // per - shift + dup v3.4s, w7 + +.dequant_loop1: + ld1 {v19.8h}, x0, #16 // quantCoef + ld1 {v2.4s}, x1, #16 // deQuantCoef + ld1 {v20.4s}, x1, #16 + sub x3, x3, #1 + sxtl v1.4s, v19.4h + sxtl2 v19.4s, v19.8h + + mul v1.4s, v1.4s, v2.4s // quantCoef * deQuantCoef + mul v19.4s, v19.4s, v20.4s + add v1.4s, v1.4s, v0.4s // quantCoef * deQuantCoef + add + add v19.4s, v19.4s, v0.4s + + sshl v1.4s, v1.4s, v3.4s + sshl v19.4s, v19.4s, v3.4s + sqxtn v16.4h, v1.4s // x265_clip3 + sqxtn2 v16.8h, v19.4s + st1 {v16.8h}, x2, #16 + cbnz x3, .dequant_loop1 + ret + +.dequant_skip: + sub x6, x4, x5 // per - shift + dup v0.8h, w6 + +.dequant_loop2: + ld1 {v19.8h}, x0, #16 // quantCoef + ld1 {v2.4s}, x1, #16 // deQuantCoef + ld1 {v20.4s}, x1, #16 + sub x3, x3, #1 + sxtl v1.4s, v19.4h + sxtl2 v19.4s, v19.8h + + mul v1.4s, v1.4s, v2.4s // quantCoef * deQuantCoef + mul v19.4s, v19.4s, v20.4s + sqxtn v16.4h, v1.4s // x265_clip3 + sqxtn2 v16.8h, v19.4s + + sqshl v16.8h, v16.8h, v0.8h // coefQ << per - shift + st1 {v16.8h}, x2, #16 + cbnz x3, .dequant_loop2 + ret +endfunc + +// void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift) +function PFX(dequant_normal_neon) + lsr w2, w2, #4 // num / 16 + neg w4, w4 + dup v0.8h, w3 + dup v1.4s, w4 + +.dqn_loop1: + ld1 {v2.8h, v3.8h}, x0, #32 + smull v16.4s, v2.4h, v0.4h + smull2 v17.4s, v2.8h, v0.8h + smull v18.4s, v3.4h, v0.4h + smull2 v19.4s, v3.8h, v0.8h + + srshl v16.4s, v16.4s, v1.4s + srshl v17.4s, v17.4s, v1.4s + srshl v18.4s, v18.4s, v1.4s + srshl v19.4s, v19.4s, v1.4s + + sqxtn v2.4h, v16.4s + sqxtn2 v2.8h, v17.4s + sqxtn v3.4h, v18.4s + sqxtn2 v3.8h, v19.4s + + sub w2, w2, #1 + st1 {v2.8h, v3.8h}, x1, #32 + cbnz w2, .dqn_loop1 + ret +endfunc + +/********* ssim ***********/ +// void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums24) +function PFX(ssim_4x4x2_core_neon) + ld1 {v0.8b}, x0, x1 + ld1 {v1.8b}, x0, x1 + ld1 {v2.8b}, x0, x1 + ld1 {v3.8b}, x0, x1 + + ld1 {v4.8b}, x2, x3 + ld1 {v5.8b}, x2, x3 + ld1 {v6.8b}, x2, x3 + ld1 {v7.8b}, x2, x3 + + umull v16.8h, v0.8b, v0.8b + umull v17.8h, v1.8b, v1.8b + umull v18.8h, v2.8b, v2.8b + uaddlp v30.4s, v16.8h + umull v19.8h, v3.8b, 
v3.8b + umull v20.8h, v4.8b, v4.8b + umull v21.8h, v5.8b, v5.8b + uadalp v30.4s, v17.8h + umull v22.8h, v6.8b, v6.8b + umull v23.8h, v7.8b, v7.8b + + umull v24.8h, v0.8b, v4.8b + uadalp v30.4s, v18.8h + umull v25.8h, v1.8b, v5.8b + umull v26.8h, v2.8b, v6.8b + umull v27.8h, v3.8b, v7.8b + uadalp v30.4s, v19.8h + + uaddl v28.8h, v0.8b, v1.8b + uaddl v29.8h, v4.8b, v5.8b + uadalp v30.4s, v20.8h + uaddlp v31.4s, v24.8h + + uaddw v28.8h, v28.8h, v2.8b + uaddw v29.8h, v29.8h, v6.8b + uadalp v30.4s, v21.8h + uadalp v31.4s, v25.8h + + uaddw v28.8h, v28.8h, v3.8b + uaddw v29.8h, v29.8h, v7.8b + uadalp v30.4s, v22.8h + uadalp v31.4s, v26.8h + + uaddlp v28.4s, v28.8h + uaddlp v29.4s, v29.8h + uadalp v30.4s, v23.8h + uadalp v31.4s, v27.8h + + addp v28.4s, v28.4s, v28.4s + addp v29.4s, v29.4s, v29.4s + addp v30.4s, v30.4s, v30.4s + addp v31.4s, v31.4s, v31.4s + + st4 {v28.2s, v29.2s, v30.2s, v31.2s}, x4 ret endfunc // int psyCost_pp(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride) -function x265_psyCost_4x4_neon +function PFX(psyCost_4x4_neon) ld1r {v4.2s}, x0, x1 ld1r {v5.2s}, x0, x1 ld1 {v4.s}1, x0, x1 @@ -286,7 +1792,7 @@ endfunc // uint32_t quant_c(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff) -function x265_quant_neon +function PFX(quant_neon) mov w9, #1 lsl w9, w9, w4 dup v0.2s, w9 @@ -341,79 +1847,597 @@ ret endfunc -.macro satd_4x4_neon - ld1 {v1.s}0, x2, x3 - ld1 {v0.s}0, x0, x1 - ld1 {v3.s}0, x2, x3 - ld1 {v2.s}0, x0, x1 +// uint32_t nquant_c(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff) +function PFX(nquant_neon) + neg x12, x3 + dup v0.4s, w12 // q0= -qbits + dup v1.4s, w4 // add - ld1 {v1.s}1, x2, x3 - ld1 {v0.s}1, x0, x1 - ld1 {v3.s}1, x2, x3 - ld1 {v2.s}1, x0, x1 + lsr w5, w5, #2 + movi v4.4s, #0 // v4= accumulate numsig + mov x4, #0 + movi v22.4s, #0 - usubl v4.8h, v0.8b, v1.8b - usubl v5.8h, v2.8b, v3.8b +.loop_nquant: + ld1 {v16.4h}, x0, #8 + sub w5, w5, #1 + sxtl v19.4s, v16.4h // v19 = coefblockpos - add v6.8h, v4.8h, v5.8h - sub v7.8h, v4.8h, v5.8h + cmlt v18.4s, v19.4s, #0 // v18 = sign - mov v4.d0, v6.d1 - add v0.8h, v6.8h, v4.8h - sub v2.8h, v6.8h, v4.8h + abs v19.4s, v19.4s // v19 = level=abs(coefblockpos) + ld1 {v20.4s}, x1, #16 // v20 = quantCoeffblockpos + mul v19.4s, v19.4s, v20.4s // v19 = tmplevel = abs(level) * quantCoeffblockpos; - mov v5.d0, v7.d1 - add v1.8h, v7.8h, v5.8h - sub v3.8h, v7.8h, v5.8h + add v20.4s, v19.4s, v1.4s // v20 = tmplevel+add + sshl v20.4s, v20.4s, v0.4s // v20 = level =(tmplevel+add) >> qbits - trn1 v4.4h, v0.4h, v1.4h - trn2 v5.4h, v0.4h, v1.4h + // numsig + cmeq v21.4s, v20.4s, v22.4s + add v4.4s, v4.4s, v21.4s + add x4, x4, #4 - trn1 v6.4h, v2.4h, v3.4h - trn2 v7.4h, v2.4h, v3.4h + eor v21.16b, v20.16b, v18.16b + sub v21.4s, v21.4s, v18.4s + sqxtn v16.4h, v21.4s + abs v17.4h, v16.4h + st1 {v17.4h}, x2, #8 - add v0.4h, v4.4h, v5.4h - sub v1.4h, v4.4h, v5.4h + cbnz w5, .loop_nquant - add v2.4h, v6.4h, v7.4h - sub v3.4h, v6.4h, v7.4h + uaddlv d4, v4.4s + fmov x12, d4 + add x0, x4, x12 + ret +endfunc - trn1 v4.2s, v0.2s, v1.2s - trn2 v5.2s, v0.2s, v1.2s +// void ssimDist_c(const pixel* fenc, uint32_t fStride, const pixel* recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k) +.macro ssimDist_1 v4 v5 + sub v20.8h, \v4\().8h, \v5\().8h + smull v16.4s, \v4\().4h, \v4\().4h + smull2 v17.4s, \v4\().8h, \v4\().8h + smull v18.4s, v20.4h, v20.4h + smull2 v19.4s, v20.8h, v20.8h + add v0.4s, 
v0.4s, v16.4s + add v0.4s, v0.4s, v17.4s + add v1.4s, v1.4s, v18.4s + add v1.4s, v1.4s, v19.4s +.endm - trn1 v6.2s, v2.2s, v3.2s - trn2 v7.2s, v2.2s, v3.2s +function PFX(ssimDist4_neon) + ssimDist_start +.rept 4 + ld1 {v4.s}0, x0, x1 + ld1 {v5.s}0, x2, x3 + uxtl v4.8h, v4.8b + uxtl v5.8h, v5.8b + sub v2.4h, v4.4h, v5.4h + smull v3.4s, v4.4h, v4.4h + smull v2.4s, v2.4h, v2.4h + add v0.4s, v0.4s, v3.4s + add v1.4s, v1.4s, v2.4s +.endr + ssimDist_end + ret +endfunc - abs v4.4h, v4.4h - abs v5.4h, v5.4h - abs v6.4h, v6.4h - abs v7.4h, v7.4h +function PFX(ssimDist8_neon) + ssimDist_start +.rept 8 + ld1 {v4.8b}, x0, x1 + ld1 {v5.8b}, x2, x3 + uxtl v4.8h, v4.8b + uxtl v5.8h, v5.8b + ssimDist_1 v4, v5 +.endr + ssimDist_end + ret +endfunc - smax v1.4h, v4.4h, v5.4h - smax v2.4h, v6.4h, v7.4h +function PFX(ssimDist16_neon) + mov w12, #16 + ssimDist_start +.loop_ssimDist16: + sub w12, w12, #1 + ld1 {v4.16b}, x0, x1 + ld1 {v5.16b}, x2, x3 + uxtl v6.8h, v4.8b + uxtl v7.8h, v5.8b + uxtl2 v4.8h, v4.16b + uxtl2 v5.8h, v5.16b + ssimDist_1 v6, v7 + ssimDist_1 v4, v5 + cbnz w12, .loop_ssimDist16 + ssimDist_end + ret +endfunc - add v0.4h, v1.4h, v2.4h - uaddlp v0.2s, v0.4h - uaddlp v0.1d, v0.2s +function PFX(ssimDist32_neon) + mov w12, #32 + ssimDist_start +.loop_ssimDist32: + sub w12, w12, #1 + ld1 {v4.16b-v5.16b}, x0, x1 + ld1 {v6.16b-v7.16b}, x2, x3 + uxtl v21.8h, v4.8b + uxtl v22.8h, v6.8b + uxtl v23.8h, v5.8b + uxtl v24.8h, v7.8b + uxtl2 v25.8h, v4.16b + uxtl2 v26.8h, v6.16b + uxtl2 v27.8h, v5.16b + uxtl2 v28.8h, v7.16b + ssimDist_1 v21, v22 + ssimDist_1 v23, v24 + ssimDist_1 v25, v26 + ssimDist_1 v27, v28 + cbnz w12, .loop_ssimDist32 + ssimDist_end + ret +endfunc + +function PFX(ssimDist64_neon) + mov w12, #64 + ssimDist_start +.loop_ssimDist64: + sub w12, w12, #1 + ld1 {v4.16b-v7.16b}, x0, x1 + ld1 {v16.16b-v19.16b}, x2, x3 + uxtl v21.8h, v4.8b + uxtl v22.8h, v16.8b + uxtl v23.8h, v5.8b + uxtl v24.8h, v17.8b + uxtl2 v25.8h, v4.16b + uxtl2 v26.8h, v16.16b + uxtl2 v27.8h, v5.16b + uxtl2 v28.8h, v17.16b + ssimDist_1 v21, v22 + ssimDist_1 v23, v24 + ssimDist_1 v25, v26 + ssimDist_1 v27, v28 + uxtl v21.8h, v6.8b + uxtl v22.8h, v18.8b + uxtl v23.8h, v7.8b + uxtl v24.8h, v19.8b + uxtl2 v25.8h, v6.16b + uxtl2 v26.8h, v18.16b + uxtl2 v27.8h, v7.16b + uxtl2 v28.8h, v19.16b + ssimDist_1 v21, v22 + ssimDist_1 v23, v24 + ssimDist_1 v25, v26 + ssimDist_1 v27, v28 + cbnz w12, .loop_ssimDist64 + ssimDist_end + ret +endfunc + +// void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k) + +.macro normFact_1 v4 + smull v16.4s, \v4\().4h, \v4\().4h + smull2 v17.4s, \v4\().8h, \v4\().8h + add v0.4s, v0.4s, v16.4s + add v0.4s, v0.4s, v17.4s .endm -// int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_4x4_neon - satd_4x4_neon - umov x0, v0.d0 +function PFX(normFact8_neon) + normFact_start +.rept 8 + ld1 {v4.8b}, x0, x1 + uxtl v4.8h, v4.8b + normFact_1 v4 +.endr + normFact_end ret endfunc -// int satd_8x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) -function x265_pixel_satd_8x4_neon - mov x4, x0 - mov x5, x2 - satd_4x4_neon - add x0, x4, #4 - add x2, x5, #4 - umov x6, v0.d0 - satd_4x4_neon - umov x0, v0.d0 - add x0, x0, x6 +function PFX(normFact16_neon) + mov w12, #16 + normFact_start +.loop_normFact16: + sub w12, w12, #1 + ld1 {v4.16b}, x0, x1 + uxtl v5.8h, v4.8b + uxtl2 v4.8h, v4.16b + normFact_1 v5 + normFact_1 v4 + cbnz w12, .loop_normFact16 + normFact_end + ret +endfunc + +function 
PFX(normFact32_neon) + mov w12, #32 + normFact_start +.loop_normFact32: + sub w12, w12, #1 + ld1 {v4.16b-v5.16b}, x0, x1 + uxtl v6.8h, v4.8b + uxtl2 v4.8h, v4.16b + uxtl v7.8h, v5.8b + uxtl2 v5.8h, v5.16b + normFact_1 v4 + normFact_1 v5 + normFact_1 v6 + normFact_1 v7 + cbnz w12, .loop_normFact32 + normFact_end + ret +endfunc + +function PFX(normFact64_neon) + mov w12, #64 + normFact_start +.loop_normFact64: + sub w12, w12, #1 + ld1 {v4.16b-v7.16b}, x0, x1 + uxtl v26.8h, v4.8b + uxtl2 v24.8h, v4.16b + uxtl v27.8h, v5.8b + uxtl2 v25.8h, v5.16b + normFact_1 v24 + normFact_1 v25 + normFact_1 v26 + normFact_1 v27 + uxtl v26.8h, v6.8b + uxtl2 v24.8h, v6.16b + uxtl v27.8h, v7.8b + uxtl2 v25.8h, v7.16b + normFact_1 v24 + normFact_1 v25 + normFact_1 v26 + normFact_1 v27 + cbnz w12, .loop_normFact64 + normFact_end + ret +endfunc + +// void weight_pp_c(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset) +function PFX(weight_pp_neon) + sub x2, x2, x3 + ldr w9, sp // offset + lsl w5, w5, #6 // w0 << correction + + // count trailing zeros in w5 and compare against shift right amount. + rbit w10, w5 + clz w10, w10 + cmp w10, w7 + b.lt .unfoldedShift + + // shift right only removes trailing zeros: hoist LSR out of the loop. + lsr w10, w5, w7 // w0 << correction >> shift + dup v25.16b, w10 + lsr w6, w6, w7 // round >> shift + add w6, w6, w9 // round >> shift + offset + dup v26.8h, w6 + + // Check arithmetic range. + mov w11, #255 + madd w11, w11, w10, w6 + add w11, w11, w9 + lsr w11, w11, #16 + cbnz w11, .widenTo32Bit + + // 16-bit arithmetic is enough. +.loopHpp: + mov x12, x3 +.loopWpp: + ldr q0, x0, #16 + sub x12, x12, #16 + umull v1.8h, v0.8b, v25.8b // val *= w0 << correction >> shift + umull2 v2.8h, v0.16b, v25.16b + add v1.8h, v1.8h, v26.8h // val += round >> shift + offset + add v2.8h, v2.8h, v26.8h + sqxtun v0.8b, v1.8h // val = x265_clip(val) + sqxtun2 v0.16b, v2.8h + str q0, x1, #16 + cbnz x12, .loopWpp + add x1, x1, x2 + add x0, x0, x2 + sub x4, x4, #1 + cbnz x4, .loopHpp + ret + + // 32-bit arithmetic is needed. +.widenTo32Bit: +.loopHpp32: + mov x12, x3 +.loopWpp32: + ldr d0, x0, #8 + sub x12, x12, #8 + uxtl v0.8h, v0.8b + umull v1.4s, v0.4h, v25.4h // val *= w0 << correction >> shift + umull2 v2.4s, v0.8h, v25.8h + add v1.4s, v1.4s, v26.4s // val += round >> shift + offset + add v2.4s, v2.4s, v26.4s + sqxtn v0.4h, v1.4s // val = x265_clip(val) + sqxtn2 v0.8h, v2.4s + sqxtun v0.8b, v0.8h + str d0, x1, #8 + cbnz x12, .loopWpp32 + add x1, x1, x2 + add x0, x0, x2 + sub x4, x4, #1 + cbnz x4, .loopHpp32 + ret + + // The shift right cannot be moved out of the loop. 
+.unfoldedShift: + dup v25.8h, w5 // w0 << correction + dup v26.4s, w6 // round + neg w7, w7 // -shift + dup v27.4s, w7 + dup v29.4s, w9 // offset +.loopHppUS: + mov x12, x3 +.loopWppUS: + ldr d0, x0, #8 + sub x12, x12, #8 + uxtl v0.8h, v0.8b + umull v1.4s, v0.4h, v25.4h // val *= w0 + umull2 v2.4s, v0.8h, v25.8h + add v1.4s, v1.4s, v26.4s // val += round + add v2.4s, v2.4s, v26.4s + sshl v1.4s, v1.4s, v27.4s // val >>= shift + sshl v2.4s, v2.4s, v27.4s + add v1.4s, v1.4s, v29.4s // val += offset + add v2.4s, v2.4s, v29.4s + sqxtn v0.4h, v1.4s // val = x265_clip(val) + sqxtn2 v0.8h, v2.4s + sqxtun v0.8b, v0.8h + str d0, x1, #8 + cbnz x12, .loopWppUS + add x1, x1, x2 + add x0, x0, x2 + sub x4, x4, #1 + cbnz x4, .loopHppUS + ret +endfunc + +// int scanPosLast( +// const uint16_t *scan, // x0 +// const coeff_t *coeff, // x1 +// uint16_t *coeffSign, // x2 +// uint16_t *coeffFlag, // x3 +// uint8_t *coeffNum, // x4 +// int numSig, // x5 +// const uint16_t* scanCG4x4, // x6 +// const int trSize) // x7 +function PFX(scanPosLast_neon) + // convert unit of Stride(trSize) to int16_t + add x7, x7, x7 + + // load scan table and convert to Byte + ldp q0, q1, x6 + xtn v0.8b, v0.8h + xtn2 v0.16b, v1.8h // v0 - Zigzag scan table + + movrel x10, g_SPL_and_mask + ldr q28, x10 // v28 = mask for pmovmskb + movi v31.16b, #0 // v31 = {0, ..., 0} + add x10, x7, x7 // 2*x7 + add x11, x10, x7 // 3*x7 + add x9, x4, #1 // CG count + +.loop_spl: + // position of current CG + ldrh w6, x0, #32 + add x6, x1, x6, lsl #1 + + // loading current CG + ldr d2, x6 + ldr d3, x6, x7 + ldr d4, x6, x10 + ldr d5, x6, x11 + mov v2.d1, v3.d0 + mov v4.d1, v5.d0 + sqxtn v2.8b, v2.8h + sqxtn2 v2.16b, v4.8h + + // Zigzag + tbl v3.16b, {v2.16b}, v0.16b + + // get sign + cmhi v5.16b, v3.16b, v31.16b // v5 = non-zero + cmlt v3.16b, v3.16b, #0 // v3 = negative + + // val - w13 = pmovmskb(v3) + and v3.16b, v3.16b, v28.16b + mov d4, v3.d1 + addv b23, v3.8b + addv b24, v4.8b + mov v23.b1, v24.b0 + fmov w13, s23 + + // mask - w15 = pmovmskb(v5) + and v5.16b, v5.16b, v28.16b + mov d6, v5.d1 + addv b25, v5.8b + addv b26, v6.8b + mov v25.b1, v26.b0 + fmov w15, s25 + + // coeffFlag = reverse_bit(w15) in 16-bit + rbit w12, w15 + lsr w12, w12, #16 + fmov s30, w12 + strh w12, x3, #2 + + // accelerate by preparing w13 = w13 & w15 + and w13, w13, w15 + mov x14, xzr +.loop_spl_1: + cbz w15, .pext_end + clz w6, w15 + lsl w13, w13, w6 + lsl w15, w15, w6 + extr w14, w14, w13, #31 + bfm w15, wzr, #1, #0 + b .loop_spl_1 +.pext_end: + strh w14, x2, #2 + + // compute coeffNum = popcount(coeffFlag) + cnt v30.8b, v30.8b + addp v30.8b, v30.8b, v30.8b + fmov w6, s30 + sub x5, x5, x6 + strb w6, x4, #1 + + cbnz x5, .loop_spl + + // count trailing zeros + rbit w13, w12 + clz w13, w13 + lsr w12, w12, w13 + strh w12, x3, #-2 + + // get last pos + sub x9, x4, x9 + lsl x0, x9, #4 + eor w13, w13, #15 + add x0, x0, x13 + ret +endfunc + +// uint32_t costCoeffNxN( +// uint16_t *scan, // x0 +// coeff_t *coeff, // x1 +// intptr_t trSize, // x2 +// uint16_t *absCoeff, // x3 +// uint8_t *tabSigCtx, // x4 +// uint16_t scanFlagMask, // x5 +// uint8_t *baseCtx, // x6 +// int offset, // x7 +// int scanPosSigOff, // sp +// int subPosBase) // sp + 8 +function PFX(costCoeffNxN_neon) + // abs(coeff) + add x2, x2, x2 + ld1 {v1.d}0, x1, x2 + ld1 {v1.d}1, x1, x2 + ld1 {v2.d}0, x1, x2 + ld1 {v2.d}1, x1, x2 + abs v1.8h, v1.8h + abs v2.8h, v2.8h + + // WARNING: beyond-bound read here! 
+ // loading scan table + ldr w2, sp + eor w15, w2, #15 + add x1, x0, x15, lsl #1 + ldp q20, q21, x1 + uzp1 v20.16b, v20.16b, v21.16b + movi v21.16b, #15 + eor v0.16b, v20.16b, v21.16b + + // reorder coeff + uzp1 v22.16b, v1.16b, v2.16b + uzp2 v23.16b, v1.16b, v2.16b + tbl v24.16b, {v22.16b}, v0.16b + tbl v25.16b, {v23.16b}, v0.16b + zip1 v2.16b, v24.16b, v25.16b + zip2 v3.16b, v24.16b, v25.16b + + // loading tabSigCtx (+offset) + ldr q1, x4 + tbl v1.16b, {v1.16b}, v0.16b + dup v4.16b, w7 + movi v5.16b, #0 + tbl v4.16b, {v4.16b}, v5.16b + add v1.16b, v1.16b, v4.16b + + // register mapping + // x0 - sum + // x1 - entropyStateBits + // v1 - sigCtx + // {v3,v2} - abs(coeff) + // x2 - scanPosSigOff + // x3 - absCoeff + // x4 - numNonZero + // x5 - scanFlagMask + // x6 - baseCtx + mov x0, #0 + movrel x1, PFX_C(entropyStateBits) + mov x4, #0 + mov x11, #0 + movi v31.16b, #0 + cbz x2, .idx_zero +.loop_ccnn: +// { +// const uint32_t cnt = tabSigCtxblkPos + offset + posOffset; +// ctxSig = cnt & posZeroMask; +// const uint32_t mstate = baseCtxctxSig; +// const uint32_t mps = mstate & 1; +// const uint32_t stateBits = x265_entropyStateBitsmstate ^ sig; +// uint32_t nextState = (stateBits >> 24) + mps; +// if ((mstate ^ sig) == 1) +// nextState = sig; +// baseCtxctxSig = (uint8_t)nextState; +// sum += stateBits; +// } +// absCoeffnumNonZero = tmpCoeffblkPos; +// numNonZero += sig; +// scanPosSigOff--; + + add x13, x3, x4, lsl #1 + sub x2, x2, #1 + str h2, x13 // absCoeffnumNonZero = tmpCoeffblkPos + fmov w14, s1 // x14 = ctxSig + uxtb w14, w14 + ubfx w11, w5, #0, #1 // x11 = sig + lsr x5, x5, #1 + add x4, x4, x11 // numNonZero += sig + ext v1.16b, v1.16b, v31.16b, #1 + ext v2.16b, v2.16b, v3.16b, #2 + ext v3.16b, v3.16b, v31.16b, #2 + ldrb w9, x6, x14 // mstate = baseCtxctxSig + and w10, w9, #1 // mps = mstate & 1 + eor w9, w9, w11 // x9 = mstate ^ sig + add x12, x1, x9, lsl #2 + ldr w13, x12 + add w0, w0, w13 // sum += x265_entropyStateBitsmstate ^ sig + ldrb w13, x12, #3 + add w10, w10, w13 // nextState = (stateBits >> 24) + mps + cmp w9, #1 + csel w10, w11, w10, eq + strb w10, x6, x14 + cbnz x2, .loop_ccnn +.idx_zero: + + add x13, x3, x4, lsl #1 + add x4, x4, x15 + str h2, x13 // absCoeffnumNonZero = tmpCoeffblkPos + + ldr x9, sp, #8 // subPosBase + uxth w9, w9 + cmp w9, #0 + cset x2, eq + add x4, x4, x2 + cbz x4, .exit_ccnn + + sub w2, w2, #1 + uxtb w2, w2 + fmov w3, s1 + and w2, w2, w3 + + ldrb w3, x6, x2 // mstate = baseCtxctxSig + eor w4, w5, w3 // x5 = mstate ^ sig + and w3, w3, #1 // mps = mstate & 1 + add x1, x1, x4, lsl #2 + ldr w11, x1 + ldrb w12, x1, #3 + add w0, w0, w11 // sum += x265_entropyStateBitsmstate ^ sig + add w3, w3, w12 // nextState = (stateBits >> 24) + mps + cmp w4, #1 + csel w3, w5, w3, eq + strb w3, x6, x2 +.exit_ccnn: + ubfx w0, w0, #0, #24 ret endfunc + +const g_SPL_and_mask, align=8 +.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80 +endconst
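The satd_* and sa8d_* kernels added above all evaluate the standard sum of absolute transformed differences: form the residual block, apply a 4-point (or 8-point) Hadamard transform along rows and columns, then sum the absolute values of the coefficients. As a reading aid, here is a minimal scalar C sketch of the 4x4 case; the function name and the final >>1 normalization follow the usual x264/x265 convention and are assumptions, not code taken from this diff.

/* Scalar 4x4 SATD sketch, assuming 8-bit pixels. */
#include <stdint.h>
#include <stdlib.h>

static int satd_4x4_ref(const uint8_t *pix1, intptr_t stride1,
                        const uint8_t *pix2, intptr_t stride2)
{
    int d[4][4], sum = 0;

    /* residual block */
    for (int i = 0; i < 4; i++, pix1 += stride1, pix2 += stride2)
        for (int j = 0; j < 4; j++)
            d[i][j] = pix1[j] - pix2[j];

    /* horizontal 4-point Hadamard transform on each row */
    for (int i = 0; i < 4; i++) {
        int s01 = d[i][0] + d[i][1], d01 = d[i][0] - d[i][1];
        int s23 = d[i][2] + d[i][3], d23 = d[i][2] - d[i][3];
        d[i][0] = s01 + s23; d[i][1] = s01 - s23;
        d[i][2] = d01 + d23; d[i][3] = d01 - d23;
    }

    /* vertical transform on each column, accumulating |coefficients| */
    for (int j = 0; j < 4; j++) {
        int s01 = d[0][j] + d[1][j], d01 = d[0][j] - d[1][j];
        int s23 = d[2][j] + d[3][j], d23 = d[2][j] - d[3][j];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum >> 1;
}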
View file
x265_3.6.tar.gz/source/common/aarch64/sad-a-common.S
Added
@@ -0,0 +1,514 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// This file contains the macros written using NEON instruction set +// that are also used by the SVE2 functions + +#include "asm.S" + +.arch armv8-a + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.macro SAD_START_4 f + ld1 {v0.s}0, x0, x1 + ld1 {v0.s}1, x0, x1 + ld1 {v1.s}0, x2, x3 + ld1 {v1.s}1, x2, x3 + \f v16.8h, v0.8b, v1.8b +.endm + +.macro SAD_4 h +.rept \h / 2 - 1 + SAD_START_4 uabal +.endr +.endm + +.macro SAD_START_8 f + ld1 {v0.8b}, x0, x1 + ld1 {v1.8b}, x2, x3 + ld1 {v2.8b}, x0, x1 + ld1 {v3.8b}, x2, x3 + \f v16.8h, v0.8b, v1.8b + \f v17.8h, v2.8b, v3.8b +.endm + +.macro SAD_8 h +.rept \h / 2 - 1 + SAD_START_8 uabal +.endr +.endm + +.macro SAD_START_16 f + ld1 {v0.16b}, x0, x1 + ld1 {v1.16b}, x2, x3 + ld1 {v2.16b}, x0, x1 + ld1 {v3.16b}, x2, x3 + \f v16.8h, v0.8b, v1.8b + \f\()2 v17.8h, v0.16b, v1.16b + uabal v16.8h, v2.8b, v3.8b + uabal2 v17.8h, v2.16b, v3.16b +.endm + +.macro SAD_16 h +.rept \h / 2 - 1 + SAD_START_16 uabal +.endr +.endm + +.macro SAD_START_32 + movi v16.16b, #0 + movi v17.16b, #0 + movi v18.16b, #0 + movi v19.16b, #0 +.endm + +.macro SAD_32 + ld1 {v0.16b-v1.16b}, x0, x1 + ld1 {v2.16b-v3.16b}, x2, x3 + ld1 {v4.16b-v5.16b}, x0, x1 + ld1 {v6.16b-v7.16b}, x2, x3 + uabal v16.8h, v0.8b, v2.8b + uabal2 v17.8h, v0.16b, v2.16b + uabal v18.8h, v1.8b, v3.8b + uabal2 v19.8h, v1.16b, v3.16b + uabal v16.8h, v4.8b, v6.8b + uabal2 v17.8h, v4.16b, v6.16b + uabal v18.8h, v5.8b, v7.8b + uabal2 v19.8h, v5.16b, v7.16b +.endm + +.macro SAD_END_32 + add v16.8h, v16.8h, v17.8h + add v17.8h, v18.8h, v19.8h + add v16.8h, v16.8h, v17.8h + uaddlv s0, v16.8h + fmov w0, s0 + ret +.endm + +.macro SAD_START_64 + movi v16.16b, #0 + movi v17.16b, #0 + movi v18.16b, #0 + movi v19.16b, #0 + movi v20.16b, #0 + movi v21.16b, #0 + movi v22.16b, #0 + movi v23.16b, #0 +.endm + +.macro SAD_64 + ld1 {v0.16b-v3.16b}, x0, x1 + ld1 {v4.16b-v7.16b}, x2, x3 + ld1 {v24.16b-v27.16b}, x0, x1 + ld1 {v28.16b-v31.16b}, x2, x3 + uabal v16.8h, v0.8b, v4.8b + uabal2 v17.8h, v0.16b, v4.16b + uabal v18.8h, v1.8b, v5.8b + uabal2 v19.8h, v1.16b, v5.16b + uabal v20.8h, v2.8b, v6.8b + uabal2 v21.8h, v2.16b, v6.16b + uabal v22.8h, v3.8b, v7.8b + uabal2 v23.8h, v3.16b, v7.16b + + uabal v16.8h, v24.8b, v28.8b + uabal2 v17.8h, v24.16b, v28.16b + uabal v18.8h, v25.8b, v29.8b + uabal2 v19.8h, v25.16b, v29.16b + uabal v20.8h, v26.8b, v30.8b + uabal2 
v21.8h, v26.16b, v30.16b + uabal v22.8h, v27.8b, v31.8b + uabal2 v23.8h, v27.16b, v31.16b +.endm + +.macro SAD_END_64 + add v16.8h, v16.8h, v17.8h + add v17.8h, v18.8h, v19.8h + add v16.8h, v16.8h, v17.8h + uaddlp v16.4s, v16.8h + add v18.8h, v20.8h, v21.8h + add v19.8h, v22.8h, v23.8h + add v17.8h, v18.8h, v19.8h + uaddlp v17.4s, v17.8h + add v16.4s, v16.4s, v17.4s + uaddlv d0, v16.4s + fmov x0, d0 + ret +.endm + +.macro SAD_START_12 + movrel x12, sad12_mask + ld1 {v31.16b}, x12 + movi v16.16b, #0 + movi v17.16b, #0 +.endm + +.macro SAD_12 + ld1 {v0.16b}, x0, x1 + and v0.16b, v0.16b, v31.16b + ld1 {v1.16b}, x2, x3 + and v1.16b, v1.16b, v31.16b + ld1 {v2.16b}, x0, x1 + and v2.16b, v2.16b, v31.16b + ld1 {v3.16b}, x2, x3 + and v3.16b, v3.16b, v31.16b + uabal v16.8h, v0.8b, v1.8b + uabal2 v17.8h, v0.16b, v1.16b + uabal v16.8h, v2.8b, v3.8b + uabal2 v17.8h, v2.16b, v3.16b +.endm + +.macro SAD_END_12 + add v16.8h, v16.8h, v17.8h + uaddlv s0, v16.8h + fmov w0, s0 + ret +.endm + +.macro SAD_START_24 + movi v16.16b, #0 + movi v17.16b, #0 + movi v18.16b, #0 + sub x1, x1, #16 + sub x3, x3, #16 +.endm + +.macro SAD_24 + ld1 {v0.16b}, x0, #16 + ld1 {v1.8b}, x0, x1 + ld1 {v2.16b}, x2, #16 + ld1 {v3.8b}, x2, x3 + ld1 {v4.16b}, x0, #16 + ld1 {v5.8b}, x0, x1 + ld1 {v6.16b}, x2, #16 + ld1 {v7.8b}, x2, x3 + uabal v16.8h, v0.8b, v2.8b + uabal2 v17.8h, v0.16b, v2.16b + uabal v18.8h, v1.8b, v3.8b + uabal v16.8h, v4.8b, v6.8b + uabal2 v17.8h, v4.16b, v6.16b + uabal v18.8h, v5.8b, v7.8b +.endm + +.macro SAD_END_24 + add v16.8h, v16.8h, v17.8h + add v16.8h, v16.8h, v18.8h + uaddlv s0, v16.8h + fmov w0, s0 + ret +.endm + +.macro SAD_START_48 + movi v16.16b, #0 + movi v17.16b, #0 + movi v18.16b, #0 + movi v19.16b, #0 + movi v20.16b, #0 + movi v21.16b, #0 +.endm + +.macro SAD_48 + ld1 {v0.16b-v2.16b}, x0, x1 + ld1 {v4.16b-v6.16b}, x2, x3 + ld1 {v24.16b-v26.16b}, x0, x1 + ld1 {v28.16b-v30.16b}, x2, x3 + uabal v16.8h, v0.8b, v4.8b + uabal2 v17.8h, v0.16b, v4.16b + uabal v18.8h, v1.8b, v5.8b + uabal2 v19.8h, v1.16b, v5.16b + uabal v20.8h, v2.8b, v6.8b + uabal2 v21.8h, v2.16b, v6.16b + + uabal v16.8h, v24.8b, v28.8b + uabal2 v17.8h, v24.16b, v28.16b + uabal v18.8h, v25.8b, v29.8b + uabal2 v19.8h, v25.16b, v29.16b + uabal v20.8h, v26.8b, v30.8b + uabal2 v21.8h, v26.16b, v30.16b +.endm + +.macro SAD_END_48 + add v16.8h, v16.8h, v17.8h + add v17.8h, v18.8h, v19.8h + add v16.8h, v16.8h, v17.8h + uaddlv s0, v16.8h + fmov w0, s0 + add v18.8h, v20.8h, v21.8h + uaddlv s1, v18.8h + fmov w1, s1 + add w0, w0, w1 + ret +.endm + +.macro SAD_X_START_4 h, x, f + ld1 {v0.s}0, x0, x9 + ld1 {v0.s}1, x0, x9 + ld1 {v1.s}0, x1, x5 + ld1 {v1.s}1, x1, x5 + ld1 {v2.s}0, x2, x5 + ld1 {v2.s}1, x2, x5 + ld1 {v3.s}0, x3, x5 + ld1 {v3.s}1, x3, x5 + \f v16.8h, v0.8b, v1.8b + \f v17.8h, v0.8b, v2.8b + \f v18.8h, v0.8b, v3.8b +.if \x == 4 + ld1 {v4.s}0, x4, x5 + ld1 {v4.s}1, x4, x5 + \f v19.8h, v0.8b, v4.8b +.endif +.endm + +.macro SAD_X_4 h, x +.rept \h/2 - 1 + SAD_X_START_4 \h, \x, uabal +.endr +.endm + +.macro SAD_X_END_4 x + uaddlv s0, v16.8h + uaddlv s1, v17.8h + uaddlv s2, v18.8h + stp s0, s1, x6 +.if \x == 3 + str s2, x6, #8 +.elseif \x == 4 + uaddlv s3, v19.8h + stp s2, s3, x6, #8 +.endif + ret +.endm + +.macro SAD_X_START_8 h, x, f + ld1 {v0.8b}, x0, x9 + ld1 {v1.8b}, x1, x5 + ld1 {v2.8b}, x2, x5 + ld1 {v3.8b}, x3, x5 + \f v16.8h, v0.8b, v1.8b + \f v17.8h, v0.8b, v2.8b + \f v18.8h, v0.8b, v3.8b +.if \x == 4 + ld1 {v4.8b}, x4, x5 + \f v19.8h, v0.8b, v4.8b +.endif +.endm + +.macro SAD_X_8 h x +.rept \h - 1 + SAD_X_START_8 \h, \x, uabal +.endr 
+.endm + +.macro SAD_X_END_8 x + SAD_X_END_4 \x +.endm + +.macro SAD_X_START_12 h, x, f + ld1 {v0.16b}, x0, x9 + and v0.16b, v0.16b, v31.16b + ld1 {v1.16b}, x1, x5 + and v1.16b, v1.16b, v31.16b + ld1 {v2.16b}, x2, x5 + and v2.16b, v2.16b, v31.16b + ld1 {v3.16b}, x3, x5 + and v3.16b, v3.16b, v31.16b + \f v16.8h, v1.8b, v0.8b + \f\()2 v20.8h, v1.16b, v0.16b + \f v17.8h, v2.8b, v0.8b + \f\()2 v21.8h, v2.16b, v0.16b + \f v18.8h, v3.8b, v0.8b + \f\()2 v22.8h, v3.16b, v0.16b +.if \x == 4 + ld1 {v4.16b}, x4, x5 + and v4.16b, v4.16b, v31.16b + \f v19.8h, v4.8b, v0.8b + \f\()2 v23.8h, v4.16b, v0.16b +.endif +.endm + +.macro SAD_X_12 h x +.rept \h - 1 + SAD_X_START_12 \h, \x, uabal +.endr +.endm + +.macro SAD_X_END_12 x + SAD_X_END_16 \x +.endm + +.macro SAD_X_START_16 h, x, f + ld1 {v0.16b}, x0, x9 + ld1 {v1.16b}, x1, x5 + ld1 {v2.16b}, x2, x5 + ld1 {v3.16b}, x3, x5 + \f v16.8h, v1.8b, v0.8b + \f\()2 v20.8h, v1.16b, v0.16b + \f v17.8h, v2.8b, v0.8b + \f\()2 v21.8h, v2.16b, v0.16b + \f v18.8h, v3.8b, v0.8b + \f\()2 v22.8h, v3.16b, v0.16b +.if \x == 4 + ld1 {v4.16b}, x4, x5 + \f v19.8h, v4.8b, v0.8b + \f\()2 v23.8h, v4.16b, v0.16b +.endif +.endm + +.macro SAD_X_16 h x +.rept \h - 1 + SAD_X_START_16 \h, \x, uabal +.endr +.endm + +.macro SAD_X_END_16 x + add v16.8h, v16.8h, v20.8h + add v17.8h, v17.8h, v21.8h + add v18.8h, v18.8h, v22.8h +.if \x == 4 + add v19.8h, v19.8h, v23.8h +.endif + + SAD_X_END_4 \x +.endm + +.macro SAD_X_START_24 x + SAD_X_START_32 \x + sub x5, x5, #16 + sub x9, x9, #16 +.endm + +.macro SAD_X_24 base v1 v2 + ld1 {v0.16b}, \base , #16 + ld1 {v1.8b}, \base , x5 + uabal \v1\().8h, v0.8b, v6.8b + uabal \v1\().8h, v1.8b, v7.8b + uabal2 \v2\().8h, v0.16b, v6.16b +.endm + +.macro SAD_X_END_24 x + SAD_X_END_16 \x +.endm + +.macro SAD_X_START_32 x + movi v16.16b, #0 + movi v17.16b, #0 + movi v18.16b, #0 + movi v20.16b, #0 + movi v21.16b, #0 + movi v22.16b, #0 +.if \x == 4 + movi v19.16b, #0 + movi v23.16b, #0 +.endif +.endm + +.macro SAD_X_32 base v1 v2 + ld1 {v0.16b-v1.16b}, \base , x5 + uabal \v1\().8h, v0.8b, v6.8b + uabal \v1\().8h, v1.8b, v7.8b + uabal2 \v2\().8h, v0.16b, v6.16b + uabal2 \v2\().8h, v1.16b, v7.16b +.endm + +.macro SAD_X_END_32 x + SAD_X_END_16 \x +.endm + +.macro SAD_X_START_48 x + SAD_X_START_32 \x +.endm + +.macro SAD_X_48 x1 v1 v2 + ld1 {v0.16b-v2.16b}, \x1 , x5 + uabal \v1\().8h, v0.8b, v4.8b + uabal \v1\().8h, v1.8b, v5.8b + uabal \v1\().8h, v2.8b, v6.8b + uabal2 \v2\().8h, v0.16b, v4.16b + uabal2 \v2\().8h, v1.16b, v5.16b + uabal2 \v2\().8h, v2.16b, v6.16b +.endm + +.macro SAD_X_END_48 x + SAD_X_END_64 \x +.endm + +.macro SAD_X_START_64 x + SAD_X_START_32 \x +.endm + +.macro SAD_X_64 x1 v1 v2 + ld1 {v0.16b-v3.16b}, \x1 , x5 + uabal \v1\().8h, v0.8b, v4.8b + uabal \v1\().8h, v1.8b, v5.8b + uabal \v1\().8h, v2.8b, v6.8b + uabal \v1\().8h, v3.8b, v7.8b + uabal2 \v2\().8h, v0.16b, v4.16b + uabal2 \v2\().8h, v1.16b, v5.16b + uabal2 \v2\().8h, v2.16b, v6.16b + uabal2 \v2\().8h, v3.16b, v7.16b +.endm + +.macro SAD_X_END_64 x + uaddlp v16.4s, v16.8h + uaddlp v17.4s, v17.8h + uaddlp v18.4s, v18.8h + uaddlp v20.4s, v20.8h + uaddlp v21.4s, v21.8h + uaddlp v22.4s, v22.8h + add v16.4s, v16.4s, v20.4s + add v17.4s, v17.4s, v21.4s + add v18.4s, v18.4s, v22.4s + trn2 v20.2d, v16.2d, v16.2d + trn2 v21.2d, v17.2d, v17.2d + trn2 v22.2d, v18.2d, v18.2d + add v16.2s, v16.2s, v20.2s + add v17.2s, v17.2s, v21.2s + add v18.2s, v18.2s, v22.2s + uaddlp v16.1d, v16.2s + uaddlp v17.1d, v17.2s + uaddlp v18.1d, v18.2s + stp s16, s17, x6, #8 +.if \x == 3 + str s18, x6 +.elseif \x == 4 + 
uaddlp v19.4s, v19.8h + uaddlp v23.4s, v23.8h + add v19.4s, v19.4s, v23.4s + trn2 v23.2d, v19.2d, v19.2d + add v19.2s, v19.2s, v23.2s + uaddlp v19.1d, v19.2s + stp s18, s19, x6 +.endif + ret +.endm + +const sad12_mask, align=8 +.byte 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0 +endconst
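The SAD_* macros in this new helper file accumulate plain sums of absolute differences between two 8-bit blocks, with per-width load patterns (and a masked load via sad12_mask for the 12-wide case). For reference, a minimal scalar C sketch of the quantity being accumulated; the name and signature are illustrative only.

/* Scalar SAD sketch for a WxH block of 8-bit pixels. */
#include <stdint.h>
#include <stdlib.h>

static int sad_ref(const uint8_t *pix1, intptr_t stride1,
                   const uint8_t *pix2, intptr_t stride2, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++, pix1 += stride1, pix2 += stride2)
        for (int x = 0; x < w; x++)
            sum += abs(pix1[x] - pix2[x]);
    return sum;
}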
View file
x265_3.6.tar.gz/source/common/aarch64/sad-a-sve2.S
Added
@@ -0,0 +1,511 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "sad-a-common.S" + +.arch armv8-a+sve2 + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +.macro SAD_SVE2_16 h + mov z16.d, #0 + ptrue p0.h, vl16 +.rept \h + ld1b {z0.h}, p0/z, x0 + ld1b {z2.h}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + uaba z16.h, z0.h, z2.h +.endr + uaddv d0, p0, z16.h + fmov w0, s0 + ret +.endm + +.macro SAD_SVE2_32 h + ptrue p0.b, vl32 +.rept \h + ld1b {z0.b}, p0/z, x0 + ld1b {z4.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + uabalb z16.h, z0.b, z4.b + uabalt z16.h, z0.b, z4.b +.endr + uaddv d0, p0, z16.h + fmov w0, s0 + ret +.endm + +.macro SAD_SVE2_64 h + cmp x9, #48 + bgt .vl_gt_48_pixel_sad_64x\h + mov z16.d, #0 + mov z17.d, #0 + mov z18.d, #0 + mov z19.d, #0 + ptrue p0.b, vl32 +.rept \h + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p0/z, x0, #1, mul vl + ld1b {z4.b}, p0/z, x2 + ld1b {z5.b}, p0/z, x2, #1, mul vl + add x0, x0, x1 + add x2, x2, x3 + uabalb z16.h, z0.b, z4.b + uabalt z17.h, z0.b, z4.b + uabalb z18.h, z1.b, z5.b + uabalt z19.h, z1.b, z5.b +.endr + add z16.h, z16.h, z17.h + add z17.h, z18.h, z19.h + add z16.h, z16.h, z17.h + uadalp z24.s, p0/m, z16.h + uaddv d5, p0, z24.s + fmov x0, d5 + ret +.vl_gt_48_pixel_sad_64x\h\(): + mov z16.d, #0 + mov z17.d, #0 + mov z24.d, #0 + ptrue p0.b, vl64 +.rept \h + ld1b {z0.b}, p0/z, x0 + ld1b {z4.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + uabalb z16.h, z0.b, z4.b + uabalt z17.h, z0.b, z4.b +.endr + add z16.h, z16.h, z17.h + uadalp z24.s, p0/m, z16.h + uaddv d5, p0, z24.s + fmov x0, d5 + ret +.endm + +.macro SAD_SVE2_24 h + mov z16.d, #0 + mov x10, #24 + mov x11, #0 + whilelt p0.b, x11, x10 +.rept \h + ld1b {z0.b}, p0/z, x0 + ld1b {z8.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + uabalb z16.h, z0.b, z8.b + uabalt z16.h, z0.b, z8.b +.endr + uaddv d5, p0, z16.h + fmov w0, s5 + ret +.endm + +.macro SAD_SVE2_48 h + cmp x9, #48 + bgt .vl_gt_48_pixel_sad_48x\h + mov z16.d, #0 + mov z17.d, #0 + mov z18.d, #0 + mov z19.d, #0 + ptrue p0.b, vl32 + ptrue p1.b, vl16 +.rept \h + ld1b {z0.b}, p0/z, x0 + ld1b {z1.b}, p1/z, x0, #1, mul vl + ld1b {z8.b}, p0/z, x2 + ld1b {z9.b}, p1/z, x2, #1, mul vl + add x0, x0, x1 + add x2, x2, x3 + uabalb z16.h, z0.b, z8.b + uabalt z17.h, z0.b, z8.b + uabalb z18.h, z1.b, z9.b + uabalt z19.h, z1.b, z9.b +.endr + add z16.h, z16.h, z17.h + add z17.h, z18.h, z19.h + add z16.h, z16.h, 
z17.h + uaddv d5, p0, z16.h + fmov w0, s5 + ret +.vl_gt_48_pixel_sad_48x\h\(): + mov z16.d, #0 + mov z17.d, #0 + mov x10, #48 + mov x11, #0 + whilelt p0.b, x11, x10 +.rept \h + ld1b {z0.b}, p0/z, x0 + ld1b {z8.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + uabalb z16.h, z0.b, z8.b + uabalt z17.h, z0.b, z8.b +.endr + add z16.h, z16.h, z17.h + uaddv d5, p0, z16.h + fmov w0, s5 + ret +.endm + +// Fully unrolled. +.macro SAD_FUNC_SVE2 w, h +function PFX(pixel_sad_\w\()x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sad_\w\()x\h + SAD_START_\w uabdl + SAD_\w \h +.if \w > 4 + add v16.8h, v16.8h, v17.8h +.endif + uaddlv s0, v16.8h + fmov w0, s0 + ret +.vl_gt_16_pixel_sad_\w\()x\h\(): +.if \w == 4 || \w == 8 || \w == 12 + SAD_START_\w uabdl + SAD_\w \h +.if \w > 4 + add v16.8h, v16.8h, v17.8h +.endif + uaddlv s0, v16.8h + fmov w0, s0 + ret +.else + SAD_SVE2_\w \h +.endif +endfunc +.endm + +// Loop unrolled 4. +.macro SAD_FUNC_LOOP_SVE2 w, h +function PFX(pixel_sad_\w\()x\h\()_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sad_loop_\w\()x\h + SAD_START_\w + + mov w9, #\h/8 +.loop_sve2_\w\()x\h: + sub w9, w9, #1 +.rept 4 + SAD_\w +.endr + cbnz w9, .loop_sve2_\w\()x\h + + SAD_END_\w + +.vl_gt_16_pixel_sad_loop_\w\()x\h\(): +.if \w == 4 || \w == 8 || \w == 12 + SAD_START_\w + + mov w9, #\h/8 +.loop_sve2_loop_\w\()x\h: + sub w9, w9, #1 +.rept 4 + SAD_\w +.endr + cbnz w9, .loop_sve2_loop_\w\()x\h + + SAD_END_\w +.else + SAD_SVE2_\w \h +.endif +endfunc +.endm + +SAD_FUNC_SVE2 4, 4 +SAD_FUNC_SVE2 4, 8 +SAD_FUNC_SVE2 4, 16 +SAD_FUNC_SVE2 8, 4 +SAD_FUNC_SVE2 8, 8 +SAD_FUNC_SVE2 8, 16 +SAD_FUNC_SVE2 8, 32 +SAD_FUNC_SVE2 16, 4 +SAD_FUNC_SVE2 16, 8 +SAD_FUNC_SVE2 16, 12 +SAD_FUNC_SVE2 16, 16 +SAD_FUNC_SVE2 16, 32 +SAD_FUNC_SVE2 16, 64 + +SAD_FUNC_LOOP_SVE2 32, 8 +SAD_FUNC_LOOP_SVE2 32, 16 +SAD_FUNC_LOOP_SVE2 32, 24 +SAD_FUNC_LOOP_SVE2 32, 32 +SAD_FUNC_LOOP_SVE2 32, 64 +SAD_FUNC_LOOP_SVE2 64, 16 +SAD_FUNC_LOOP_SVE2 64, 32 +SAD_FUNC_LOOP_SVE2 64, 48 +SAD_FUNC_LOOP_SVE2 64, 64 +SAD_FUNC_LOOP_SVE2 12, 16 +SAD_FUNC_LOOP_SVE2 24, 32 +SAD_FUNC_LOOP_SVE2 48, 64 + +// SAD_X3 and SAD_X4 code start + +.macro SAD_X_SVE2_24_INNER_GT_16 base z + ld1b {z4.b}, p0/z, \base + add \base, \base, x5 + uabalb \z\().h, z4.b, z0.b + uabalt \z\().h, z4.b, z0.b +.endm + +.macro SAD_X_SVE2_24 h x + mov z20.d, #0 + mov z21.d, #0 + mov z22.d, #0 + mov z23.d, #0 + mov x10, #24 + mov x11, #0 + whilelt p0.b, x11, x10 +.rept \h + ld1b {z0.b}, p0/z, x0 + add x0, x0, x9 + SAD_X_SVE2_24_INNER_GT_16 x1, z20 + SAD_X_SVE2_24_INNER_GT_16 x2, z21 + SAD_X_SVE2_24_INNER_GT_16 x3, z22 +.if \x == 4 + SAD_X_SVE2_24_INNER_GT_16 x4, z23 +.endif +.endr + uaddlv s0, v20.8h + uaddlv s1, v21.8h + uaddlv s2, v22.8h + stp s0, s1, x6 +.if \x == 3 + str s2, x6, #8 +.elseif \x == 4 + uaddv d0, p0, z20.h + uaddv d1, p0, z21.h + uaddv d2, p0, z22.h + stp s2, s3, x6, #8 +.endif + ret +.endm + +.macro SAD_X_SVE2_32_INNER_GT_16 base z + ld1b {z4.b}, p0/z, \base + add \base, \base, x5 + uabalb \z\().h, z4.b, z0.b + uabalt \z\().h, z4.b, z0.b +.endm + +.macro SAD_X_SVE2_32 h x + mov z20.d, #0 + mov z21.d, #0 + mov z22.d, #0 + mov z23.d, #0 + ptrue p0.b, vl32 +.rept \h + ld1b {z0.b}, p0/z, x0 + add x0, x0, x9 + SAD_X_SVE2_32_INNER_GT_16 x1, z20 + SAD_X_SVE2_32_INNER_GT_16 x2, z21 + SAD_X_SVE2_32_INNER_GT_16 x3, z22 +.if \x == 4 + SAD_X_SVE2_32_INNER_GT_16 x4, z23 +.endif +.endr + uaddv d0, p0, z20.h + uaddv d1, p0, z21.h + uaddv d2, p0, z22.h + stp s0, s1, x6 +.if \x == 3 + str s2, x6, #8 +.elseif \x == 4 + uaddv d3, p0, z23.h + stp s2, s3, x6, #8 
+.endif + ret +.endm + +// static void x264_pixel_sad_x3_##size(pixel *fenc, pixel *pix0, pixel *pix1, pixel *pix2, intptr_t i_stride, int scores3) +// static void x264_pixel_sad_x4_##size(pixel *fenc, pixel *pix0, pixel *pix1,pixel *pix2, pixel *pix3, intptr_t i_stride, int scores4) +.macro SAD_X_FUNC_SVE2 x, w, h +function PFX(sad_x\x\()_\w\()x\h\()_sve2) + mov x9, #FENC_STRIDE + +// Make function arguments for x == 3 look like x == 4. +.if \x == 3 + mov x6, x5 + mov x5, x4 +.endif + rdvl x11, #1 + cmp x11, #16 + bgt .vl_gt_16_sad_x\x\()_\w\()x\h +.if \w == 12 + movrel x12, sad12_mask + ld1 {v31.16b}, x12 +.endif + + SAD_X_START_\w \h, \x, uabdl + SAD_X_\w \h, \x + SAD_X_END_\w \x +.vl_gt_16_sad_x\x\()_\w\()x\h\(): +.if \w == 24 || \w == 32 + SAD_X_SVE2_\w \h, \x +.else +.if \w == 12 + movrel x12, sad12_mask + ld1 {v31.16b}, x12 +.endif + + SAD_X_START_\w \h, \x, uabdl + SAD_X_\w \h, \x + SAD_X_END_\w \x +.endif +endfunc +.endm + +.macro SAD_X_LOOP_SVE2 x, w, h +function PFX(sad_x\x\()_\w\()x\h\()_sve2) + mov x9, #FENC_STRIDE + +// Make function arguments for x == 3 look like x == 4. +.if \x == 3 + mov x6, x5 + mov x5, x4 +.endif + rdvl x11, #1 + cmp x11, #16 + bgt .vl_gt_16_sad_x_loop_\x\()_\w\()x\h + SAD_X_START_\w \x + mov w12, #\h/4 +.loop_sad_sve2_x\x\()_\w\()x\h: + sub w12, w12, #1 + .rept 4 + .if \w == 24 + ld1 {v6.16b}, x0, #16 + ld1 {v7.8b}, x0, x9 + .elseif \w == 32 + ld1 {v6.16b-v7.16b}, x0, x9 + .elseif \w == 48 + ld1 {v4.16b-v6.16b}, x0, x9 + .elseif \w == 64 + ld1 {v4.16b-v7.16b}, x0, x9 + .endif + SAD_X_\w x1, v16, v20 + SAD_X_\w x2, v17, v21 + SAD_X_\w x3, v18, v22 + .if \x == 4 + SAD_X_\w x4, v19, v23 + .endif + .endr + cbnz w12, .loop_sad_sve2_x\x\()_\w\()x\h + SAD_X_END_\w \x +.vl_gt_16_sad_x_loop_\x\()_\w\()x\h\(): +.if \w == 24 || \w == 32 + SAD_X_SVE2_\w \h, \x + ret +.else + SAD_X_START_\w \x + mov w12, #\h/4 +.loop_sad_sve2_gt_16_x\x\()_\w\()x\h: + sub w12, w12, #1 + .rept 4 + .if \w == 24 + ld1 {v6.16b}, x0, #16 + ld1 {v7.8b}, x0, x9 + .elseif \w == 32 + ld1 {v6.16b-v7.16b}, x0, x9 + .elseif \w == 48 + ld1 {v4.16b-v6.16b}, x0, x9 + .elseif \w == 64 + ld1 {v4.16b-v7.16b}, x0, x9 + .endif + SAD_X_\w x1, v16, v20 + SAD_X_\w x2, v17, v21 + SAD_X_\w x3, v18, v22 + .if \x == 4 + SAD_X_\w x4, v19, v23 + .endif + .endr + cbnz w12, .loop_sad_sve2_gt_16_x\x\()_\w\()x\h + SAD_X_END_\w \x +.endif +endfunc +.endm + + +SAD_X_FUNC_SVE2 3, 4, 4 +SAD_X_FUNC_SVE2 3, 4, 8 +SAD_X_FUNC_SVE2 3, 4, 16 +SAD_X_FUNC_SVE2 3, 8, 4 +SAD_X_FUNC_SVE2 3, 8, 8 +SAD_X_FUNC_SVE2 3, 8, 16 +SAD_X_FUNC_SVE2 3, 8, 32 +SAD_X_FUNC_SVE2 3, 12, 16 +SAD_X_FUNC_SVE2 3, 16, 4 +SAD_X_FUNC_SVE2 3, 16, 8 +SAD_X_FUNC_SVE2 3, 16, 12 +SAD_X_FUNC_SVE2 3, 16, 16 +SAD_X_FUNC_SVE2 3, 16, 32 +SAD_X_FUNC_SVE2 3, 16, 64 +SAD_X_LOOP_SVE2 3, 24, 32 +SAD_X_LOOP_SVE2 3, 32, 8 +SAD_X_LOOP_SVE2 3, 32, 16 +SAD_X_LOOP_SVE2 3, 32, 24 +SAD_X_LOOP_SVE2 3, 32, 32 +SAD_X_LOOP_SVE2 3, 32, 64 +SAD_X_LOOP_SVE2 3, 48, 64 +SAD_X_LOOP_SVE2 3, 64, 16 +SAD_X_LOOP_SVE2 3, 64, 32 +SAD_X_LOOP_SVE2 3, 64, 48 +SAD_X_LOOP_SVE2 3, 64, 64 + +SAD_X_FUNC_SVE2 4, 4, 4 +SAD_X_FUNC_SVE2 4, 4, 8 +SAD_X_FUNC_SVE2 4, 4, 16 +SAD_X_FUNC_SVE2 4, 8, 4 +SAD_X_FUNC_SVE2 4, 8, 8 +SAD_X_FUNC_SVE2 4, 8, 16 +SAD_X_FUNC_SVE2 4, 8, 32 +SAD_X_FUNC_SVE2 4, 12, 16 +SAD_X_FUNC_SVE2 4, 16, 4 +SAD_X_FUNC_SVE2 4, 16, 8 +SAD_X_FUNC_SVE2 4, 16, 12 +SAD_X_FUNC_SVE2 4, 16, 16 +SAD_X_FUNC_SVE2 4, 16, 32 +SAD_X_FUNC_SVE2 4, 16, 64 +SAD_X_LOOP_SVE2 4, 24, 32 +SAD_X_LOOP_SVE2 4, 32, 8 +SAD_X_LOOP_SVE2 4, 32, 16 +SAD_X_LOOP_SVE2 4, 32, 24 +SAD_X_LOOP_SVE2 4, 32, 32 +SAD_X_LOOP_SVE2 4, 
32, 64 +SAD_X_LOOP_SVE2 4, 48, 64 +SAD_X_LOOP_SVE2 4, 64, 16 +SAD_X_LOOP_SVE2 4, 64, 32 +SAD_X_LOOP_SVE2 4, 64, 48 +SAD_X_LOOP_SVE2 4, 64, 64
View file
x265_3.5.tar.gz/source/common/aarch64/sad-a.S -> x265_3.6.tar.gz/source/common/aarch64/sad-a.S
Changed
@@ -1,7 +1,8 @@ /***************************************************************************** - * Copyright (C) 2020 MulticoreWare, Inc + * Copyright (C) 2020-2021 MulticoreWare, Inc * * Authors: Hongbin Liu <liuhongbin1@huawei.com> + * Sebastian Pop <spop@amazon.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -22,84 +23,186 @@ *****************************************************************************/ #include "asm.S" +#include "sad-a-common.S" +#ifdef __APPLE__ +.section __RODATA,__rodata +#else .section .rodata +#endif .align 4 .text -.macro SAD_X_START_8 x - ld1 {v0.8b}, x0, x9 -.if \x == 3 - ld1 {v1.8b}, x1, x4 - ld1 {v2.8b}, x2, x4 - ld1 {v3.8b}, x3, x4 -.elseif \x == 4 - ld1 {v1.8b}, x1, x5 - ld1 {v2.8b}, x2, x5 - ld1 {v3.8b}, x3, x5 - ld1 {v4.8b}, x4, x5 -.endif - uabdl v16.8h, v0.8b, v1.8b - uabdl v17.8h, v0.8b, v2.8b - uabdl v18.8h, v0.8b, v3.8b -.if \x == 4 - uabdl v19.8h, v0.8b, v4.8b +// Fully unrolled. +.macro SAD_FUNC w, h +function PFX(pixel_sad_\w\()x\h\()_neon) + SAD_START_\w uabdl + SAD_\w \h +.if \w > 4 + add v16.8h, v16.8h, v17.8h .endif + uaddlv s0, v16.8h + fmov w0, s0 + ret +endfunc +.endm + +// Loop unrolled 4. +.macro SAD_FUNC_LOOP w, h +function PFX(pixel_sad_\w\()x\h\()_neon) + SAD_START_\w + + mov w9, #\h/8 +.loop_\w\()x\h: + sub w9, w9, #1 +.rept 4 + SAD_\w +.endr + cbnz w9, .loop_\w\()x\h + + SAD_END_\w +endfunc .endm -.macro SAD_X_8 x - ld1 {v0.8b}, x0, x9 +SAD_FUNC 4, 4 +SAD_FUNC 4, 8 +SAD_FUNC 4, 16 +SAD_FUNC 8, 4 +SAD_FUNC 8, 8 +SAD_FUNC 8, 16 +SAD_FUNC 8, 32 +SAD_FUNC 16, 4 +SAD_FUNC 16, 8 +SAD_FUNC 16, 12 +SAD_FUNC 16, 16 +SAD_FUNC 16, 32 +SAD_FUNC 16, 64 + +SAD_FUNC_LOOP 32, 8 +SAD_FUNC_LOOP 32, 16 +SAD_FUNC_LOOP 32, 24 +SAD_FUNC_LOOP 32, 32 +SAD_FUNC_LOOP 32, 64 +SAD_FUNC_LOOP 64, 16 +SAD_FUNC_LOOP 64, 32 +SAD_FUNC_LOOP 64, 48 +SAD_FUNC_LOOP 64, 64 +SAD_FUNC_LOOP 12, 16 +SAD_FUNC_LOOP 24, 32 +SAD_FUNC_LOOP 48, 64 + +// SAD_X3 and SAD_X4 code start + +// static void x264_pixel_sad_x3_##size(pixel *fenc, pixel *pix0, pixel *pix1, pixel *pix2, intptr_t i_stride, int scores3) +// static void x264_pixel_sad_x4_##size(pixel *fenc, pixel *pix0, pixel *pix1,pixel *pix2, pixel *pix3, intptr_t i_stride, int scores4) +.macro SAD_X_FUNC x, w, h +function PFX(sad_x\x\()_\w\()x\h\()_neon) + mov x9, #FENC_STRIDE + +// Make function arguments for x == 3 look like x == 4. .if \x == 3 - ld1 {v1.8b}, x1, x4 - ld1 {v2.8b}, x2, x4 - ld1 {v3.8b}, x3, x4 -.elseif \x == 4 - ld1 {v1.8b}, x1, x5 - ld1 {v2.8b}, x2, x5 - ld1 {v3.8b}, x3, x5 - ld1 {v4.8b}, x4, x5 + mov x6, x5 + mov x5, x4 .endif - uabal v16.8h, v0.8b, v1.8b - uabal v17.8h, v0.8b, v2.8b - uabal v18.8h, v0.8b, v3.8b -.if \x == 4 - uabal v19.8h, v0.8b, v4.8b + +.if \w == 12 + movrel x12, sad12_mask + ld1 {v31.16b}, x12 .endif + + SAD_X_START_\w \h, \x, uabdl + SAD_X_\w \h, \x + SAD_X_END_\w \x +endfunc .endm -.macro SAD_X_8xN x, h -function x265_sad_x\x\()_8x\h\()_neon +.macro SAD_X_LOOP x, w, h +function PFX(sad_x\x\()_\w\()x\h\()_neon) mov x9, #FENC_STRIDE - SAD_X_START_8 \x -.rept \h - 1 - SAD_X_8 \x -.endr - uaddlv s0, v16.8h - uaddlv s1, v17.8h - uaddlv s2, v18.8h -.if \x == 4 - uaddlv s3, v19.8h -.endif +// Make function arguments for x == 3 look like x == 4. 
.if \x == 3 - stp s0, s1, x5 - str s2, x5, #8 -.elseif \x == 4 - stp s0, s1, x6 - stp s2, s3, x6, #8 + mov x6, x5 + mov x5, x4 .endif - ret + SAD_X_START_\w \x + mov w12, #\h/4 +.loop_sad_x\x\()_\w\()x\h: + sub w12, w12, #1 + .rept 4 + .if \w == 24 + ld1 {v6.16b}, x0, #16 + ld1 {v7.8b}, x0, x9 + .elseif \w == 32 + ld1 {v6.16b-v7.16b}, x0, x9 + .elseif \w == 48 + ld1 {v4.16b-v6.16b}, x0, x9 + .elseif \w == 64 + ld1 {v4.16b-v7.16b}, x0, x9 + .endif + SAD_X_\w x1, v16, v20 + SAD_X_\w x2, v17, v21 + SAD_X_\w x3, v18, v22 + .if \x == 4 + SAD_X_\w x4, v19, v23 + .endif + .endr + cbnz w12, .loop_sad_x\x\()_\w\()x\h + SAD_X_END_\w \x endfunc .endm -SAD_X_8xN 3 4 -SAD_X_8xN 3 8 -SAD_X_8xN 3 16 -SAD_X_8xN 3 32 -SAD_X_8xN 4 4 -SAD_X_8xN 4 8 -SAD_X_8xN 4 16 -SAD_X_8xN 4 32 +SAD_X_FUNC 3, 4, 4 +SAD_X_FUNC 3, 4, 8 +SAD_X_FUNC 3, 4, 16 +SAD_X_FUNC 3, 8, 4 +SAD_X_FUNC 3, 8, 8 +SAD_X_FUNC 3, 8, 16 +SAD_X_FUNC 3, 8, 32 +SAD_X_FUNC 3, 12, 16 +SAD_X_FUNC 3, 16, 4 +SAD_X_FUNC 3, 16, 8 +SAD_X_FUNC 3, 16, 12 +SAD_X_FUNC 3, 16, 16 +SAD_X_FUNC 3, 16, 32 +SAD_X_FUNC 3, 16, 64 +SAD_X_LOOP 3, 24, 32 +SAD_X_LOOP 3, 32, 8 +SAD_X_LOOP 3, 32, 16 +SAD_X_LOOP 3, 32, 24 +SAD_X_LOOP 3, 32, 32 +SAD_X_LOOP 3, 32, 64 +SAD_X_LOOP 3, 48, 64 +SAD_X_LOOP 3, 64, 16 +SAD_X_LOOP 3, 64, 32 +SAD_X_LOOP 3, 64, 48 +SAD_X_LOOP 3, 64, 64 + +SAD_X_FUNC 4, 4, 4 +SAD_X_FUNC 4, 4, 8 +SAD_X_FUNC 4, 4, 16 +SAD_X_FUNC 4, 8, 4 +SAD_X_FUNC 4, 8, 8 +SAD_X_FUNC 4, 8, 16 +SAD_X_FUNC 4, 8, 32 +SAD_X_FUNC 4, 12, 16 +SAD_X_FUNC 4, 16, 4 +SAD_X_FUNC 4, 16, 8 +SAD_X_FUNC 4, 16, 12 +SAD_X_FUNC 4, 16, 16 +SAD_X_FUNC 4, 16, 32 +SAD_X_FUNC 4, 16, 64 +SAD_X_LOOP 4, 24, 32 +SAD_X_LOOP 4, 32, 8 +SAD_X_LOOP 4, 32, 16 +SAD_X_LOOP 4, 32, 24 +SAD_X_LOOP 4, 32, 32 +SAD_X_LOOP 4, 32, 64 +SAD_X_LOOP 4, 48, 64 +SAD_X_LOOP 4, 64, 16 +SAD_X_LOOP 4, 64, 32 +SAD_X_LOOP 4, 64, 48 +SAD_X_LOOP 4, 64, 64
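The sad_x3/sad_x4 entry points rewritten above follow the prototypes quoted in the comments: one encode block, read with the fixed FENC_STRIDE pitch, is scored against three or four candidate blocks that share a reference stride, and the sums are written to the scores array. Below is a hedged C sketch of the x4 variant; it reuses the sad_ref helper sketched earlier, and the FENC_STRIDE value is an assumption, not taken from this diff.

/* Scalar sad_x4 sketch; sad_ref is the helper sketched above. */
#define FENC_STRIDE 64   /* fixed fenc buffer pitch; value assumed here */

static void sad_x4_ref(const uint8_t *fenc,
                       const uint8_t *pix0, const uint8_t *pix1,
                       const uint8_t *pix2, const uint8_t *pix3,
                       intptr_t frefstride, int res[4], int w, int h)
{
    res[0] = sad_ref(fenc, FENC_STRIDE, pix0, frefstride, w, h);
    res[1] = sad_ref(fenc, FENC_STRIDE, pix1, frefstride, w, h);
    res[2] = sad_ref(fenc, FENC_STRIDE, pix2, frefstride, w, h);
    res[3] = sad_ref(fenc, FENC_STRIDE, pix3, frefstride, w, h);
}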
View file
x265_3.6.tar.gz/source/common/aarch64/ssd-a-common.S
Added
@@ -0,0 +1,37 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +// This file contains the macros written using NEON instruction set +// that are also used by the SVE2 functions + +#include "asm.S" + +.arch armv8-a + +.macro ret_v0_w0 + trn2 v1.2d, v0.2d, v0.2d + add v0.2s, v0.2s, v1.2s + addp v0.2s, v0.2s, v0.2s + fmov w0, s0 + ret +.endm
View file
x265_3.6.tar.gz/source/common/aarch64/ssd-a-sve.S
Added
@@ -0,0 +1,78 @@
+/*****************************************************************************
+ * Copyright (C) 2022-2023 MulticoreWare, Inc
+ *
+ * Authors: David Chen <david.chen@myais.com.cn>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "asm-sve.S"
+
+.arch armv8-a+sve
+
+#ifdef __APPLE__
+.section __RODATA,__rodata
+#else
+.section .rodata
+#endif
+
+.align 4
+
+.text
+
+function PFX(pixel_sse_pp_4x4_sve)
+    ptrue           p0.s, vl4
+    ld1b            {z0.s}, p0/z, [x0]
+    ld1b            {z17.s}, p0/z, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
+    sub             z0.s, p0/m, z0.s, z17.s
+    mul             z0.s, p0/m, z0.s, z0.s
+.rept 3
+    ld1b            {z16.s}, p0/z, [x0]
+    ld1b            {z17.s}, p0/z, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
+    sub             z16.s, p0/m, z16.s, z17.s
+    mla             z0.s, p0/m, z16.s, z16.s
+.endr
+    uaddv           d0, p0, z0.s
+    fmov            w0, s0
+    ret
+endfunc
+
+function PFX(pixel_sse_pp_4x8_sve)
+    ptrue           p0.s, vl4
+    ld1b            {z0.s}, p0/z, [x0]
+    ld1b            {z17.s}, p0/z, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
+    sub             z0.s, p0/m, z0.s, z17.s
+    mul             z0.s, p0/m, z0.s, z0.s
+.rept 7
+    ld1b            {z16.s}, p0/z, [x0]
+    ld1b            {z17.s}, p0/z, [x2]
+    add             x0, x0, x1
+    add             x2, x2, x3
+    sub             z16.s, p0/m, z16.s, z17.s
+    mla             z0.s, p0/m, z16.s, z16.s
+.endr
+    uaddv           d0, p0, z0.s
+    fmov            w0, s0
+    ret
+endfunc
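The pixel_sse_pp_4x4_sve and pixel_sse_pp_4x8_sve routines above widen each pixel to 32 bits, subtract, and multiply-accumulate the square, i.e. they compute the block's sum of squared pixel differences. A minimal scalar C sketch of that primitive follows; the name and signature are illustrative only.

/* Scalar sum-of-squared-differences sketch for a WxH block of 8-bit pixels. */
#include <stdint.h>

static int sse_pp_ref(const uint8_t *pix1, intptr_t stride1,
                      const uint8_t *pix2, intptr_t stride2, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++, pix1 += stride1, pix2 += stride2)
        for (int x = 0; x < w; x++) {
            int d = pix1[x] - pix2[x];
            sum += d * d;
        }
    return sum;
}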
View file
x265_3.6.tar.gz/source/common/aarch64/ssd-a-sve2.S
Added
@@ -0,0 +1,887 @@ +/***************************************************************************** + * Copyright (C) 2022-2023 MulticoreWare, Inc + * + * Authors: David Chen <david.chen@myais.com.cn> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm-sve.S" +#include "ssd-a-common.S" + +.arch armv8-a+sve2 + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +function PFX(pixel_sse_pp_32x32_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sse_pp_32x32 + mov w12, #8 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_sse_pp_32_sve2: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b,v17.16b}, x0, x1 + ld1 {v18.16b,v19.16b}, x2, x3 + usubl v2.8h, v16.8b, v18.8b + usubl2 v3.8h, v16.16b, v18.16b + usubl v4.8h, v17.8b, v19.8b + usubl2 v5.8h, v17.16b, v19.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_pp_32_sve2 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +.vl_gt_16_pixel_sse_pp_32x32: + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + ld1b {z18.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z18.b + usublt z2.h, z16.b, z18.b + smullb z0.s, z1.h, z1.h + smlalt z0.s, z1.h, z1.h + smlalb z0.s, z2.h, z2.h + smlalt z0.s, z2.h, z2.h +.rept 31 + ld1b {z16.b}, p0/z, x0 + ld1b {z18.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z18.b + usublt z2.h, z16.b, z18.b + smullb z0.s, z1.h, z1.h + smlalt z0.s, z1.h, z1.h + smlalb z0.s, z2.h, z2.h + smlalt z0.s, z2.h, z2.h +.endr + uaddv d3, p0, z0.s + fmov w0, s3 + ret +endfunc + +function PFX(pixel_sse_pp_32x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sse_pp_32x64 + ptrue p0.b, vl16 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z18.b}, p0/z, x2 + ld1b {z19.b}, p0/z, x2, #1, mul vl + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z18.b + usublt z2.h, z16.b, z18.b + usublb z3.h, z17.b, z19.b + usublt z4.h, z17.b, z19.b + smullb z20.s, z1.h, z1.h + smullt z21.s, z1.h, z1.h + smlalb z20.s, z2.h, z2.h + smlalt z21.s, z2.h, z2.h + smlalb z20.s, z3.h, z3.h + smlalt z21.s, z3.h, z3.h + smlalb z20.s, z4.h, z4.h + smlalt z21.s, z4.h, z4.h +.rept 63 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z18.b}, p0/z, x2 + ld1b {z19.b}, p0/z, x2, #1, mul vl + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z18.b + usublt z2.h, z16.b, z18.b + usublb z3.h, z17.b, z19.b + usublt z4.h, z17.b, z19.b + smlalb z20.s, z1.h, z1.h + smlalt 
z21.s, z1.h, z1.h + smlalb z20.s, z2.h, z2.h + smlalt z21.s, z2.h, z2.h + smlalb z20.s, z3.h, z3.h + smlalt z21.s, z3.h, z3.h + smlalb z20.s, z4.h, z4.h + smlalt z21.s, z4.h, z4.h +.endr + uaddv d3, p0, z20.s + fmov w0, s3 + uaddv d4, p0, z21.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_16_pixel_sse_pp_32x64: + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + ld1b {z18.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z18.b + usublt z2.h, z16.b, z18.b + smullb z20.s, z1.h, z1.h + smullt z21.s, z1.h, z1.h + smlalb z20.s, z2.h, z2.h + smlalt z21.s, z2.h, z2.h +.rept 63 + ld1b {z16.b}, p0/z, x0 + ld1b {z18.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z18.b + usublt z2.h, z16.b, z18.b + smlalb z20.s, z1.h, z1.h + smlalt z21.s, z1.h, z1.h + smlalb z20.s, z2.h, z2.h + smlalt z21.s, z2.h, z2.h +.endr + uaddv d3, p0, z20.s + fmov w0, s3 + uaddv d4, p0, z21.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_sse_pp_64x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sse_pp_64x64 + mov w12, #16 + movi v0.16b, #0 + movi v1.16b, #0 + +.loop_sse_pp_64_sve2: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v19.16b}, x0, x1 + ld1 {v20.16b-v23.16b}, x2, x3 + + usubl v2.8h, v16.8b, v20.8b + usubl2 v3.8h, v16.16b, v20.16b + usubl v4.8h, v17.8b, v21.8b + usubl2 v5.8h, v17.16b, v21.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + + usubl v2.8h, v18.8b, v22.8b + usubl2 v3.8h, v18.16b, v22.16b + usubl v4.8h, v19.8b, v23.8b + usubl2 v5.8h, v19.16b, v23.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_pp_64_sve2 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +.vl_gt_16_pixel_sse_pp_64x64: + cmp x9, #48 + bgt .vl_gt_48_pixel_sse_pp_64x64 + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z20.b}, p0/z, x2 + ld1b {z21.b}, p0/z, x2, #1, mul vl + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z20.b + usublt z2.h, z16.b, z20.b + usublb z3.h, z17.b, z21.b + usublt z4.h, z17.b, z21.b + smullb z24.s, z1.h, z1.h + smullt z25.s, z1.h, z1.h + smlalb z24.s, z2.h, z2.h + smlalt z25.s, z2.h, z2.h + smlalb z24.s, z3.h, z3.h + smlalt z25.s, z3.h, z3.h + smlalb z24.s, z4.h, z4.h + smlalt z25.s, z4.h, z4.h +.rept 63 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z20.b}, p0/z, x2 + ld1b {z21.b}, p0/z, x2, #1, mul vl + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z20.b + usublt z2.h, z16.b, z20.b + usublb z3.h, z17.b, z21.b + usublt z4.h, z17.b, z21.b + smlalb z24.s, z1.h, z1.h + smlalt z25.s, z1.h, z1.h + smlalb z24.s, z2.h, z2.h + smlalt z25.s, z2.h, z2.h + smlalb z24.s, z3.h, z3.h + smlalt z25.s, z3.h, z3.h + smlalb z24.s, z4.h, z4.h + smlalt z25.s, z4.h, z4.h +.endr + uaddv d3, p0, z24.s + fmov w0, s3 + uaddv d4, p0, z25.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_48_pixel_sse_pp_64x64: + ptrue p0.b, vl64 + ld1b {z16.b}, p0/z, x0 + ld1b {z20.b}, p0/z, x2 + add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z20.b + usublt z2.h, z16.b, z20.b + smullb z24.s, z1.h, z1.h + smullt z25.s, z1.h, z1.h + smlalb z24.s, z2.h, z2.h + smlalt z25.s, z2.h, z2.h +.rept 63 + ld1b {z16.b}, p0/z, x0 + ld1b {z20.b}, p0/z, x2 + 
add x0, x0, x1 + add x2, x2, x3 + usublb z1.h, z16.b, z20.b + usublt z2.h, z16.b, z20.b + smlalb z24.s, z1.h, z1.h + smlalt z25.s, z1.h, z1.h + smlalb z24.s, z2.h, z2.h + smlalt z25.s, z2.h, z2.h +.endr + uaddv d3, p0, z24.s + fmov w0, s3 + uaddv d4, p0, z25.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_sse_ss_4x4_sve2) + ptrue p0.b, vl8 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z17.h + smullb z3.s, z1.h, z1.h + smullt z4.s, z1.h, z1.h +.rept 3 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z17.h + smlalb z3.s, z1.h, z1.h + smlalt z4.s, z1.h, z1.h +.endr + uaddv d3, p0, z3.s + fmov w0, s3 + uaddv d4, p0, z4.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_sse_ss_8x8_sve2) + ptrue p0.b, vl16 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z17.h + smullb z3.s, z1.h, z1.h + smullt z4.s, z1.h, z1.h +.rept 7 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z17.h + smlalb z3.s, z1.h, z1.h + smlalt z4.s, z1.h, z1.h +.endr + uaddv d3, p0, z3.s + fmov w0, s3 + uaddv d4, p0, z4.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_sse_ss_16x16_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sse_ss_16x16 + ptrue p0.b, vl16 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z18.b}, p0/z, x2 + ld1b {z19.b}, p0/z, x2, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z18.h + sub z2.h, z17.h, z19.h + smullb z3.s, z1.h, z1.h + smullt z4.s, z1.h, z1.h + smlalb z3.s, z2.h, z2.h + smlalt z4.s, z2.h, z2.h +.rept 15 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z18.b}, p0/z, x2 + ld1b {z19.b}, p0/z, x2, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z18.h + sub z2.h, z17.h, z19.h + smlalb z3.s, z1.h, z1.h + smlalt z4.s, z1.h, z1.h + smlalb z3.s, z2.h, z2.h + smlalt z4.s, z2.h, z2.h +.endr + uaddv d3, p0, z3.s + fmov w0, s3 + uaddv d4, p0, z4.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_16_pixel_sse_ss_16x16: + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + ld1b {z18.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z18.h + smullb z3.s, z1.h, z1.h + smullt z4.s, z1.h, z1.h +.rept 15 + ld1b {z16.b}, p0/z, x0 + ld1b {z18.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z18.h + smlalb z3.s, z1.h, z1.h + smlalt z4.s, z1.h, z1.h +.endr + uaddv d3, p0, z3.s + fmov w0, s3 + uaddv d4, p0, z4.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_sse_ss_32x32_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sse_ss_32x32 + ptrue p0.b, vl16 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z18.b}, p0/z, x0, #2, mul vl + ld1b {z19.b}, p0/z, x0, #3, mul vl + ld1b {z20.b}, p0/z, x2 + ld1b {z21.b}, p0/z, x2, #1, mul vl + ld1b {z22.b}, p0/z, x2, #2, mul vl + ld1b {z23.b}, p0/z, x2, #3, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z20.h + sub z2.h, z17.h, z21.h + sub z3.h, z18.h, z22.h + sub z4.h, z19.h, z23.h + smullb z5.s, z1.h, z1.h + smullt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h + smlalb z5.s, z3.h, z3.h + smlalt z6.s, z3.h, z3.h + smlalb z5.s, z4.h, z4.h + smlalt z6.s, z4.h, z4.h +.rept 31 + ld1b 
{z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z18.b}, p0/z, x0, #2, mul vl + ld1b {z19.b}, p0/z, x0, #3, mul vl + ld1b {z20.b}, p0/z, x2 + ld1b {z21.b}, p0/z, x2, #1, mul vl + ld1b {z22.b}, p0/z, x2, #2, mul vl + ld1b {z23.b}, p0/z, x2, #3, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z20.h + sub z2.h, z17.h, z21.h + sub z3.h, z18.h, z22.h + sub z4.h, z19.h, z23.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h + smlalb z5.s, z3.h, z3.h + smlalt z6.s, z3.h, z3.h + smlalb z5.s, z4.h, z4.h + smlalt z6.s, z4.h, z4.h +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_16_pixel_sse_ss_32x32: + cmp x9, #48 + bgt .vl_gt_48_pixel_sse_ss_32x32 + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z20.b}, p0/z, x2 + ld1b {z21.b}, p0/z, x2, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z20.h + sub z2.h, z17.h, z21.h + smullb z5.s, z1.h, z1.h + smullt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h +.rept 31 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + ld1b {z20.b}, p0/z, x2 + ld1b {z21.b}, p0/z, x2, #1, mul vl + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z20.h + sub z2.h, z17.h, z21.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_48_pixel_sse_ss_32x32: + ptrue p0.b, vl64 + ld1b {z16.b}, p0/z, x0 + ld1b {z20.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z20.h + smullb z5.s, z1.h, z1.h + smullt z6.s, z1.h, z1.h +.rept 31 + ld1b {z16.b}, p0/z, x0 + ld1b {z20.b}, p0/z, x2 + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 + sub z1.h, z16.h, z20.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_sse_ss_64x64_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_sse_ss_64x64 + ptrue p0.b, vl16 + ld1b {z24.b}, p0/z, x0 + ld1b {z25.b}, p0/z, x0, #1, mul vl + ld1b {z26.b}, p0/z, x0, #2, mul vl + ld1b {z27.b}, p0/z, x0, #3, mul vl + ld1b {z28.b}, p0/z, x2 + ld1b {z29.b}, p0/z, x2, #1, mul vl + ld1b {z30.b}, p0/z, x2, #2, mul vl + ld1b {z31.b}, p0/z, x2, #3, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + sub z2.h, z26.h, z30.h + sub z3.h, z27.h, z31.h + smullb z5.s, z0.h, z0.h + smullt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h + smlalb z5.s, z3.h, z3.h + smlalt z6.s, z3.h, z3.h + ld1b {z24.b}, p0/z, x0, #4, mul vl + ld1b {z25.b}, p0/z, x0, #5, mul vl + ld1b {z26.b}, p0/z, x0, #6, mul vl + ld1b {z27.b}, p0/z, x0, #7, mul vl + ld1b {z28.b}, p0/z, x2, #4, mul vl + ld1b {z29.b}, p0/z, x2, #5, mul vl + ld1b {z30.b}, p0/z, x2, #6, mul vl + ld1b {z31.b}, p0/z, x2, #7, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + sub z2.h, z26.h, z30.h + sub z3.h, z27.h, z31.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h + smlalb z5.s, z3.h, z3.h + smlalt z6.s, z3.h, z3.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.rept 63 + ld1b {z24.b}, p0/z, x0 + ld1b {z25.b}, p0/z, x0, #1, mul vl + ld1b 
{z26.b}, p0/z, x0, #2, mul vl + ld1b {z27.b}, p0/z, x0, #3, mul vl + ld1b {z28.b}, p0/z, x2 + ld1b {z29.b}, p0/z, x2, #1, mul vl + ld1b {z30.b}, p0/z, x2, #2, mul vl + ld1b {z31.b}, p0/z, x2, #3, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + sub z2.h, z26.h, z30.h + sub z3.h, z27.h, z31.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h + smlalb z5.s, z3.h, z3.h + smlalt z6.s, z3.h, z3.h + ld1b {z24.b}, p0/z, x0, #4, mul vl + ld1b {z25.b}, p0/z, x0, #5, mul vl + ld1b {z26.b}, p0/z, x0, #6, mul vl + ld1b {z27.b}, p0/z, x0, #7, mul vl + ld1b {z28.b}, p0/z, x2, #4, mul vl + ld1b {z29.b}, p0/z, x2, #5, mul vl + ld1b {z30.b}, p0/z, x2, #6, mul vl + ld1b {z31.b}, p0/z, x2, #7, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + sub z2.h, z26.h, z30.h + sub z3.h, z27.h, z31.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + smlalb z5.s, z2.h, z2.h + smlalt z6.s, z2.h, z2.h + smlalb z5.s, z3.h, z3.h + smlalt z6.s, z3.h, z3.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_16_pixel_sse_ss_64x64: + cmp x9, #48 + bgt .vl_gt_48_pixel_sse_ss_64x64 + ptrue p0.b, vl32 + ld1b {z24.b}, p0/z, x0 + ld1b {z25.b}, p0/z, x0, #1, mul vl + ld1b {z28.b}, p0/z, x2 + ld1b {z29.b}, p0/z, x2, #1, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + smullb z5.s, z0.h, z0.h + smullt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + ld1b {z24.b}, p0/z, x0, #1, mul vl + ld1b {z25.b}, p0/z, x0, #2, mul vl + ld1b {z28.b}, p0/z, x2, #1, mul vl + ld1b {z29.b}, p0/z, x2, #2, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.rept 63 + ld1b {z24.b}, p0/z, x0 + ld1b {z25.b}, p0/z, x0, #1, mul vl + ld1b {z28.b}, p0/z, x2 + ld1b {z29.b}, p0/z, x2, #1, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + ld1b {z24.b}, p0/z, x0, #1, mul vl + ld1b {z25.b}, p0/z, x0, #2, mul vl + ld1b {z28.b}, p0/z, x2, #1, mul vl + ld1b {z29.b}, p0/z, x2, #2, mul vl + sub z0.h, z24.h, z28.h + sub z1.h, z25.h, z29.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + smlalb z5.s, z1.h, z1.h + smlalt z6.s, z1.h, z1.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_48_pixel_sse_ss_64x64: + cmp x9, #112 + bgt .vl_gt_112_pixel_sse_ss_64x64 + ptrue p0.b, vl64 + ld1b {z24.b}, p0/z, x0 + ld1b {z28.b}, p0/z, x2 + sub z0.h, z24.h, z28.h + smullb z5.s, z0.h, z0.h + smullt z6.s, z0.h, z0.h + ld1b {z24.b}, p0/z, x0, #1, mul vl + ld1b {z28.b}, p0/z, x2, #1, mul vl + sub z0.h, z24.h, z28.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.rept 63 + ld1b {z24.b}, p0/z, x0 + ld1b {z28.b}, p0/z, x2 + sub z0.h, z24.h, z28.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + ld1b {z24.b}, p0/z, x0, #1, mul vl + ld1b {z28.b}, p0/z, x2, #1, mul vl + sub z0.h, z24.h, z28.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, 
p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +.vl_gt_112_pixel_sse_ss_64x64: + ptrue p0.b, vl128 + ld1b {z24.b}, p0/z, x0 + ld1b {z28.b}, p0/z, x2 + sub z0.h, z24.h, z28.h + smullb z5.s, z0.h, z0.h + smullt z6.s, z0.h, z0.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.rept 63 + ld1b {z24.b}, p0/z, x0 + ld1b {z28.b}, p0/z, x2 + sub z0.h, z24.h, z28.h + smlalb z5.s, z0.h, z0.h + smlalt z6.s, z0.h, z0.h + add x0, x0, x1, lsl #1 + add x2, x2, x3, lsl #1 +.endr + uaddv d3, p0, z5.s + fmov w0, s3 + uaddv d4, p0, z6.s + fmov w1, s4 + add w0, w0, w1 + ret +endfunc + +function PFX(pixel_ssd_s_4x4_sve2) + ptrue p0.b, vl8 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smullb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.rept 3 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smlalb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.endr + uaddv d3, p0, z0.s + fmov w0, s3 + ret +endfunc + +function PFX(pixel_ssd_s_8x8_sve2) + ptrue p0.b, vl16 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smullb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.rept 7 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smlalb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.endr + uaddv d3, p0, z0.s + fmov w0, s3 + ret +endfunc + +function PFX(pixel_ssd_s_16x16_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_ssd_s_16x16 + add x1, x1, x1 + mov w12, #4 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_ssd_s_16_sve2: + sub w12, w12, #1 +.rept 2 + ld1 {v4.16b,v5.16b}, x0, x1 + ld1 {v6.16b,v7.16b}, x0, x1 + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + smlal v0.4s, v6.4h, v6.4h + smlal2 v1.4s, v6.8h, v6.8h + smlal v0.4s, v7.4h, v7.4h + smlal2 v1.4s, v7.8h, v7.8h +.endr + cbnz w12, .loop_ssd_s_16_sve2 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +.vl_gt_16_pixel_ssd_s_16x16: + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smullb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.rept 15 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smlalb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.endr + uaddv d3, p0, z0.s + fmov w0, s3 + ret +endfunc + +function PFX(pixel_ssd_s_32x32_sve2) + rdvl x9, #1 + cmp x9, #16 + bgt .vl_gt_16_pixel_ssd_s_32x32 + add x1, x1, x1 + mov w12, #8 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_ssd_s_32: + sub w12, w12, #1 +.rept 4 + ld1 {v4.16b-v7.16b}, x0, x1 + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + smlal v0.4s, v6.4h, v6.4h + smlal2 v1.4s, v6.8h, v6.8h + smlal v0.4s, v7.4h, v7.4h + smlal2 v1.4s, v7.8h, v7.8h +.endr + cbnz w12, .loop_ssd_s_32 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +.vl_gt_16_pixel_ssd_s_32x32: + cmp x9, #48 + bgt .vl_gt_48_pixel_ssd_s_32x32 + ptrue p0.b, vl32 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + add x0, x0, x1, lsl #1 + smullb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h + smlalb z0.s, z17.h, z17.h + smlalt z0.s, z17.h, z17.h +.rept 31 + ld1b {z16.b}, p0/z, x0 + ld1b {z17.b}, p0/z, x0, #1, mul vl + add x0, x0, x1, lsl #1 + smlalb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h + smlalb z0.s, z17.h, z17.h + smlalt z0.s, z17.h, z17.h +.endr + uaddv d3, p0, z0.s + fmov w0, s3 + ret +.vl_gt_48_pixel_ssd_s_32x32: + ptrue p0.b, vl64 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smullb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.rept 31 + ld1b {z16.b}, p0/z, x0 + add x0, x0, x1, lsl #1 + smlalb z0.s, z16.h, z16.h + smlalt z0.s, z16.h, z16.h +.endr + uaddv d3, p0, z0.s + fmov 
w0, s3 + ret +endfunc
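
Note: a recurring pattern in the SVE2 kernels above is the prologue "rdvl x9, #1; cmp x9, #16; bgt ...", which reads the hardware vector length in bytes and branches to a wider code path when the implementation offers more than 128-bit vectors (some functions branch again at 48 and 112 bytes). A hedged C++ sketch of that dispatch idea using the ACLE svcntb() intrinsic; the kernel names are placeholders, not x265 symbols:

#include <arm_sve.h>   // ACLE intrinsics; compile with e.g. -march=armv8-a+sve2
#include <cstdint>
#include <cstddef>

// Placeholder kernels, for illustration only.
static uint32_t sse_pp_32x32_vl128(const uint8_t*, intptr_t, const uint8_t*, intptr_t) { return 0; }
static uint32_t sse_pp_32x32_wide(const uint8_t*, intptr_t, const uint8_t*, intptr_t)  { return 0; }

// Mirrors the "rdvl x9, #1; cmp x9, #16; bgt ..." prologue: take the wide
// path only when the vector length exceeds 128 bits (16 bytes).
static uint32_t sse_pp_32x32_dispatch(const uint8_t* p1, intptr_t s1,
                                      const uint8_t* p2, intptr_t s2)
{
    if (svcntb() > 16)                           // svcntb(): vector length in bytes
        return sse_pp_32x32_wide(p1, s1, p2, s2);
    return sse_pp_32x32_vl128(p1, s1, p2, s2);   // 128-bit vectors: NEON-width loop
}
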
View file
x265_3.6.tar.gz/source/common/aarch64/ssd-a.S
Added
@@ -0,0 +1,476 @@ +/***************************************************************************** + * Copyright (C) 2021 MulticoreWare, Inc + * + * Authors: Sebastian Pop <spop@amazon.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" +#include "ssd-a-common.S" + +#ifdef __APPLE__ +.section __RODATA,__rodata +#else +.section .rodata +#endif + +.align 4 + +.text + +function PFX(pixel_sse_pp_4x4_neon) + ld1 {v16.s}0, x0, x1 + ld1 {v17.s}0, x2, x3 + ld1 {v18.s}0, x0, x1 + ld1 {v19.s}0, x2, x3 + ld1 {v20.s}0, x0, x1 + ld1 {v21.s}0, x2, x3 + ld1 {v22.s}0, x0, x1 + ld1 {v23.s}0, x2, x3 + + usubl v1.8h, v16.8b, v17.8b + usubl v2.8h, v18.8b, v19.8b + usubl v3.8h, v20.8b, v21.8b + usubl v4.8h, v22.8b, v23.8b + + smull v0.4s, v1.4h, v1.4h + smlal v0.4s, v2.4h, v2.4h + smlal v0.4s, v3.4h, v3.4h + smlal v0.4s, v4.4h, v4.4h + ret_v0_w0 +endfunc + +function PFX(pixel_sse_pp_4x8_neon) + ld1 {v16.s}0, x0, x1 + ld1 {v17.s}0, x2, x3 + usubl v1.8h, v16.8b, v17.8b + ld1 {v16.s}0, x0, x1 + ld1 {v17.s}0, x2, x3 + smull v0.4s, v1.4h, v1.4h +.rept 6 + usubl v1.8h, v16.8b, v17.8b + ld1 {v16.s}0, x0, x1 + smlal v0.4s, v1.4h, v1.4h + ld1 {v17.s}0, x2, x3 +.endr + usubl v1.8h, v16.8b, v17.8b + smlal v0.4s, v1.4h, v1.4h + ret_v0_w0 +endfunc + +function PFX(pixel_sse_pp_8x8_neon) + ld1 {v16.8b}, x0, x1 + ld1 {v17.8b}, x2, x3 + usubl v1.8h, v16.8b, v17.8b + ld1 {v16.8b}, x0, x1 + smull v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ld1 {v17.8b}, x2, x3 + +.rept 6 + usubl v1.8h, v16.8b, v17.8b + ld1 {v16.8b}, x0, x1 + smlal v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ld1 {v17.8b}, x2, x3 +.endr + usubl v1.8h, v16.8b, v17.8b + smlal v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ret_v0_w0 +endfunc + +function PFX(pixel_sse_pp_8x16_neon) + ld1 {v16.8b}, x0, x1 + ld1 {v17.8b}, x2, x3 + usubl v1.8h, v16.8b, v17.8b + ld1 {v16.8b}, x0, x1 + smull v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ld1 {v17.8b}, x2, x3 + +.rept 14 + usubl v1.8h, v16.8b, v17.8b + ld1 {v16.8b}, x0, x1 + smlal v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ld1 {v17.8b}, x2, x3 +.endr + usubl v1.8h, v16.8b, v17.8b + smlal v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ret_v0_w0 +endfunc + +.macro sse_pp_16xN h +function PFX(pixel_sse_pp_16x\h\()_neon) + ld1 {v16.16b}, x0, x1 + ld1 {v17.16b}, x2, x3 + usubl v1.8h, v16.8b, v17.8b + usubl2 v2.8h, v16.16b, v17.16b + ld1 {v16.16b}, x0, x1 + ld1 {v17.16b}, x2, x3 + smull v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + smlal v0.4s, v2.4h, v2.4h + smlal2 v0.4s, v2.8h, v2.8h +.rept \h - 2 + usubl v1.8h, v16.8b, v17.8b + usubl2 v2.8h, v16.16b, v17.16b + ld1 {v16.16b}, x0, x1 + 
smlal v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + ld1 {v17.16b}, x2, x3 + smlal v0.4s, v2.4h, v2.4h + smlal2 v0.4s, v2.8h, v2.8h +.endr + usubl v1.8h, v16.8b, v17.8b + usubl2 v2.8h, v16.16b, v17.16b + smlal v0.4s, v1.4h, v1.4h + smlal2 v0.4s, v1.8h, v1.8h + smlal v0.4s, v2.4h, v2.4h + smlal2 v0.4s, v2.8h, v2.8h + ret_v0_w0 +endfunc +.endm + +sse_pp_16xN 16 +sse_pp_16xN 32 + +function PFX(pixel_sse_pp_32x32_neon) + mov w12, #8 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_sse_pp_32: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b,v17.16b}, x0, x1 + ld1 {v18.16b,v19.16b}, x2, x3 + usubl v2.8h, v16.8b, v18.8b + usubl2 v3.8h, v16.16b, v18.16b + usubl v4.8h, v17.8b, v19.8b + usubl2 v5.8h, v17.16b, v19.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_pp_32 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_sse_pp_32x64_neon) + mov w12, #16 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_sse_pp_32x64: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b,v17.16b}, x0, x1 + ld1 {v18.16b,v19.16b}, x2, x3 + usubl v2.8h, v16.8b, v18.8b + usubl2 v3.8h, v16.16b, v18.16b + usubl v4.8h, v17.8b, v19.8b + usubl2 v5.8h, v17.16b, v19.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_pp_32x64 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_sse_pp_64x64_neon) + mov w12, #16 + movi v0.16b, #0 + movi v1.16b, #0 + +.loop_sse_pp_64: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v19.16b}, x0, x1 + ld1 {v20.16b-v23.16b}, x2, x3 + + usubl v2.8h, v16.8b, v20.8b + usubl2 v3.8h, v16.16b, v20.16b + usubl v4.8h, v17.8b, v21.8b + usubl2 v5.8h, v17.16b, v21.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + + usubl v2.8h, v18.8b, v22.8b + usubl2 v3.8h, v18.16b, v22.16b + usubl v4.8h, v19.8b, v23.8b + usubl2 v5.8h, v19.16b, v23.16b + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_pp_64 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_sse_ss_4x4_neon) + add x1, x1, x1 + add x3, x3, x3 + ld1 {v16.8b}, x0, x1 + ld1 {v17.8b}, x2, x3 + sub v2.4h, v16.4h, v17.4h + ld1 {v16.8b}, x0, x1 + ld1 {v17.8b}, x2, x3 + smull v0.4s, v2.4h, v2.4h + sub v2.4h, v16.4h, v17.4h + ld1 {v16.8b}, x0, x1 + ld1 {v17.8b}, x2, x3 + smlal v0.4s, v2.4h, v2.4h + sub v2.4h, v16.4h, v17.4h + ld1 {v16.8b}, x0, x1 + smlal v0.4s, v2.4h, v2.4h + ld1 {v17.8b}, x2, x3 + sub v2.4h, v16.4h, v17.4h + smlal v0.4s, v2.4h, v2.4h + ret_v0_w0 +endfunc + +function PFX(pixel_sse_ss_8x8_neon) + add x1, x1, x1 + add x3, x3, x3 + ld1 {v16.16b}, x0, x1 + ld1 {v17.16b}, x2, x3 + sub v2.8h, v16.8h, v17.8h + ld1 {v16.16b}, x0, x1 + ld1 {v17.16b}, x2, x3 + smull v0.4s, v2.4h, v2.4h + smull2 v1.4s, v2.8h, v2.8h + sub v2.8h, v16.8h, v17.8h +.rept 6 + ld1 {v16.16b}, x0, x1 + ld1 {v17.16b}, x2, x3 + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, 
v2.8h + sub v2.8h, v16.8h, v17.8h +.endr + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_sse_ss_16x16_neon) + add x1, x1, x1 + add x3, x3, x3 + mov w12, #4 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_sse_ss_16: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b, v17.16b}, x0, x1 + ld1 {v18.16b, v19.16b}, x2, x3 + sub v2.8h, v16.8h, v18.8h + sub v3.8h, v17.8h, v19.8h + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h +.endr + cbnz w12, .loop_sse_ss_16 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_sse_ss_32x32_neon) + add x1, x1, x1 + add x3, x3, x3 + + mov w12, #8 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_sse_ss_32: + sub w12, w12, #1 +.rept 4 + ld1 {v16.16b-v19.16b}, x0, x1 + ld1 {v20.16b-v23.16b}, x2, x3 + sub v2.8h, v16.8h, v20.8h + sub v3.8h, v17.8h, v21.8h + sub v4.8h, v18.8h, v22.8h + sub v5.8h, v19.8h, v23.8h + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_ss_32 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_sse_ss_64x64_neon) + add x1, x1, x1 + add x3, x3, x3 + sub x1, x1, #64 + sub x3, x3, #64 + + mov w12, #32 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_sse_ss_64: + sub w12, w12, #1 +.rept 2 + ld1 {v16.16b-v19.16b}, x0, #64 + ld1 {v20.16b-v23.16b}, x2, #64 + sub v2.8h, v16.8h, v20.8h + sub v3.8h, v17.8h, v21.8h + sub v4.8h, v18.8h, v22.8h + sub v5.8h, v19.8h, v23.8h + ld1 {v16.16b-v19.16b}, x0, x1 + ld1 {v20.16b-v23.16b}, x2, x3 + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + sub v2.8h, v16.8h, v20.8h + sub v3.8h, v17.8h, v21.8h + sub v4.8h, v18.8h, v22.8h + sub v5.8h, v19.8h, v23.8h + smlal v0.4s, v2.4h, v2.4h + smlal2 v1.4s, v2.8h, v2.8h + smlal v0.4s, v3.4h, v3.4h + smlal2 v1.4s, v3.8h, v3.8h + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + cbnz w12, .loop_sse_ss_64 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_ssd_s_4x4_neon) + add x1, x1, x1 + ld1 {v4.8b}, x0, x1 + ld1 {v5.8b}, x0, x1 + ld1 {v6.8b}, x0, x1 + ld1 {v7.8b}, x0 + smull v0.4s, v4.4h, v4.4h + smull v1.4s, v5.4h, v5.4h + smlal v0.4s, v6.4h, v6.4h + smlal v1.4s, v7.4h, v7.4h + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_ssd_s_8x8_neon) + add x1, x1, x1 + ld1 {v4.16b}, x0, x1 + ld1 {v5.16b}, x0, x1 + smull v0.4s, v4.4h, v4.4h + smull2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.rept 3 + ld1 {v4.16b}, x0, x1 + ld1 {v5.16b}, x0, x1 + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h +.endr + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_ssd_s_16x16_neon) + add x1, x1, x1 + mov w12, #4 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_ssd_s_16: + sub w12, w12, #1 +.rept 2 + ld1 {v4.16b,v5.16b}, x0, x1 + ld1 {v6.16b,v7.16b}, x0, x1 + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + smlal v0.4s, v6.4h, v6.4h + smlal2 v1.4s, v6.8h, v6.8h + smlal v0.4s, v7.4h, v7.4h + smlal2 
v1.4s, v7.8h, v7.8h +.endr + cbnz w12, .loop_ssd_s_16 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc + +function PFX(pixel_ssd_s_32x32_neon) + add x1, x1, x1 + mov w12, #8 + movi v0.16b, #0 + movi v1.16b, #0 +.loop_ssd_s_32: + sub w12, w12, #1 +.rept 4 + ld1 {v4.16b-v7.16b}, x0, x1 + smlal v0.4s, v4.4h, v4.4h + smlal2 v1.4s, v4.8h, v4.8h + smlal v0.4s, v5.4h, v5.4h + smlal2 v1.4s, v5.8h, v5.8h + smlal v0.4s, v6.4h, v6.4h + smlal2 v1.4s, v6.8h, v6.8h + smlal v0.4s, v7.4h, v7.4h + smlal2 v1.4s, v7.8h, v7.8h +.endr + cbnz w12, .loop_ssd_s_32 + add v0.4s, v0.4s, v1.4s + ret_v0_w0 +endfunc
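
Note: the _ss and ssd_s variants above operate on 16-bit residuals rather than 8-bit pixels, which is why the incoming strides are doubled ("add x1, x1, x1") before the loops: the strides arrive in element units and are converted to byte offsets. A small reference sketch of both computations, assuming int16_t inputs and element-unit strides:

#include <cstdint>
#include <cstddef>

// Sum of squared differences between two int16_t residual blocks (sse_ss).
static uint64_t sse_ss_ref(const int16_t* a, intptr_t strideA,
                           const int16_t* b, intptr_t strideB,
                           int width, int height)
{
    uint64_t sum = 0;
    for (int y = 0; y < height; y++, a += strideA, b += strideB)
        for (int x = 0; x < width; x++)
        {
            int d = a[x] - b[x];
            sum += (uint64_t)(d * d);
        }
    return sum;
}

// Sum of squares of a single int16_t block (the ssd_s kernels).
static uint64_t ssd_s_ref(const int16_t* a, intptr_t stride, int width, int height)
{
    uint64_t sum = 0;
    for (int y = 0; y < height; y++, a += stride)
        for (int x = 0; x < width; x++)
            sum += (uint64_t)(a[x] * a[x]);
    return sum;
}
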
View file
x265_3.5.tar.gz/source/common/common.h -> x265_3.6.tar.gz/source/common/common.h
Changed
@@ -130,7 +130,6 @@ typedef uint64_t pixel4; typedef int64_t ssum2_t; #define SHIFT_TO_BITPLANE 9 -#define HISTOGRAM_BINS 1024 #else typedef uint8_t pixel; typedef uint16_t sum_t; @@ -138,7 +137,6 @@ typedef uint32_t pixel4; typedef int32_t ssum2_t; // Signed sum #define SHIFT_TO_BITPLANE 7 -#define HISTOGRAM_BINS 256 #endif // if HIGH_BIT_DEPTH #if X265_DEPTH < 10 @@ -162,6 +160,8 @@ #define MIN_QPSCALE 0.21249999999999999 #define MAX_MAX_QPSCALE 615.46574234477100 +#define FRAME_BRIGHTNESS_THRESHOLD 50.0 // Min % of pixels in a frame, that are above BRIGHTNESS_THRESHOLD for it to be considered a bright frame +#define FRAME_EDGE_THRESHOLD 10.0 // Min % of edge pixels in a frame, for it to be considered to have high edge density template<typename T> @@ -340,6 +340,9 @@ #define FILLER_OVERHEAD (NAL_TYPE_OVERHEAD + START_CODE_OVERHEAD + 1) #define MAX_NUM_DYN_REFINE (NUM_CU_DEPTH * X265_REFINE_INTER_LEVELS) +#define X265_BYTE 8 + +#define MAX_MCSTF_TEMPORAL_WINDOW_LENGTH 8 namespace X265_NS { @@ -434,6 +437,14 @@ #define x265_unlink(fileName) unlink(fileName) #define x265_rename(oldName, newName) rename(oldName, newName) #endif +/* Close a file */ +#define x265_fclose(file) if (file != NULL) fclose(file); file=NULL; +#define x265_fread(val, size, readSize, fileOffset,errorMessage)\ + if (fread(val, size, readSize, fileOffset) != readSize)\ + {\ + x265_log(NULL, X265_LOG_ERROR, errorMessage); \ + return; \ + } int x265_exp2fix8(double x); double x265_ssim2dB(double ssim);
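
Note: the new x265_fclose / x265_fread helpers centralise a bail-out-on-short-read pattern for features that read side-channel files (the film-grain SEI input file is the obvious client, though that is an inference). A standalone sketch of the same guard, using fprintf in place of x265_log; the file layout and names are made up for illustration:

#include <cstdio>
#include <cstdint>

// Standalone mirror of the x265_fread idea: read or bail with a message.
#define CHECKED_FREAD(val, size, count, file, msg)           \
    if (fread(val, size, count, file) != (size_t)(count))    \
    {                                                        \
        fprintf(stderr, "%s", msg);                          \
        fclose(file);                                        \
        return;                                              \
    }

// Example: read a small header from a side file, bailing on short reads.
static void readSideFile(const char* path)
{
    FILE* f = fopen(path, "rb");
    if (!f)
        return;
    uint32_t header[4];
    CHECKED_FREAD(header, sizeof(uint32_t), 4, f, "short read on side file\n");
    // ... use header ...
    fclose(f);
}
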
View file
x265_3.5.tar.gz/source/common/cpu.cpp -> x265_3.6.tar.gz/source/common/cpu.cpp
Changed
@@ -7,6 +7,8 @@ * Steve Borho <steve@borho.org> * Hongbin Liu <liuhongbin1@huawei.com> * Yimeng Su <yimeng.su@huawei.com> + * Josh Dekker <josh@itanimul.li> + * Jean-Baptiste Kempf <jb@videolan.org> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -105,6 +107,14 @@ { "NEON", X265_CPU_NEON }, { "FastNeonMRC", X265_CPU_FAST_NEON_MRC }, +#elif X265_ARCH_ARM64 + { "NEON", X265_CPU_NEON }, +#if defined(HAVE_SVE) + { "SVE", X265_CPU_SVE }, +#endif +#if defined(HAVE_SVE2) + { "SVE2", X265_CPU_SVE2 }, +#endif #elif X265_ARCH_POWER8 { "Altivec", X265_CPU_ALTIVEC }, @@ -369,12 +379,30 @@ flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0; #endif // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc) -#elif X265_ARCH_ARM64 - flags |= X265_CPU_NEON; #endif // if HAVE_ARMV6 return flags; } +#elif X265_ARCH_ARM64 + +uint32_t cpu_detect(bool benableavx512) +{ + int flags = 0; + + #if defined(HAVE_SVE2) + flags |= X265_CPU_SVE2; + flags |= X265_CPU_SVE; + flags |= X265_CPU_NEON; + #elif defined(HAVE_SVE) + flags |= X265_CPU_SVE; + flags |= X265_CPU_NEON; + #elif HAVE_NEON + flags |= X265_CPU_NEON; + #endif + + return flags; +} + #elif X265_ARCH_POWER8 uint32_t cpu_detect(bool benableavx512)
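
Note: on AArch64 the new cpu_detect derives the capability flags from compile-time macros, and the flags are cumulative: SVE2 implies SVE, which implies NEON. A hedged sketch of that logic and of how a caller might test the returned mask; the flag constants below are illustrative stand-ins, not the real X265_CPU_* bit values:

#include <cstdint>
#include <cstdio>

// Illustrative stand-ins for the x265 CPU flag bits.
static const uint32_t CPU_NEON = 1u << 0;
static const uint32_t CPU_SVE  = 1u << 1;
static const uint32_t CPU_SVE2 = 1u << 2;

// Mirrors the cumulative tiers in cpu_detect(): each tier implies the ones below.
static uint32_t detect_arm64_flags(bool haveSve2, bool haveSve, bool haveNeon)
{
    uint32_t flags = 0;
    if (haveSve2)
        flags |= CPU_SVE2 | CPU_SVE | CPU_NEON;
    else if (haveSve)
        flags |= CPU_SVE | CPU_NEON;
    else if (haveNeon)
        flags |= CPU_NEON;
    return flags;
}

int main()
{
    uint32_t flags = detect_arm64_flags(true, true, true);
    if (flags & CPU_SVE2)
        printf("SVE2 kernels would be selected\n");
    return 0;
}
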
View file
x265_3.5.tar.gz/source/common/frame.cpp -> x265_3.6.tar.gz/source/common/frame.cpp
Changed
@@ -64,12 +64,40 @@ m_edgeBitPlane = NULL; m_edgeBitPic = NULL; m_isInsideWindow = 0; + + // mcstf + m_isSubSampled = NULL; + m_mcstf = NULL; + m_refPicCnt0 = 0; + m_refPicCnt1 = 0; + m_nextMCSTF = NULL; + m_prevMCSTF = NULL; + + m_tempLayer = 0; + m_sameLayerRefPic = false; } bool Frame::create(x265_param *param, float* quantOffsets) { m_fencPic = new PicYuv; m_param = param; + + if (m_param->bEnableTemporalFilter) + { + m_mcstf = new TemporalFilter; + m_mcstf->init(param); + + m_fencPicSubsampled2 = new PicYuv; + m_fencPicSubsampled4 = new PicYuv; + + if (!m_fencPicSubsampled2->createScaledPicYUV(param, 2)) + return false; + if (!m_fencPicSubsampled4->createScaledPicYUV(param, 4)) + return false; + + CHECKED_MALLOC_ZERO(m_isSubSampled, int, 1); + } + CHECKED_MALLOC_ZERO(m_rcData, RcStats, 1); if (param->bCTUInfo) @@ -151,6 +179,22 @@ return false; } +bool Frame::createSubSample() +{ + + m_fencPicSubsampled2 = new PicYuv; + m_fencPicSubsampled4 = new PicYuv; + + if (!m_fencPicSubsampled2->createScaledPicYUV(m_param, 2)) + return false; + if (!m_fencPicSubsampled4->createScaledPicYUV(m_param, 4)) + return false; + CHECKED_MALLOC_ZERO(m_isSubSampled, int, 1); + return true; +fail: + return false; +} + bool Frame::allocEncodeData(x265_param *param, const SPS& sps) { m_encData = new FrameData; @@ -207,6 +251,26 @@ m_fencPic = NULL; } + if (m_param->bEnableTemporalFilter) + { + + if (m_fencPicSubsampled2) + { + m_fencPicSubsampled2->destroy(); + delete m_fencPicSubsampled2; + m_fencPicSubsampled2 = NULL; + } + + if (m_fencPicSubsampled4) + { + m_fencPicSubsampled4->destroy(); + delete m_fencPicSubsampled4; + m_fencPicSubsampled4 = NULL; + } + delete m_mcstf; + X265_FREE(m_isSubSampled); + } + if (m_reconPic) { m_reconPic->destroy(); @@ -267,7 +331,8 @@ X265_FREE(m_addOnPrevChange); m_addOnPrevChange = NULL; } - m_lowres.destroy(); + + m_lowres.destroy(m_param); X265_FREE(m_rcData); if (m_param->bDynamicRefine)
View file
x265_3.5.tar.gz/source/common/frame.h -> x265_3.6.tar.gz/source/common/frame.h
Changed
@@ -28,6 +28,7 @@ #include "common.h" #include "lowres.h" #include "threading.h" +#include "temporalfilter.h" namespace X265_NS { // private namespace @@ -70,6 +71,7 @@ double count4; double offset4; double bufferFillFinal; + int64_t currentSatd; }; class Frame @@ -83,8 +85,12 @@ /* Data associated with x265_picture */ PicYuv* m_fencPic; + PicYuv* m_fencPicSubsampled2; + PicYuv* m_fencPicSubsampled4; + int m_poc; int m_encodeOrder; + int m_gopOffset; int64_t m_pts; // user provided presentation time stamp int64_t m_reorderedPts; int64_t m_dts; @@ -132,6 +138,13 @@ bool m_classifyFrame; int m_fieldNum; + /*MCSTF*/ + TemporalFilter* m_mcstf; + int m_refPicCnt2; + Frame* m_nextMCSTF; // PicList doubly linked list pointers + Frame* m_prevMCSTF; + int* m_isSubSampled; + /* aq-mode 4 : Gaussian, edge and theta frames for edge information */ pixel* m_edgePic; pixel* m_gaussianPic; @@ -143,9 +156,15 @@ int m_isInsideWindow; + /*Frame's temporal layer info*/ + uint8_t m_tempLayer; + int8_t m_gopId; + bool m_sameLayerRefPic; + Frame(); bool create(x265_param *param, float* quantOffsets); + bool createSubSample(); bool allocEncodeData(x265_param *param, const SPS& sps); void reinit(const SPS& sps); void destroy();
View file
x265_3.5.tar.gz/source/common/framedata.cpp -> x265_3.6.tar.gz/source/common/framedata.cpp
Changed
@@ -62,7 +62,7 @@ } else return false; - CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame); + CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame + 1); CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight); reinit(sps);
View file
x265_3.5.tar.gz/source/common/lowres.cpp -> x265_3.6.tar.gz/source/common/lowres.cpp
Changed
@@ -28,6 +28,28 @@ using namespace X265_NS; +/* + * Down Sample input picture + */ +static +void frame_lowres_core(const pixel* src0, pixel* dst0, + intptr_t src_stride, intptr_t dst_stride, int width, int height) +{ + for (int y = 0; y < height; y++) + { + const pixel* src1 = src0 + src_stride; + for (int x = 0; x < width; x++) + { + // slower than naive bilinear, but matches asm +#define FILTER(a, b, c, d) ((((a + b + 1) >> 1) + ((c + d + 1) >> 1) + 1) >> 1) + dst0x = FILTER(src02 * x, src12 * x, src02 * x + 1, src12 * x + 1); +#undef FILTER + } + src0 += src_stride * 2; + dst0 += dst_stride; + } +} + bool PicQPAdaptationLayer::create(uint32_t width, uint32_t height, uint32_t partWidth, uint32_t partHeight, uint32_t numAQPartInWidthExt, uint32_t numAQPartInHeightExt) { aqPartWidth = partWidth; @@ -73,7 +95,7 @@ size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY); size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX; - if (!!param->rc.aqMode || !!param->rc.hevcAq || !!param->bAQMotion) + if (!!param->rc.aqMode || !!param->rc.hevcAq || !!param->bAQMotion || !!param->bEnableWeightedPred || !!param->bEnableWeightedBiPred) { CHECKED_MALLOC_ZERO(qpAqOffset, double, cuCountFullRes); CHECKED_MALLOC_ZERO(invQscaleFactor, int, cuCountFullRes); @@ -190,13 +212,45 @@ } } + if (param->bHistBasedSceneCut) + { + quarterSampleLowResWidth = widthFullRes / 4; + quarterSampleLowResHeight = heightFullRes / 4; + quarterSampleLowResOriginX = 16; + quarterSampleLowResOriginY = 16; + quarterSampleLowResStrideY = quarterSampleLowResWidth + 2 * quarterSampleLowResOriginY; + + size_t quarterSampleLowResPlanesize = quarterSampleLowResStrideY * (quarterSampleLowResHeight + 2 * quarterSampleLowResOriginX); + /* allocate quarter sampled lowres buffers */ + CHECKED_MALLOC_ZERO(quarterSampleLowResBuffer, pixel, quarterSampleLowResPlanesize); + + // Allocate memory for Histograms + picHistogram = X265_MALLOC(uint32_t***, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t***)); + picHistogram0 = X265_MALLOC(uint32_t**, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + for (uint32_t wd = 1; wd < NUMBER_OF_SEGMENTS_IN_WIDTH; wd++) { + picHistogramwd = picHistogram0 + wd * NUMBER_OF_SEGMENTS_IN_HEIGHT; + } + + for (uint32_t regionInPictureWidthIndex = 0; regionInPictureWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; regionInPictureWidthIndex++) + { + for (uint32_t regionInPictureHeightIndex = 0; regionInPictureHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; regionInPictureHeightIndex++) + { + picHistogramregionInPictureWidthIndexregionInPictureHeightIndex = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH *sizeof(uint32_t*)); + picHistogramregionInPictureWidthIndexregionInPictureHeightIndex0 = X265_MALLOC(uint32_t, 3 * HISTOGRAM_NUMBER_OF_BINS * sizeof(uint32_t)); + for (uint32_t wd = 1; wd < 3; wd++) { + picHistogramregionInPictureWidthIndexregionInPictureHeightIndexwd = picHistogramregionInPictureWidthIndexregionInPictureHeightIndex0 + wd * HISTOGRAM_NUMBER_OF_BINS; + } + } + } + } + return true; fail: return false; } -void Lowres::destroy() +void Lowres::destroy(x265_param* param) { X265_FREE(buffer0); if(bEnableHME) @@ -234,7 +288,8 @@ X265_FREE(invQscaleFactor8x8); X265_FREE(edgeInclined); X265_FREE(qpAqMotionOffset); - X265_FREE(blockVariance); + if (param->bDynamicRefine || param->bEnableFades) + X265_FREE(blockVariance); if (maxAQDepth > 0) { for (uint32_t d = 0; d < 4; d++) @@ -254,6 +309,29 @@ delete pAQLayer; } + + // Histograms + if (param->bHistBasedSceneCut) + { + for 
(uint32_t segmentInFrameWidthIdx = 0; segmentInFrameWidthIdx < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIdx++) + { + if (picHistogramsegmentInFrameWidthIdx) + { + for (uint32_t segmentInFrameHeightIdx = 0; segmentInFrameHeightIdx < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIdx++) + { + if (picHistogramsegmentInFrameWidthIdxsegmentInFrameHeightIdx) + X265_FREE(picHistogramsegmentInFrameWidthIdxsegmentInFrameHeightIdx0); + X265_FREE(picHistogramsegmentInFrameWidthIdxsegmentInFrameHeightIdx); + } + } + } + if (picHistogram) + X265_FREE(picHistogram0); + X265_FREE(picHistogram); + + X265_FREE(quarterSampleLowResBuffer); + + } } // (re) initialize lowres state void Lowres::init(PicYuv *origPic, int poc) @@ -266,10 +344,6 @@ indB = 0; memset(costEst, -1, sizeof(costEst)); memset(weightedCostDelta, 0, sizeof(weightedCostDelta)); - interPCostPercDiff = 0.0; - intraCostPercDiff = 0.0; - m_bIsMaxThres = false; - m_bIsHardScenecut = false; if (qpAqOffset && invQscaleFactor) memset(costEstAq, -1, sizeof(costEstAq)); @@ -314,4 +388,16 @@ } fpelPlane0 = lowresPlane0; + + if (origPic->m_param->bHistBasedSceneCut) + { + // Quarter Sampled Input Picture Formation + // TO DO: Replace with ASM function + frame_lowres_core( + lowresPlane0, + quarterSampleLowResBuffer + quarterSampleLowResOriginX + quarterSampleLowResOriginY * quarterSampleLowResStrideY, + lumaStride, + quarterSampleLowResStrideY, + widthFullRes / 4, heightFullRes / 4); + } }
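
Note: the new frame_lowres_core builds the quarter-resolution plane used by histogram-based scenecut analysis by averaging each 2x2 neighbourhood with rounding (the nested-rounding form matches the half-res lowres filter and its future assembly replacement, as the TO DO notes). A standalone C++ sketch of the same filter, assuming 8-bit pixels:

#include <cstdint>
#include <cstddef>

// Downsample src (width*2 x height*2) to dst (width x height) by averaging
// each 2x2 block, using ((a+b+1)>>1 + (c+d+1)>>1 + 1) >> 1 as in lowres.cpp.
static void lowres_downsample(const uint8_t* src, intptr_t srcStride,
                              uint8_t* dst, intptr_t dstStride,
                              int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        const uint8_t* row0 = src;              // even source row
        const uint8_t* row1 = src + srcStride;  // odd source row
        for (int x = 0; x < width; x++)
        {
            int a = row0[2 * x],     b = row1[2 * x];
            int c = row0[2 * x + 1], d = row1[2 * x + 1];
            dst[x] = (uint8_t)((((a + b + 1) >> 1) + ((c + d + 1) >> 1) + 1) >> 1);
        }
        src += srcStride * 2;   // step two source rows per destination row
        dst += dstStride;
    }
}
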
View file
x265_3.5.tar.gz/source/common/lowres.h -> x265_3.6.tar.gz/source/common/lowres.h
Changed
@@ -32,6 +32,10 @@ namespace X265_NS { // private namespace +#define HISTOGRAM_NUMBER_OF_BINS 256 +#define NUMBER_OF_SEGMENTS_IN_WIDTH 4 +#define NUMBER_OF_SEGMENTS_IN_HEIGHT 4 + struct ReferencePlanes { ReferencePlanes() { memset(this, 0, sizeof(ReferencePlanes)); } @@ -171,6 +175,7 @@ int frameNum; // Presentation frame number int sliceType; // Slice type decided by lookahead + int sliceTypeReq; // Slice type required as per the QP file int width; // width of lowres frame in pixels int lines; // height of lowres frame in pixel lines int leadingBframes; // number of leading B frames for P or I @@ -214,13 +219,13 @@ double* qpAqOffset; // AQ QP offset values for each 16x16 CU double* qpCuTreeOffset; // cuTree QP offset values for each 16x16 CU double* qpAqMotionOffset; - int* invQscaleFactor; // qScale values for qp Aq Offsets + int* invQscaleFactor; // qScale values for qp Aq Offsets int* invQscaleFactor8x8; // temporary buffer for qg-size 8 uint32_t* blockVariance; uint64_t wp_ssd3; // This is different than SSDY, this is sum(pixel^2) - sum(pixel)^2 for entire frame uint64_t wp_sum3; double frameVariance; - int* edgeInclined; + int* edgeInclined; /* cutree intermediate data */ @@ -230,18 +235,30 @@ uint32_t heightFullRes; uint32_t m_maxCUSize; uint32_t m_qgSize; - + uint16_t* propagateCost; double weightedCostDeltaX265_BFRAME_MAX + 2; ReferencePlanes weightedRefX265_BFRAME_MAX + 2; + /* For hist-based scenecut */ - bool m_bIsMaxThres; - double interPCostPercDiff; - double intraCostPercDiff; - bool m_bIsHardScenecut; + int quarterSampleLowResWidth; // width of 1/4 lowres frame in pixels + int quarterSampleLowResHeight; // height of 1/4 lowres frame in pixels + int quarterSampleLowResStrideY; + int quarterSampleLowResOriginX; + int quarterSampleLowResOriginY; + pixel *quarterSampleLowResBuffer; + bool bHistScenecutAnalyzed; + + uint16_t picAvgVariance; + uint16_t picAvgVarianceCb; + uint16_t picAvgVarianceCr; + + uint32_t ****picHistogram; + uint64_t averageIntensityPerSegmentNUMBER_OF_SEGMENTS_IN_WIDTHNUMBER_OF_SEGMENTS_IN_HEIGHT3; + uint8_t averageIntensity3; bool create(x265_param* param, PicYuv *origPic, uint32_t qgSize); - void destroy(); + void destroy(x265_param* param); void init(PicYuv *origPic, int poc); }; }
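
Note: the histogram state added to Lowres is effectively a 4-D table: 4x4 picture segments, 3 colour planes (Y/Cb/Cr is the natural reading), and 256 bins per plane. A hedged sketch of the same layout as one flat allocation with explicit index math, in place of the nested uint32_t**** that the real code builds with X265_MALLOC:

#include <cstdint>
#include <vector>

// Dimensions as defined in lowres.h.
static const int kSegW   = 4;    // NUMBER_OF_SEGMENTS_IN_WIDTH
static const int kSegH   = 4;    // NUMBER_OF_SEGMENTS_IN_HEIGHT
static const int kPlanes = 3;    // Y, Cb, Cr
static const int kBins   = 256;  // HISTOGRAM_NUMBER_OF_BINS

// Flat storage with an index helper instead of a four-level pointer table.
struct SegmentHistograms
{
    std::vector<uint32_t> bins = std::vector<uint32_t>(kSegW * kSegH * kPlanes * kBins, 0);

    uint32_t& at(int segX, int segY, int plane, int bin)
    {
        return bins[((segX * kSegH + segY) * kPlanes + plane) * kBins + bin];
    }
};

// Usage: accumulate one luma sample value v for segment (sx, sy):
//   SegmentHistograms h;
//   h.at(sx, sy, 0, v)++;
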
View file
x265_3.5.tar.gz/source/common/mv.h -> x265_3.6.tar.gz/source/common/mv.h
Changed
@@ -105,6 +105,8 @@ { return x >= _min.x && x <= _max.x && y >= _min.y && y <= _max.y; } + + void set(int32_t _x, int32_t _y) { x = _x; y = _y; } }; }
View file
x265_3.5.tar.gz/source/common/param.cpp -> x265_3.6.tar.gz/source/common/param.cpp
Changed
@@ -145,6 +145,8 @@ param->bAnnexB = 1; param->bRepeatHeaders = 0; param->bEnableAccessUnitDelimiters = 0; + param->bEnableEndOfBitstream = 0; + param->bEnableEndOfSequence = 0; param->bEmitHRDSEI = 0; param->bEmitInfoSEI = 1; param->bEmitHDRSEI = 0; /*Deprecated*/ @@ -163,12 +165,12 @@ param->keyframeMax = 250; param->gopLookahead = 0; param->bOpenGOP = 1; + param->craNal = 0; param->bframes = 4; param->lookaheadDepth = 20; param->bFrameAdaptive = X265_B_ADAPT_TRELLIS; param->bBPyramid = 1; param->scenecutThreshold = 40; /* Magic number pulled in from x264 */ - param->edgeTransitionThreshold = 0.03; param->bHistBasedSceneCut = 0; param->lookaheadSlices = 8; param->lookaheadThreads = 0; @@ -179,12 +181,20 @@ param->bEnableHRDConcatFlag = 0; param->bEnableFades = 0; param->bEnableSceneCutAwareQp = 0; - param->fwdScenecutWindow = 500; - param->fwdRefQpDelta = 5; - param->fwdNonRefQpDelta = param->fwdRefQpDelta + (SLICE_TYPE_DELTA * param->fwdRefQpDelta); - param->bwdScenecutWindow = 100; - param->bwdRefQpDelta = -1; - param->bwdNonRefQpDelta = -1; + param->fwdMaxScenecutWindow = 1200; + param->bwdMaxScenecutWindow = 600; + for (int i = 0; i < 6; i++) + { + int deltas6 = { 5, 4, 3, 2, 1, 0 }; + + param->fwdScenecutWindowi = 200; + param->fwdRefQpDeltai = deltasi; + param->fwdNonRefQpDeltai = param->fwdRefQpDeltai + (SLICE_TYPE_DELTA * param->fwdRefQpDeltai); + + param->bwdScenecutWindowi = 100; + param->bwdRefQpDeltai = -1; + param->bwdNonRefQpDeltai = -1; + } /* Intra Coding Tools */ param->bEnableConstrainedIntra = 0; @@ -278,7 +288,10 @@ param->rc.rfConstantMin = 0; param->rc.bStatRead = 0; param->rc.bStatWrite = 0; + param->rc.dataShareMode = X265_SHARE_MODE_FILE; param->rc.statFileName = NULL; + param->rc.sharedMemName = NULL; + param->rc.bEncFocusedFramesOnly = 0; param->rc.complexityBlur = 20; param->rc.qblur = 0.5; param->rc.zoneCount = 0; @@ -321,6 +334,7 @@ param->maxLuma = PIXEL_MAX; param->log2MaxPocLsb = 8; param->maxSlices = 1; + param->videoSignalTypePreset = NULL; /*Conformance window*/ param->confWinRightOffset = 0; @@ -373,10 +387,17 @@ param->bEnableSvtHevc = 0; param->svtHevcParam = NULL; + /* MCSTF */ + param->bEnableTemporalFilter = 0; + param->temporalFilterStrength = 0.95; + #ifdef SVT_HEVC param->svtHevcParam = svtParam; svt_param_default(param); #endif + /* Film grain characteristics model filename */ + param->filmGrain = NULL; + param->bEnableSBRC = 0; } int x265_param_default_preset(x265_param* param, const char* preset, const char* tune) @@ -666,6 +687,46 @@ #define atof(str) x265_atof(str, bError) #define atobool(str) (x265_atobool(str, bError)) +int x265_scenecut_aware_qp_param_parse(x265_param* p, const char* name, const char* value) +{ + bool bError = false; + char nameBuf64; + if (!name) + return X265_PARAM_BAD_NAME; + // skip -- prefix if provided + if (name0 == '-' && name1 == '-') + name += 2; + // s/_/-/g + if (strlen(name) + 1 < sizeof(nameBuf) && strchr(name, '_')) + { + char *c; + strcpy(nameBuf, name); + while ((c = strchr(nameBuf, '_')) != 0) + *c = '-'; + name = nameBuf; + } + if (!value) + value = "true"; + else if (value0 == '=') + value++; +#define OPT(STR) else if (!strcmp(name, STR)) + if (0); + OPT("scenecut-aware-qp") p->bEnableSceneCutAwareQp = x265_atoi(value, bError); + OPT("masking-strength") bError = parseMaskingStrength(p, value); + else + return X265_PARAM_BAD_NAME; +#undef OPT + return bError ? 
X265_PARAM_BAD_VALUE : 0; +} + + +/* internal versions of string-to-int with additional error checking */ +#undef atoi +#undef atof +#define atoi(str) x265_atoi(str, bError) +#define atof(str) x265_atof(str, bError) +#define atobool(str) (x265_atobool(str, bError)) + int x265_zone_param_parse(x265_param* p, const char* name, const char* value) { bool bError = false; @@ -949,10 +1010,9 @@ { bError = false; p->scenecutThreshold = atoi(value); - p->bHistBasedSceneCut = 0; } } - OPT("temporal-layers") p->bEnableTemporalSubLayers = atobool(value); + OPT("temporal-layers") p->bEnableTemporalSubLayers = atoi(value); OPT("keyint") p->keyframeMax = atoi(value); OPT("min-keyint") p->keyframeMin = atoi(value); OPT("rc-lookahead") p->lookaheadDepth = atoi(value); @@ -1184,6 +1244,7 @@ int pass = x265_clip3(0, 3, atoi(value)); p->rc.bStatWrite = pass & 1; p->rc.bStatRead = pass & 2; + p->rc.dataShareMode = X265_SHARE_MODE_FILE; } OPT("stats") p->rc.statFileName = strdup(value); OPT("scaling-list") p->scalingLists = strdup(value); @@ -1216,21 +1277,7 @@ OPT("opt-ref-list-length-pps") p->bOptRefListLengthPPS = atobool(value); OPT("multi-pass-opt-rps") p->bMultiPassOptRPS = atobool(value); OPT("scenecut-bias") p->scenecutBias = atof(value); - OPT("hist-scenecut") - { - p->bHistBasedSceneCut = atobool(value); - if (bError) - { - bError = false; - p->bHistBasedSceneCut = 0; - } - if (p->bHistBasedSceneCut) - { - bError = false; - p->scenecutThreshold = 0; - } - } - OPT("hist-threshold") p->edgeTransitionThreshold = atof(value); + OPT("hist-scenecut") p->bHistBasedSceneCut = atobool(value); OPT("rskip-edge-threshold") p->edgeVarThreshold = atoi(value)/100.0f; OPT("lookahead-threads") p->lookaheadThreads = atoi(value); OPT("opt-cu-delta-qp") p->bOptCUDeltaQP = atobool(value); @@ -1238,6 +1285,7 @@ OPT("multi-pass-opt-distortion") p->analysisMultiPassDistortion = atobool(value); OPT("aq-motion") p->bAQMotion = atobool(value); OPT("dynamic-rd") p->dynamicRd = atof(value); + OPT("cra-nal") p->craNal = atobool(value); OPT("analysis-reuse-level") { p->analysisReuseLevel = atoi(value); @@ -1348,71 +1396,7 @@ } OPT("fades") p->bEnableFades = atobool(value); OPT("scenecut-aware-qp") p->bEnableSceneCutAwareQp = atoi(value); - OPT("masking-strength") - { - int window1; - double refQpDelta1, nonRefQpDelta1; - - if (p->bEnableSceneCutAwareQp == FORWARD) - { - if (3 == sscanf(value, "%d,%lf,%lf", &window1, &refQpDelta1, &nonRefQpDelta1)) - { - if (window1 > 0) - p->fwdScenecutWindow = window1; - if (refQpDelta1 > 0) - p->fwdRefQpDelta = refQpDelta1; - if (nonRefQpDelta1 > 0) - p->fwdNonRefQpDelta = nonRefQpDelta1; - } - else - { - x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n"); - bError = true; - } - } - else if (p->bEnableSceneCutAwareQp == BACKWARD) - { - if (3 == sscanf(value, "%d,%lf,%lf", &window1, &refQpDelta1, &nonRefQpDelta1)) - { - if (window1 > 0) - p->bwdScenecutWindow = window1; - if (refQpDelta1 > 0) - p->bwdRefQpDelta = refQpDelta1; - if (nonRefQpDelta1 > 0) - p->bwdNonRefQpDelta = nonRefQpDelta1; - } - else - { - x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n"); - bError = true; - } - } - else if (p->bEnableSceneCutAwareQp == BI_DIRECTIONAL) - { - int window2; - double refQpDelta2, nonRefQpDelta2; - if (6 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf", &window1, &refQpDelta1, &nonRefQpDelta1, &window2, &refQpDelta2, &nonRefQpDelta2)) - { - if (window1 > 0) - p->fwdScenecutWindow = window1; - if (refQpDelta1 > 0) - 
p->fwdRefQpDelta = refQpDelta1; - if (nonRefQpDelta1 > 0) - p->fwdNonRefQpDelta = nonRefQpDelta1; - if (window2 > 0) - p->bwdScenecutWindow = window2; - if (refQpDelta2 > 0) - p->bwdRefQpDelta = refQpDelta2; - if (nonRefQpDelta2 > 0) - p->bwdNonRefQpDelta = nonRefQpDelta2; - } - else - { - x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n"); - bError = true; - } - } - } + OPT("masking-strength") bError |= parseMaskingStrength(p, value); OPT("field") p->bField = atobool( value ); OPT("cll") p->bEmitCLL = atobool(value); OPT("frame-dup") p->bEnableFrameDuplication = atobool(value); @@ -1446,6 +1430,13 @@ OPT("vbv-live-multi-pass") p->bliveVBV2pass = atobool(value); OPT("min-vbv-fullness") p->minVbvFullness = atof(value); OPT("max-vbv-fullness") p->maxVbvFullness = atof(value); + OPT("video-signal-type-preset") p->videoSignalTypePreset = strdup(value); + OPT("eob") p->bEnableEndOfBitstream = atobool(value); + OPT("eos") p->bEnableEndOfSequence = atobool(value); + /* Film grain characterstics model filename */ + OPT("film-grain") p->filmGrain = (char* )value; + OPT("mcstf") p->bEnableTemporalFilter = atobool(value); + OPT("sbrc") p->bEnableSBRC = atobool(value); else return X265_PARAM_BAD_NAME; } @@ -1761,8 +1752,6 @@ "scenecutThreshold must be greater than 0"); CHECK(param->scenecutBias < 0 || 100 < param->scenecutBias, "scenecut-bias must be between 0 and 100"); - CHECK(param->edgeTransitionThreshold < 0.0 || 1.0 < param->edgeTransitionThreshold, - "hist-threshold must be between 0.0 and 1.0"); CHECK(param->radl < 0 || param->radl > param->bframes, "radl must be between 0 and bframes"); CHECK(param->rdPenalty < 0 || param->rdPenalty > 2, @@ -1824,15 +1813,15 @@ "Invalid refine-ctu-distortion value, must be either 0 or 1"); CHECK(param->maxAUSizeFactor < 0.5 || param->maxAUSizeFactor > 1.0, "Supported factor for controlling max AU size is from 0.5 to 1"); - CHECK((param->dolbyProfile != 0) && (param->dolbyProfile != 50) && (param->dolbyProfile != 81) && (param->dolbyProfile != 82), - "Unsupported Dolby Vision profile, only profile 5, profile 8.1 and profile 8.2 enabled"); + CHECK((param->dolbyProfile != 0) && (param->dolbyProfile != 50) && (param->dolbyProfile != 81) && (param->dolbyProfile != 82) && (param->dolbyProfile != 84), + "Unsupported Dolby Vision profile, only profile 5, profile 8.1, profile 8.2 and profile 8.4 enabled"); CHECK(param->dupThreshold < 1 || 99 < param->dupThreshold, "Invalid frame-duplication threshold. Value must be between 1 and 99."); if (param->dolbyProfile) { CHECK((param->rc.vbvMaxBitrate <= 0 || param->rc.vbvBufferSize <= 0), "Dolby Vision requires VBV settings to enable HRD.\n"); - CHECK((param->internalBitDepth != 10), "Dolby Vision profile - 5, profile - 8.1 and profile - 8.2 is Main10 only\n"); - CHECK((param->internalCsp != X265_CSP_I420), "Dolby Vision profile - 5, profile - 8.1 and profile - 8.2 requires YCbCr 4:2:0 color space\n"); + CHECK((param->internalBitDepth != 10), "Dolby Vision profile - 5, profile - 8.1, profile - 8.2 and profile - 8.4 are Main10 only\n"); + CHECK((param->internalCsp != X265_CSP_I420), "Dolby Vision profile - 5, profile - 8.1, profile - 8.2 and profile - 8.4 requires YCbCr 4:2:0 color space\n"); if (param->dolbyProfile == 81) CHECK(!(param->masteringDisplayColorVolume), "Dolby Vision profile - 8.1 requires Mastering display color volume information\n"); } @@ -1854,19 +1843,22 @@ { CHECK(param->bEnableSceneCutAwareQp < 0 || param->bEnableSceneCutAwareQp > 3, "Invalid masking direction. 
Value must be between 0 and 3(inclusive)"); - CHECK(param->fwdScenecutWindow < 0 || param->fwdScenecutWindow > 1000, - "Invalid forward scenecut Window duration. Value must be between 0 and 1000(inclusive)"); - CHECK(param->fwdRefQpDelta < 0 || param->fwdRefQpDelta > 10, - "Invalid fwdRefQpDelta value. Value must be between 0 and 10 (inclusive)"); - CHECK(param->fwdNonRefQpDelta < 0 || param->fwdNonRefQpDelta > 10, - "Invalid fwdNonRefQpDelta value. Value must be between 0 and 10 (inclusive)"); - - CHECK(param->bwdScenecutWindow < 0 || param->bwdScenecutWindow > 1000, - "Invalid backward scenecut Window duration. Value must be between 0 and 1000(inclusive)"); - CHECK(param->bwdRefQpDelta < -1 || param->bwdRefQpDelta > 10, - "Invalid bwdRefQpDelta value. Value must be between 0 and 10 (inclusive)"); - CHECK(param->bwdNonRefQpDelta < -1 || param->bwdNonRefQpDelta > 10, - "Invalid bwdNonRefQpDelta value. Value must be between 0 and 10 (inclusive)"); + for (int i = 0; i < 6; i++) + { + CHECK(param->fwdScenecutWindow[i] < 0 || param->fwdScenecutWindow[i] > 1000, + "Invalid forward scenecut Window duration. Value must be between 0 and 1000(inclusive)"); + CHECK(param->fwdRefQpDelta[i] < 0 || param->fwdRefQpDelta[i] > 20, + "Invalid fwdRefQpDelta value. Value must be between 0 and 20 (inclusive)"); + CHECK(param->fwdNonRefQpDelta[i] < 0 || param->fwdNonRefQpDelta[i] > 20, + "Invalid fwdNonRefQpDelta value. Value must be between 0 and 20 (inclusive)"); + + CHECK(param->bwdScenecutWindow[i] < 0 || param->bwdScenecutWindow[i] > 1000, + "Invalid backward scenecut Window duration. Value must be between 0 and 1000(inclusive)"); + CHECK(param->bwdRefQpDelta[i] < -1 || param->bwdRefQpDelta[i] > 20, + "Invalid bwdRefQpDelta value. Value must be between 0 and 20 (inclusive)"); + CHECK(param->bwdNonRefQpDelta[i] < -1 || param->bwdNonRefQpDelta[i] > 20, + "Invalid bwdNonRefQpDelta value. Value must be between 0 and 20 (inclusive)"); + } } } if (param->bEnableHME) @@ -1898,6 +1890,11 @@ param->bSingleSeiNal = 0; x265_log(param, X265_LOG_WARNING, "None of the SEI messages are enabled. Disabling Single SEI NAL\n"); } + if (param->bEnableTemporalFilter && (param->frameNumThreads > 1)) + { + param->bEnableTemporalFilter = 0; + x265_log(param, X265_LOG_WARNING, "MCSTF can be enabled with frame thread = 1 only. Disabling MCSTF\n"); + } CHECK(param->confWinRightOffset < 0, "Conformance Window Right Offset must be 0 or greater"); CHECK(param->confWinBottomOffset < 0, "Conformance Window Bottom Offset must be 0 or greater"); CHECK(param->decoderVbvMaxRate < 0, "Invalid Decoder Vbv Maxrate. Value can not be less than zero"); @@ -1910,6 +1907,7 @@ x265_log(param, X265_LOG_WARNING, "Live VBV enabled without VBV settings.Disabling live VBV in 2 pass\n"); } } + CHECK(param->rc.dataShareMode != X265_SHARE_MODE_FILE && param->rc.dataShareMode != X265_SHARE_MODE_SHAREDMEM, "Invalid data share mode. 
It must be one of the X265_DATA_SHARE_MODES enum values\n" ); return check_failed; } @@ -1970,8 +1968,8 @@ x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut / bias : %d / %d / %d / %.2lf \n", param->keyframeMin, param->keyframeMax, param->scenecutThreshold, param->scenecutBias * 100); else if (param->bHistBasedSceneCut && param->keyframeMax != INT_MAX) - x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut / edge threshold : %d / %d / %d / %.2lf\n", - param->keyframeMin, param->keyframeMax, param->bHistBasedSceneCut, param->edgeTransitionThreshold); + x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut : %d / %d / %d\n", + param->keyframeMin, param->keyframeMax, param->bHistBasedSceneCut); else if (param->keyframeMax == INT_MAX) x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut : disabled\n"); @@ -2089,6 +2087,8 @@ bufSize += strlen(p->numaPools); if (p->masteringDisplayColorVolume) bufSize += strlen(p->masteringDisplayColorVolume); + if (p->videoSignalTypePreset) + bufSize += strlen(p->videoSignalTypePreset); buf = s = X265_MALLOC(char, bufSize); if (!buf) @@ -2126,10 +2126,12 @@ BOOL(p->bRepeatHeaders, "repeat-headers"); BOOL(p->bAnnexB, "annexb"); BOOL(p->bEnableAccessUnitDelimiters, "aud"); + BOOL(p->bEnableEndOfBitstream, "eob"); + BOOL(p->bEnableEndOfSequence, "eos"); BOOL(p->bEmitHRDSEI, "hrd"); BOOL(p->bEmitInfoSEI, "info"); s += sprintf(s, " hash=%d", p->decodedPictureHashSEI); - BOOL(p->bEnableTemporalSubLayers, "temporal-layers"); + s += sprintf(s, " temporal-layers=%d", p->bEnableTemporalSubLayers); BOOL(p->bOpenGOP, "open-gop"); s += sprintf(s, " min-keyint=%d", p->keyframeMin); s += sprintf(s, " keyint=%d", p->keyframeMax); @@ -2141,7 +2143,7 @@ s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth); s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices); s += sprintf(s, " scenecut=%d", p->scenecutThreshold); - s += sprintf(s, " hist-scenecut=%d", p->bHistBasedSceneCut); + BOOL(p->bHistBasedSceneCut, "hist-scenecut"); s += sprintf(s, " radl=%d", p->radl); BOOL(p->bEnableHRDConcatFlag, "splice"); BOOL(p->bIntraRefresh, "intra-refresh"); @@ -2295,7 +2297,6 @@ BOOL(p->bOptRefListLengthPPS, "opt-ref-list-length-pps"); BOOL(p->bMultiPassOptRPS, "multi-pass-opt-rps"); s += sprintf(s, " scenecut-bias=%.2f", p->scenecutBias); - s += sprintf(s, " hist-threshold=%.2f", p->edgeTransitionThreshold); BOOL(p->bOptCUDeltaQP, "opt-cu-delta-qp"); BOOL(p->bAQMotion, "aq-motion"); BOOL(p->bEmitHDR10SEI, "hdr10"); @@ -2328,10 +2329,14 @@ s += sprintf(s, " qp-adaptation-range=%.2f", p->rc.qpAdaptationRange); s += sprintf(s, " scenecut-aware-qp=%d", p->bEnableSceneCutAwareQp); if (p->bEnableSceneCutAwareQp) - s += sprintf(s, " fwd-scenecut-window=%d fwd-ref-qp-delta=%f fwd-nonref-qp-delta=%f bwd-scenecut-window=%d bwd-ref-qp-delta=%f bwd-nonref-qp-delta=%f", p->fwdScenecutWindow, p->fwdRefQpDelta, p->fwdNonRefQpDelta, p->bwdScenecutWindow, p->bwdRefQpDelta, p->bwdNonRefQpDelta); + s += sprintf(s, " fwd-scenecut-window=%d fwd-ref-qp-delta=%f fwd-nonref-qp-delta=%f bwd-scenecut-window=%d bwd-ref-qp-delta=%f bwd-nonref-qp-delta=%f", p->fwdMaxScenecutWindow, p->fwdRefQpDelta0, p->fwdNonRefQpDelta0, p->bwdMaxScenecutWindow, p->bwdRefQpDelta0, p->bwdNonRefQpDelta0); s += sprintf(s, "conformance-window-offsets right=%d bottom=%d", p->confWinRightOffset, p->confWinBottomOffset); s += sprintf(s, " decoder-max-rate=%d", p->decoderVbvMaxRate); BOOL(p->bliveVBV2pass, "vbv-live-multi-pass"); + if (p->filmGrain) + s += sprintf(s, " film-grain=%s", p->filmGrain); 
// Film grain characteristics model filename + BOOL(p->bEnableTemporalFilter, "mcstf"); + BOOL(p->bEnableSBRC, "sbrc"); #undef BOOL return buf; } @@ -2406,6 +2411,151 @@ return false; } +bool parseMaskingStrength(x265_param* p, const char* value) +{ + bool bError = false; + int window16; + double refQpDelta16, nonRefQpDelta16; + if (p->bEnableSceneCutAwareQp == FORWARD) + { + if (3 == sscanf(value, "%d,%lf,%lf", &window10, &refQpDelta10, &nonRefQpDelta10)) + { + if (window10 > 0) + p->fwdMaxScenecutWindow = window10; + if (refQpDelta10 > 0) + p->fwdRefQpDelta0 = refQpDelta10; + if (nonRefQpDelta10 > 0) + p->fwdNonRefQpDelta0 = nonRefQpDelta10; + + p->fwdScenecutWindow0 = p->fwdMaxScenecutWindow / 6; + for (int i = 1; i < 6; i++) + { + p->fwdScenecutWindowi = p->fwdMaxScenecutWindow / 6; + p->fwdRefQpDeltai = p->fwdRefQpDeltai - 1 - (0.15 * p->fwdRefQpDeltai - 1); + p->fwdNonRefQpDeltai = p->fwdNonRefQpDeltai - 1 - (0.15 * p->fwdNonRefQpDeltai - 1); + } + } + else if (18 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf" + , &window10, &refQpDelta10, &nonRefQpDelta10, &window11, &refQpDelta11, &nonRefQpDelta11 + , &window12, &refQpDelta12, &nonRefQpDelta12, &window13, &refQpDelta13, &nonRefQpDelta13 + , &window14, &refQpDelta14, &nonRefQpDelta14, &window15, &refQpDelta15, &nonRefQpDelta15)) + { + p->fwdMaxScenecutWindow = 0; + for (int i = 0; i < 6; i++) + { + p->fwdScenecutWindowi = window1i; + p->fwdRefQpDeltai = refQpDelta1i; + p->fwdNonRefQpDeltai = nonRefQpDelta1i; + p->fwdMaxScenecutWindow += p->fwdScenecutWindowi; + } + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n"); + bError = true; + } + } + else if (p->bEnableSceneCutAwareQp == BACKWARD) + { + if (3 == sscanf(value, "%d,%lf,%lf", &window10, &refQpDelta10, &nonRefQpDelta10)) + { + if (window10 > 0) + p->bwdMaxScenecutWindow = window10; + if (refQpDelta10 > 0) + p->bwdRefQpDelta0 = refQpDelta10; + if (nonRefQpDelta10 > 0) + p->bwdNonRefQpDelta0 = nonRefQpDelta10; + + p->bwdScenecutWindow0 = p->bwdMaxScenecutWindow / 6; + for (int i = 1; i < 6; i++) + { + p->bwdScenecutWindowi = p->bwdMaxScenecutWindow / 6; + p->bwdRefQpDeltai = p->bwdRefQpDeltai - 1 - (0.15 * p->bwdRefQpDeltai - 1); + p->bwdNonRefQpDeltai = p->bwdNonRefQpDeltai - 1 - (0.15 * p->bwdNonRefQpDeltai - 1); + } + } + else if (18 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf" + , &window10, &refQpDelta10, &nonRefQpDelta10, &window11, &refQpDelta11, &nonRefQpDelta11 + , &window12, &refQpDelta12, &nonRefQpDelta12, &window13, &refQpDelta13, &nonRefQpDelta13 + , &window14, &refQpDelta14, &nonRefQpDelta14, &window15, &refQpDelta15, &nonRefQpDelta15)) + { + p->bwdMaxScenecutWindow = 0; + for (int i = 0; i < 6; i++) + { + p->bwdScenecutWindowi = window1i; + p->bwdRefQpDeltai = refQpDelta1i; + p->bwdNonRefQpDeltai = nonRefQpDelta1i; + p->bwdMaxScenecutWindow += p->bwdScenecutWindowi; + } + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n"); + bError = true; + } + } + else if (p->bEnableSceneCutAwareQp == BI_DIRECTIONAL) + { + int window26; + double refQpDelta26, nonRefQpDelta26; + if (6 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf", &window10, &refQpDelta10, &nonRefQpDelta10, &window20, &refQpDelta20, &nonRefQpDelta20)) + { + if (window10 > 0) + p->fwdMaxScenecutWindow = window10; + if (refQpDelta10 > 0) + p->fwdRefQpDelta0 = refQpDelta10; + if (nonRefQpDelta10 > 0) + p->fwdNonRefQpDelta0 = 
nonRefQpDelta10; + if (window20 > 0) + p->bwdMaxScenecutWindow = window20; + if (refQpDelta20 > 0) + p->bwdRefQpDelta0 = refQpDelta20; + if (nonRefQpDelta20 > 0) + p->bwdNonRefQpDelta0 = nonRefQpDelta20; + + p->fwdScenecutWindow0 = p->fwdMaxScenecutWindow / 6; + p->bwdScenecutWindow0 = p->bwdMaxScenecutWindow / 6; + for (int i = 1; i < 6; i++) + { + p->fwdScenecutWindowi = p->fwdMaxScenecutWindow / 6; + p->bwdScenecutWindowi = p->bwdMaxScenecutWindow / 6; + p->fwdRefQpDeltai = p->fwdRefQpDeltai - 1 - (0.15 * p->fwdRefQpDeltai - 1); + p->fwdNonRefQpDeltai = p->fwdNonRefQpDeltai - 1 - (0.15 * p->fwdNonRefQpDeltai - 1); + p->bwdRefQpDeltai = p->bwdRefQpDeltai - 1 - (0.15 * p->bwdRefQpDeltai - 1); + p->bwdNonRefQpDeltai = p->bwdNonRefQpDeltai - 1 - (0.15 * p->bwdNonRefQpDeltai - 1); + } + } + else if (36 == sscanf(value, "%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf,%d,%lf,%lf" + , &window10, &refQpDelta10, &nonRefQpDelta10, &window11, &refQpDelta11, &nonRefQpDelta11 + , &window12, &refQpDelta12, &nonRefQpDelta12, &window13, &refQpDelta13, &nonRefQpDelta13 + , &window14, &refQpDelta14, &nonRefQpDelta14, &window15, &refQpDelta15, &nonRefQpDelta15 + , &window20, &refQpDelta20, &nonRefQpDelta20, &window21, &refQpDelta21, &nonRefQpDelta21 + , &window22, &refQpDelta22, &nonRefQpDelta22, &window23, &refQpDelta23, &nonRefQpDelta23 + , &window24, &refQpDelta24, &nonRefQpDelta24, &window25, &refQpDelta25, &nonRefQpDelta25)) + { + p->fwdMaxScenecutWindow = 0; + p->bwdMaxScenecutWindow = 0; + for (int i = 0; i < 6; i++) + { + p->fwdScenecutWindowi = window1i; + p->fwdRefQpDeltai = refQpDelta1i; + p->fwdNonRefQpDeltai = nonRefQpDelta1i; + p->bwdScenecutWindowi = window2i; + p->bwdRefQpDeltai = refQpDelta2i; + p->bwdNonRefQpDeltai = nonRefQpDelta2i; + p->fwdMaxScenecutWindow += p->fwdScenecutWindowi; + p->bwdMaxScenecutWindow += p->bwdScenecutWindowi; + } + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Specify all the necessary offsets for masking-strength \n"); + bError = true; + } + } + return bError; +} + void x265_copy_params(x265_param* dst, x265_param* src) { dst->cpuid = src->cpuid; @@ -2440,10 +2590,13 @@ dst->bRepeatHeaders = src->bRepeatHeaders; dst->bAnnexB = src->bAnnexB; dst->bEnableAccessUnitDelimiters = src->bEnableAccessUnitDelimiters; + dst->bEnableEndOfBitstream = src->bEnableEndOfBitstream; + dst->bEnableEndOfSequence = src->bEnableEndOfSequence; dst->bEmitInfoSEI = src->bEmitInfoSEI; dst->decodedPictureHashSEI = src->decodedPictureHashSEI; dst->bEnableTemporalSubLayers = src->bEnableTemporalSubLayers; dst->bOpenGOP = src->bOpenGOP; + dst->craNal = src->craNal; dst->keyframeMax = src->keyframeMax; dst->keyframeMin = src->keyframeMin; dst->bframes = src->bframes; @@ -2541,8 +2694,11 @@ dst->rc.rfConstantMin = src->rc.rfConstantMin; dst->rc.bStatWrite = src->rc.bStatWrite; dst->rc.bStatRead = src->rc.bStatRead; + dst->rc.dataShareMode = src->rc.dataShareMode; if (src->rc.statFileName) dst->rc.statFileName=strdup(src->rc.statFileName); else dst->rc.statFileName = NULL; + if (src->rc.sharedMemName) dst->rc.sharedMemName = strdup(src->rc.sharedMemName); + else dst->rc.sharedMemName = NULL; dst->rc.qblur = src->rc.qblur; dst->rc.complexityBlur = src->rc.complexityBlur; dst->rc.bEnableSlowFirstPass = src->rc.bEnableSlowFirstPass; @@ -2550,6 +2706,7 @@ dst->rc.zonefileCount = src->rc.zonefileCount; dst->reconfigWindowSize = src->reconfigWindowSize; dst->bResetZoneConfig = src->bResetZoneConfig; + dst->bNoResetZoneConfig 
= src->bNoResetZoneConfig; dst->decoderVbvMaxRate = src->decoderVbvMaxRate; if (src->rc.zonefileCount && src->rc.zones && src->bResetZoneConfig) @@ -2557,6 +2714,7 @@ for (int i = 0; i < src->rc.zonefileCount; i++) { dst->rc.zonesi.startFrame = src->rc.zonesi.startFrame; + dst->rc.zones0.keyframeMax = src->rc.zones0.keyframeMax; memcpy(dst->rc.zonesi.zoneParam, src->rc.zonesi.zoneParam, sizeof(x265_param)); } } @@ -2621,7 +2779,6 @@ dst->bOptRefListLengthPPS = src->bOptRefListLengthPPS; dst->bMultiPassOptRPS = src->bMultiPassOptRPS; dst->scenecutBias = src->scenecutBias; - dst->edgeTransitionThreshold = src->edgeTransitionThreshold; dst->gopLookahead = src->lookaheadDepth; dst->bOptCUDeltaQP = src->bOptCUDeltaQP; dst->analysisMultiPassDistortion = src->analysisMultiPassDistortion; @@ -2682,20 +2839,33 @@ dst->bEnableSvtHevc = src->bEnableSvtHevc; dst->bEnableFades = src->bEnableFades; dst->bEnableSceneCutAwareQp = src->bEnableSceneCutAwareQp; - dst->fwdScenecutWindow = src->fwdScenecutWindow; - dst->fwdRefQpDelta = src->fwdRefQpDelta; - dst->fwdNonRefQpDelta = src->fwdNonRefQpDelta; - dst->bwdScenecutWindow = src->bwdScenecutWindow; - dst->bwdRefQpDelta = src->bwdRefQpDelta; - dst->bwdNonRefQpDelta = src->bwdNonRefQpDelta; + dst->fwdMaxScenecutWindow = src->fwdMaxScenecutWindow; + dst->bwdMaxScenecutWindow = src->bwdMaxScenecutWindow; + for (int i = 0; i < 6; i++) + { + dst->fwdScenecutWindowi = src->fwdScenecutWindowi; + dst->fwdRefQpDeltai = src->fwdRefQpDeltai; + dst->fwdNonRefQpDeltai = src->fwdNonRefQpDeltai; + dst->bwdScenecutWindowi = src->bwdScenecutWindowi; + dst->bwdRefQpDeltai = src->bwdRefQpDeltai; + dst->bwdNonRefQpDeltai = src->bwdNonRefQpDeltai; + } dst->bField = src->bField; - + dst->bEnableTemporalFilter = src->bEnableTemporalFilter; + dst->temporalFilterStrength = src->temporalFilterStrength; dst->confWinRightOffset = src->confWinRightOffset; dst->confWinBottomOffset = src->confWinBottomOffset; dst->bliveVBV2pass = src->bliveVBV2pass; + + if (src->videoSignalTypePreset) dst->videoSignalTypePreset = strdup(src->videoSignalTypePreset); + else dst->videoSignalTypePreset = NULL; #ifdef SVT_HEVC memcpy(dst->svtHevcParam, src->svtHevcParam, sizeof(EB_H265_ENC_CONFIGURATION)); #endif + /* Film grain */ + if (src->filmGrain) + dst->filmGrain = src->filmGrain; + dst->bEnableSBRC = src->bEnableSBRC; } #ifdef SVT_HEVC
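Note (illustration only, not part of the upstream diff): the new parseMaskingStrength() above accepts either one window/QP-delta triplet or a full set of six; when only one triplet is supplied, each of the remaining five segments inherits 85% of the previous segment's delta. A minimal C++ sketch of that decay, assuming a user-supplied refQpDelta of 8.0:

    // illustrative sketch only - mirrors the 0.15 decay applied in parseMaskingStrength()
    double refQpDelta[6];
    refQpDelta[0] = 8.0;                                              // taken from --masking-strength
    for (int i = 1; i < 6; i++)
        refQpDelta[i] = refQpDelta[i - 1] - 0.15 * refQpDelta[i - 1]; // 8.0, 6.8, 5.78, 4.913, ...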
View file
x265_3.5.tar.gz/source/common/param.h -> x265_3.6.tar.gz/source/common/param.h
Changed
@@ -38,6 +38,7 @@ void getParamAspectRatio(x265_param *p, int& width, int& height); bool parseLambdaFile(x265_param *param); void x265_copy_params(x265_param* dst, x265_param* src); +bool parseMaskingStrength(x265_param* p, const char* value); /* this table is kept internal to avoid confusion, since log level indices start at -1 */ static const char * const logLevelNames[] = { "none", "error", "warning", "info", "debug", "full", 0 }; @@ -52,6 +53,7 @@ int x265_param_default_preset(x265_param *, const char *preset, const char *tune); int x265_param_apply_profile(x265_param *, const char *profile); int x265_param_parse(x265_param *p, const char *name, const char *value); +int x265_scenecut_aware_qp_param_parse(x265_param* p, const char* name, const char* value); int x265_zone_param_parse(x265_param* p, const char* name, const char* value); #define PARAM_NS X265_NS #endif
View file
x265_3.5.tar.gz/source/common/piclist.cpp -> x265_3.6.tar.gz/source/common/piclist.cpp
Changed
@@ -45,6 +45,25 @@ m_count++; } +void PicList::pushFrontMCSTF(Frame& curFrame) +{ + X265_CHECK(!curFrame.m_nextMCSTF && !curFrame.m_nextMCSTF, "piclist: picture already in OPB list\n"); // ensure frame is not in a list + curFrame.m_nextMCSTF = m_start; + curFrame.m_prevMCSTF = NULL; + + if (m_count) + { + m_start->m_prevMCSTF = &curFrame; + m_start = &curFrame; + } + else + { + m_start = m_end = &curFrame; + } + m_count++; + +} + void PicList::pushBack(Frame& curFrame) { X265_CHECK(!curFrame.m_next && !curFrame.m_prev, "piclist: picture already in list\n"); // ensure frame is not in a list @@ -63,6 +82,24 @@ m_count++; } +void PicList::pushBackMCSTF(Frame& curFrame) +{ + X265_CHECK(!curFrame.m_nextMCSTF && !curFrame.m_prevMCSTF, "piclist: picture already in OPB list\n"); // ensure frame is not in a list + curFrame.m_nextMCSTF = NULL; + curFrame.m_prevMCSTF = m_end; + + if (m_count) + { + m_end->m_nextMCSTF = &curFrame; + m_end = &curFrame; + } + else + { + m_start = m_end = &curFrame; + } + m_count++; +} + Frame *PicList::popFront() { if (m_start) @@ -94,6 +131,14 @@ return curFrame; } +Frame* PicList::getPOCMCSTF(int poc) +{ + Frame *curFrame = m_start; + while (curFrame && curFrame->m_poc != poc) + curFrame = curFrame->m_nextMCSTF; + return curFrame; +} + Frame *PicList::popBack() { if (m_end) @@ -117,6 +162,29 @@ return NULL; } +Frame *PicList::popBackMCSTF() +{ + if (m_end) + { + Frame* temp = m_end; + m_count--; + + if (m_count) + { + m_end = m_end->m_prevMCSTF; + m_end->m_nextMCSTF = NULL; + } + else + { + m_start = m_end = NULL; + } + temp->m_nextMCSTF = temp->m_prevMCSTF = NULL; + return temp; + } + else + return NULL; +} + Frame* PicList::getCurFrame(void) { Frame *curFrame = m_start; @@ -158,3 +226,36 @@ curFrame.m_next = curFrame.m_prev = NULL; } + +void PicList::removeMCSTF(Frame& curFrame) +{ +#if _DEBUG + Frame *tmp = m_start; + while (tmp && tmp != &curFrame) + { + tmp = tmp->m_nextMCSTF; + } + + X265_CHECK(tmp == &curFrame, "framelist: pic being removed was not in list\n"); // verify pic is in this list +#endif + + m_count--; + if (m_count) + { + if (m_start == &curFrame) + m_start = curFrame.m_nextMCSTF; + if (m_end == &curFrame) + m_end = curFrame.m_prevMCSTF; + + if (curFrame.m_nextMCSTF) + curFrame.m_nextMCSTF->m_prevMCSTF = curFrame.m_prevMCSTF; + if (curFrame.m_prevMCSTF) + curFrame.m_prevMCSTF->m_nextMCSTF = curFrame.m_nextMCSTF; + } + else + { + m_start = m_end = NULL; + } + + curFrame.m_nextMCSTF = curFrame.m_prevMCSTF = NULL; +}
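Note (illustrative sketch, not from the upstream tree): the new *MCSTF list operations use a second pair of links (m_nextMCSTF / m_prevMCSTF) that never touch m_next / m_prev, so a frame can be queued in an MCSTF origin-picture list while it also sits in an ordinary encode list. Assuming "frame" is a Frame* obtained elsewhere:

    // sketch only: the two link sets are maintained independently
    PicList encodeQueue, mcstfQueue;
    encodeQueue.pushBack(*frame);                      // ordinary m_next / m_prev links
    mcstfQueue.pushBackMCSTF(*frame);                  // MCSTF-only links
    Frame* ref = mcstfQueue.getPOCMCSTF(frame->m_poc); // POC lookup along the MCSTF links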
View file
x265_3.5.tar.gz/source/common/piclist.h -> x265_3.6.tar.gz/source/common/piclist.h
Changed
@@ -49,24 +49,31 @@ /** Push picture to end of the list */ void pushBack(Frame& pic); + void pushBackMCSTF(Frame& pic); /** Push picture to beginning of the list */ void pushFront(Frame& pic); + void pushFrontMCSTF(Frame& pic); /** Pop picture from end of the list */ Frame* popBack(); + Frame* popBackMCSTF(); /** Pop picture from beginning of the list */ Frame* popFront(); /** Find frame with specified POC */ Frame* getPOC(int poc); + /* Find next MCSTF frame with specified POC */ + Frame* getPOCMCSTF(int poc); /** Get the current Frame from the list **/ Frame* getCurFrame(void); /** Remove picture from list */ void remove(Frame& pic); + /* Remove MCSTF picture from list */ + void removeMCSTF(Frame& pic); Frame* first() { return m_start; }
View file
x265_3.5.tar.gz/source/common/picyuv.cpp -> x265_3.6.tar.gz/source/common/picyuv.cpp
Changed
@@ -125,6 +125,58 @@ return false; } +/*Copy pixels from the picture buffer of a frame to picture buffer of another frame*/ +void PicYuv::copyFromFrame(PicYuv* source) +{ + uint32_t numCuInHeight = (m_picHeight + m_param->maxCUSize - 1) / m_param->maxCUSize; + + int maxHeight = numCuInHeight * m_param->maxCUSize; + memcpy(m_picBuf[0], source->m_picBuf[0], sizeof(pixel)* m_stride * (maxHeight + (m_lumaMarginY * 2))); + m_picOrg[0] = m_picBuf[0] + m_lumaMarginY * m_stride + m_lumaMarginX; + + if (m_picCsp != X265_CSP_I400) + { + memcpy(m_picBuf[1], source->m_picBuf[1], sizeof(pixel)* m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2))); + memcpy(m_picBuf[2], source->m_picBuf[2], sizeof(pixel)* m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2))); + + m_picOrg[1] = m_picBuf[1] + m_chromaMarginY * m_strideC + m_chromaMarginX; + m_picOrg[2] = m_picBuf[2] + m_chromaMarginY * m_strideC + m_chromaMarginX; + } + else + { + m_picBuf[1] = m_picBuf[2] = NULL; + m_picOrg[1] = m_picOrg[2] = NULL; + } +} + +bool PicYuv::createScaledPicYUV(x265_param* param, uint8_t scaleFactor) +{ + m_param = param; + m_picWidth = m_param->sourceWidth / scaleFactor; + m_picHeight = m_param->sourceHeight / scaleFactor; + + m_picCsp = m_param->internalCsp; + m_hChromaShift = CHROMA_H_SHIFT(m_picCsp); + m_vChromaShift = CHROMA_V_SHIFT(m_picCsp); + + uint32_t numCuInWidth = (m_picWidth + param->maxCUSize - 1) / param->maxCUSize; + uint32_t numCuInHeight = (m_picHeight + param->maxCUSize - 1) / param->maxCUSize; + + m_lumaMarginX = 128; // search margin for L0 and L1 ME in horizontal direction + m_lumaMarginY = 128; // search margin for L0 and L1 ME in vertical direction + m_stride = (numCuInWidth * param->maxCUSize) + (m_lumaMarginX << 1); + + int maxHeight = numCuInHeight * param->maxCUSize; + CHECKED_MALLOC_ZERO(m_picBuf[0], pixel, m_stride * (maxHeight + (m_lumaMarginY * 2))); + m_picOrg[0] = m_picBuf[0] + m_lumaMarginY * m_stride + m_lumaMarginX; + m_picBuf[1] = m_picBuf[2] = NULL; + m_picOrg[1] = m_picOrg[2] = NULL; + return true; + +fail: + return false; +} + int PicYuv::getLumaBufLen(uint32_t picWidth, uint32_t picHeight, uint32_t picCsp) { m_picWidth = picWidth;
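Worked example (editor's note, numbers are illustrative): for a 1920x1080 source with scaleFactor 2 and maxCUSize 64, createScaledPicYUV() above sizes the half-resolution luma plane as 960x540; numCuInWidth = (960 + 63) / 64 = 15, so m_stride = 15 * 64 + 2 * 128 = 1216 pixels per row, and maxHeight rounds 540 up to 9 * 64 = 576 rows plus the two 128-row margins. Only the luma buffer is allocated; the chroma pointers are left NULL.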
View file
x265_3.5.tar.gz/source/common/picyuv.h -> x265_3.6.tar.gz/source/common/picyuv.h
Changed
@@ -78,11 +78,13 @@ PicYuv(); bool create(x265_param* param, bool picAlloc = true, pixel *pixelbuf = NULL); + bool createScaledPicYUV(x265_param* param, uint8_t scaleFactor); bool createOffsets(const SPS& sps); void destroy(); int getLumaBufLen(uint32_t picWidth, uint32_t picHeight, uint32_t picCsp); void copyFromPicture(const x265_picture&, const x265_param& param, int padx, int pady); + void copyFromFrame(PicYuv* source); intptr_t getChromaAddrOffset(uint32_t ctuAddr, uint32_t absPartIdx) const { return m_cuOffsetC[ctuAddr] + m_buOffsetC[absPartIdx]; }
View file
x265_3.5.tar.gz/source/common/pixel.cpp -> x265_3.6.tar.gz/source/common/pixel.cpp
Changed
@@ -266,7 +266,7 @@ { int satd = 0; -#if ENABLE_ASSEMBLY && X265_ARCH_ARM64 +#if ENABLE_ASSEMBLY && X265_ARCH_ARM64 && !HIGH_BIT_DEPTH pixelcmp_t satd_4x4 = x265_pixel_satd_4x4_neon; #endif @@ -284,7 +284,7 @@ { int satd = 0; -#if ENABLE_ASSEMBLY && X265_ARCH_ARM64 +#if ENABLE_ASSEMBLY && X265_ARCH_ARM64 && !HIGH_BIT_DEPTH pixelcmp_t satd_8x4 = x265_pixel_satd_8x4_neon; #endif @@ -627,6 +627,23 @@ } } +static +void frame_subsample_luma(const pixel* src0, pixel* dst0, intptr_t src_stride, intptr_t dst_stride, int width, int height) +{ + for (int y = 0; y < height; y++, src0 += 2 * src_stride, dst0 += dst_stride) + { + const pixel *inRow = src0; + const pixel *inRowBelow = src0 + src_stride; + pixel *target = dst0; + for (int x = 0; x < width; x++) + { + target[x] = (((inRow[0] + inRowBelow[0] + 1) >> 1) + ((inRow[1] + inRowBelow[1] + 1) >> 1) + 1) >> 1; + inRow += 2; + inRowBelow += 2; + } + } +} + /* structural similarity metric */ static void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4]) { @@ -1355,5 +1372,7 @@ p.cu[BLOCK_16x16].normFact = normFact_c; p.cu[BLOCK_32x32].normFact = normFact_c; p.cu[BLOCK_64x64].normFact = normFact_c; + /* SubSample Luma*/ + p.frameSubSampleLuma = frame_subsample_luma; } }
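Worked example (editor's note): frame_subsample_luma() above halves the resolution by averaging each 2x2 neighbourhood with two stages of biased rounding. Taking column samples 10 over 12 and 13 over 15: (10 + 12 + 1) >> 1 = 11 and (13 + 15 + 1) >> 1 = 14, then (11 + 14 + 1) >> 1 = 13 is written to the subsampled output pixel.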
View file
x265_3.5.tar.gz/source/common/ppc/intrapred_altivec.cpp -> x265_3.6.tar.gz/source/common/ppc/intrapred_altivec.cpp
Changed
@@ -27,7 +27,7 @@ #include <assert.h> #include <math.h> #include <cmath> -#include <linux/types.h> +#include <sys/types.h> #include <stdlib.h> #include <stdio.h> #include <stdint.h>
View file
x265_3.5.tar.gz/source/common/primitives.h -> x265_3.6.tar.gz/source/common/primitives.h
Changed
@@ -232,6 +232,8 @@ typedef void(*psyRdoQuant_t2)(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos); typedef void(*ssimDistortion_t)(const pixel *fenc, uint32_t fStride, const pixel *recon, intptr_t rstride, uint64_t *ssBlock, int shift, uint64_t *ac_k); typedef void(*normFactor_t)(const pixel *src, uint32_t blockSize, int shift, uint64_t *z_k); +/* SubSampling Luma */ +typedef void (*downscaleluma_t)(const pixel* src0, pixel* dstf, intptr_t src_stride, intptr_t dst_stride, int width, int height); /* Function pointers to optimized encoder primitives. Each pointer can reference * either an assembly routine, a SIMD intrinsic primitive, or a C function */ struct EncoderPrimitives @@ -353,6 +355,8 @@ downscale_t frameInitLowres; downscale_t frameInitLowerRes; + /* Sub Sample Luma */ + downscaleluma_t frameSubSampleLuma; cutree_propagate_cost propagateCost; cutree_fix8_unpack fix8Unpack; cutree_fix8_pack fix8Pack; @@ -488,7 +492,7 @@ #if ENABLE_ASSEMBLY && X265_ARCH_ARM64 extern "C" { -#include "aarch64/pixel-util.h" +#include "aarch64/fun-decls.h" } #endif
View file
x265_3.6.tar.gz/source/common/ringmem.cpp
Added
@@ -0,0 +1,357 @@ +/***************************************************************************** + * Copyright (C) 2013-2017 MulticoreWare, Inc + * + * Authors: liwei <liwei@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com + *****************************************************************************/ + +#include "ringmem.h" + +#ifndef _WIN32 +#include <sys/mman.h> +#endif ////< _WIN32 + +#ifdef _WIN32 +#define X265_SHARED_MEM_NAME "Local\\_x265_shr_mem_" +#define X265_SEMAPHORE_RINGMEM_WRITER_NAME "_x265_semW_" +#define X265_SEMAPHORE_RINGMEM_READER_NAME "_x265_semR_" +#else /* POSIX / pthreads */ +#define X265_SHARED_MEM_NAME "/tmp/_x265_shr_mem_" +#define X265_SEMAPHORE_RINGMEM_WRITER_NAME "/tmp/_x265_semW_" +#define X265_SEMAPHORE_RINGMEM_READER_NAME "/tmp/_x265_semR_" +#endif + +#define RINGMEM_ALLIGNMENT 64 + +namespace X265_NS { + RingMem::RingMem() + : m_initialized(false) + , m_protectRW(false) + , m_itemSize(0) + , m_itemCnt(0) + , m_dataPool(NULL) + , m_shrMem(NULL) +#ifdef _WIN32 + , m_handle(NULL) +#else //_WIN32 + , m_filepath(NULL) +#endif //_WIN32 + , m_writeSem(NULL) + , m_readSem(NULL) + { + } + + + RingMem::~RingMem() + { + } + + bool RingMem::skipRead(int32_t cnt) { + if (!m_initialized) + { + return false; + } + + if (m_protectRW) + { + for (int i = 0; i < cnt; i++) + { + m_readSem->take(); + } + } + + ATOMIC_ADD(&m_shrMem->m_read, cnt); + + if (m_protectRW) + { + m_writeSem->give(cnt); + } + + return true; + } + + bool RingMem::skipWrite(int32_t cnt) { + if (!m_initialized) + { + return false; + } + + if (m_protectRW) + { + for (int i = 0; i < cnt; i++) + { + m_writeSem->take(); + } + } + + ATOMIC_ADD(&m_shrMem->m_write, cnt); + + if (m_protectRW) + { + m_readSem->give(cnt); + } + + return true; + } + + ///< initialize + bool RingMem::init(int32_t itemSize, int32_t itemCnt, const char *name, bool protectRW) + { + ///< check parameters + if (itemSize <= 0 || itemCnt <= 0 || NULL == name) + { + ///< invalid parameters + return false; + } + + if (!m_initialized) + { + ///< formating names + char nameBufMAX_SHR_NAME_LEN = { 0 }; + + ///< shared memory name + snprintf(nameBuf, sizeof(nameBuf) - 1, "%s%s", X265_SHARED_MEM_NAME, name); + + ///< create or open shared memory + bool newCreated = false; + + ///< calculate the size of the shared memory + int32_t shrMemSize = (itemSize * itemCnt + sizeof(ShrMemCtrl) + RINGMEM_ALLIGNMENT - 1) & ~(RINGMEM_ALLIGNMENT - 1); + +#ifdef _WIN32 + HANDLE h = OpenFileMappingA(FILE_MAP_WRITE | FILE_MAP_READ, FALSE, nameBuf); + if (!h) + { + h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE, 0, shrMemSize, nameBuf); + + if (!h) + { + return false; + } + + newCreated = true; + } + + void *pool = 
MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 0); + + ///< should not close the handle here, otherwise the OpenFileMapping would fail + //CloseHandle(h); + m_handle = h; + + if (!pool) + { + return false; + } + +#else /* POSIX / pthreads */ + mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH; + int flag = O_RDWR; + int shrfd = -1; + if ((shrfd = open(nameBuf, flag, mode)) < 0) + { + flag |= O_CREAT; + + shrfd = open(nameBuf, flag, mode); + if (shrfd < 0) + { + return false; + } + newCreated = true; + + lseek(shrfd, shrMemSize - 1, SEEK_SET); + + if (-1 == write(shrfd, "\0", 1)) + { + close(shrfd); + return false; + } + + if (lseek(shrfd, 0, SEEK_END) < shrMemSize) + { + close(shrfd); + return false; + } + } + + void *pool = mmap(0, + shrMemSize, + PROT_READ | PROT_WRITE, + MAP_SHARED, + shrfd, + 0); + + close(shrfd); + if (pool == MAP_FAILED) + { + return false; + } + + m_filepath = strdup(nameBuf); +#endif ///< _WIN32 + + if (newCreated) + { + memset(pool, 0, shrMemSize); + } + + m_shrMem = reinterpret_cast<ShrMemCtrl *>(pool); + m_dataPool = reinterpret_cast<uint8_t *>(pool) + sizeof(ShrMemCtrl); + m_itemSize = itemSize; + m_itemCnt = itemCnt; + m_initialized = true; + + if (protectRW) + { + m_protectRW = true; + m_writeSem = new NamedSemaphore(); + if (!m_writeSem) + { + release(); + return false; + } + + ///< shared memory name + snprintf(nameBuf, sizeof(nameBuf) - 1, "%s%s", X265_SEMAPHORE_RINGMEM_WRITER_NAME, name); + if (!m_writeSem->create(nameBuf, m_itemCnt, m_itemCnt)) + { + release(); + return false; + } + + m_readSem = new NamedSemaphore(); + if (!m_readSem) + { + release(); + return false; + } + + ///< shared memory name + snprintf(nameBuf, sizeof(nameBuf) - 1, "%s%s", X265_SEMAPHORE_RINGMEM_READER_NAME, name); + if (!m_readSem->create(nameBuf, 0, m_itemCnt)) + { + release(); + return false; + } + } + } + + return true; + } + ///< finalize + void RingMem::release() + { + if (m_initialized) + { + m_initialized = false; + + if (m_shrMem) + { +#ifdef _WIN32 + UnmapViewOfFile(m_shrMem); + CloseHandle(m_handle); + m_handle = NULL; +#else /* POSIX / pthreads */ + int32_t shrMemSize = (m_itemSize * m_itemCnt + sizeof(ShrMemCtrl) + RINGMEM_ALLIGNMENT - 1) & (~RINGMEM_ALLIGNMENT - 1); + munmap(m_shrMem, shrMemSize); + unlink(m_filepath); + free(m_filepath); + m_filepath = NULL; +#endif ///< _WIN32 + m_shrMem = NULL; + m_dataPool = NULL; + m_itemSize = 0; + m_itemCnt = 0; + } + + if (m_protectRW) + { + m_protectRW = false; + if (m_writeSem) + { + m_writeSem->release(); + + delete m_writeSem; + m_writeSem = NULL; + } + + if (m_readSem) + { + m_readSem->release(); + + delete m_readSem; + m_readSem = NULL; + } + } + + } + } + + ///< data read + bool RingMem::readNext(void* dst, fnRWSharedData callback) + { + if (!m_initialized || !callback || !dst) + { + return false; + } + + if (m_protectRW) + { + if (!m_readSem->take()) + { + return false; + } + } + + int32_t index = ATOMIC_ADD(&m_shrMem->m_read, 1) % m_itemCnt; + (*callback)(dst, reinterpret_cast<uint8_t *>(m_dataPool) + index * m_itemSize, m_itemSize); + + if (m_protectRW) + { + m_writeSem->give(1); + } + + return true; + } + ///< data write + bool RingMem::writeData(void *data, fnRWSharedData callback) + { + if (!m_initialized || !data || !callback) + { + return false; + } + + if (m_protectRW) + { + if (!m_writeSem->take()) + { + return false; + } + } + + int32_t index = ATOMIC_ADD(&m_shrMem->m_write, 1) % m_itemCnt; + (*callback)(reinterpret_cast<uint8_t *>(m_dataPool) + index * m_itemSize, data, m_itemSize); + + 
if (m_protectRW) + { + m_readSem->give(1); + } + + return true; + } +}
View file
x265_3.6.tar.gz/source/common/ringmem.h
Added
@@ -0,0 +1,90 @@ +/***************************************************************************** + * Copyright (C) 2013-2017 MulticoreWare, Inc + * + * Authors: liwei <liwei@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com + *****************************************************************************/ + +#ifndef X265_RINGMEM_H +#define X265_RINGMEM_H + +#include "common.h" +#include "threading.h" + +#if _MSC_VER +#define snprintf _snprintf +#define strdup _strdup +#endif + +namespace X265_NS { + +#define MAX_SHR_NAME_LEN 256 + + class RingMem { + public: + RingMem(); + ~RingMem(); + + bool skipRead(int32_t cnt); + + bool skipWrite(int32_t cnt); + + ///< initialize + ///< protectRW: if use the semaphore the protect the write and read operation. + bool init(int32_t itemSize, int32_t itemCnt, const char *name, bool protectRW = false); + ///< finalize + void release(); + + typedef void(*fnRWSharedData)(void *dst, void *src, int32_t size); + + ///< data read + bool readNext(void* dst, fnRWSharedData callback); + ///< data write + bool writeData(void *data, fnRWSharedData callback); + + private: + bool m_initialized; + bool m_protectRW; + + int32_t m_itemSize; + int32_t m_itemCnt; + ///< data pool + void *m_dataPool; + typedef struct { + ///< index to write + int32_t m_write; + ///< index to read + int32_t m_read; + + }ShrMemCtrl; + + ShrMemCtrl *m_shrMem; +#ifdef _WIN32 + void *m_handle; +#else // _WIN32 + char *m_filepath; +#endif // _WIN32 + + ///< Semaphores + NamedSemaphore *m_writeSem; + NamedSemaphore *m_readSem; + }; +}; + +#endif // ifndef X265_RINGMEM_H
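Usage sketch (editor's note; the record type and pool name are hypothetical, not from the upstream tree): RingMem maps a named shared-memory pool, optionally guarded by two named semaphores, and moves one fixed-size item per call through a caller-supplied copy callback, roughly as follows:

    // sketch only; a real caller would include <cstring> for memcpy
    static void copyItem(void* dst, void* src, int32_t size) { memcpy(dst, src, size); }

    RingMem ring;
    if (ring.init(sizeof(MyRecord), 16, "statshare", true)) // 16 slots, reader/writer protected
    {
        MyRecord rec;
        ring.writeData(&rec, copyItem); // producer side: waits on the writer semaphore when full
        ring.readNext(&rec, copyItem);  // consumer side: waits on the reader semaphore when empty
        ring.release();
    }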
View file
x265_3.5.tar.gz/source/common/slice.h -> x265_3.6.tar.gz/source/common/slice.h
Changed
@@ -156,9 +156,9 @@ HRDInfo hrdParameters; ProfileTierLevel ptl; uint32_t maxTempSubLayers; - uint32_t numReorderPics; - uint32_t maxDecPicBuffering; - uint32_t maxLatencyIncrease; + uint32_t numReorderPics[MAX_T_LAYERS]; + uint32_t maxDecPicBuffering[MAX_T_LAYERS]; + uint32_t maxLatencyIncrease[MAX_T_LAYERS]; }; struct Window @@ -235,9 +235,9 @@ uint32_t maxAMPDepth; uint32_t maxTempSubLayers; // max number of Temporal Sub layers - uint32_t maxDecPicBuffering; // these are dups of VPS values - uint32_t maxLatencyIncrease; - int numReorderPics; + uint32_t maxDecPicBuffering[MAX_T_LAYERS]; // these are dups of VPS values + uint32_t maxLatencyIncrease[MAX_T_LAYERS]; + int numReorderPics[MAX_T_LAYERS]; RPS spsrps[MAX_NUM_SHORT_TERM_RPS]; int spsrpsNum; @@ -363,6 +363,7 @@ int m_iNumRPSInSPS; const x265_param *m_param; int m_fieldNum; + Frame* m_mcstfRefFrameList[2][MAX_MCSTF_TEMPORAL_WINDOW_LENGTH]; Slice() {
View file
x265_3.6.tar.gz/source/common/temporalfilter.cpp
Added
@@ -0,0 +1,1017 @@ +/***************************************************************************** +* Copyright (C) 2013-2021 MulticoreWare, Inc +* + * Authors: Ashok Kumar Mishra <ashok@multicorewareinc.com> + * +* This program is free software; you can redistribute it and/or modify +* it under the terms of the GNU General Public License as published by +* the Free Software Foundation; either version 2 of the License, or +* (at your option) any later version. +* +* This program is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +* GNU General Public License for more details. +* +* You should have received a copy of the GNU General Public License +* along with this program; if not, write to the Free Software +* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. +* +* This program is also available under a commercial proprietary license. +* For more information, contact us at license @ x265.com. +*****************************************************************************/ +#include "common.h" +#include "temporalfilter.h" +#include "primitives.h" + +#include "frame.h" +#include "slice.h" +#include "framedata.h" +#include "analysis.h" + +using namespace X265_NS; + +void OrigPicBuffer::addPicture(Frame* inFrame) +{ + m_mcstfPicList.pushFrontMCSTF(*inFrame); +} + +void OrigPicBuffer::addEncPicture(Frame* inFrame) +{ + m_mcstfOrigPicFreeList.pushFrontMCSTF(*inFrame); +} + +void OrigPicBuffer::addEncPictureToPicList(Frame* inFrame) +{ + m_mcstfOrigPicList.pushFrontMCSTF(*inFrame); +} + +OrigPicBuffer::~OrigPicBuffer() +{ + while (!m_mcstfOrigPicList.empty()) + { + Frame* curFrame = m_mcstfOrigPicList.popBackMCSTF(); + curFrame->destroy(); + delete curFrame; + } + + while (!m_mcstfOrigPicFreeList.empty()) + { + Frame* curFrame = m_mcstfOrigPicFreeList.popBackMCSTF(); + curFrame->destroy(); + delete curFrame; + } +} + +void OrigPicBuffer::setOrigPicList(Frame* inFrame, int frameCnt) +{ + Slice* slice = inFrame->m_encData->m_slice; + uint8_t j = 0; + for (int iterPOC = (inFrame->m_poc - inFrame->m_mcstf->m_range); + iterPOC <= (inFrame->m_poc + inFrame->m_mcstf->m_range); iterPOC++) + { + if (iterPOC != inFrame->m_poc) + { + if (iterPOC < 0) + continue; + if (iterPOC >= frameCnt) + break; + + Frame *iterFrame = m_mcstfPicList.getPOCMCSTF(iterPOC); + X265_CHECK(iterFrame, "Reference frame not found in OPB"); + if (iterFrame != NULL) + { + slice->m_mcstfRefFrameList1j = iterFrame; + iterFrame->m_refPicCnt1--; + } + + iterFrame = m_mcstfOrigPicList.getPOCMCSTF(iterPOC); + if (iterFrame != NULL) + { + + slice->m_mcstfRefFrameList1j = iterFrame; + + iterFrame->m_refPicCnt1--; + Frame *cFrame = m_mcstfOrigPicList.getPOCMCSTF(inFrame->m_poc); + X265_CHECK(cFrame, "Reference frame not found in encoded OPB"); + cFrame->m_refPicCnt1--; + } + j++; + } + } +} + +void OrigPicBuffer::recycleOrigPicList() +{ + Frame *iterFrame = m_mcstfPicList.first(); + + while (iterFrame) + { + Frame *curFrame = iterFrame; + iterFrame = iterFrame->m_nextMCSTF; + if (!curFrame->m_refPicCnt1) + { + m_mcstfPicList.removeMCSTF(*curFrame); + iterFrame = m_mcstfPicList.first(); + } + } + + iterFrame = m_mcstfOrigPicList.first(); + + while (iterFrame) + { + Frame *curFrame = iterFrame; + iterFrame = iterFrame->m_nextMCSTF; + if (!curFrame->m_refPicCnt1) + { + m_mcstfOrigPicList.removeMCSTF(*curFrame); + *curFrame->m_isSubSampled = false; + 
m_mcstfOrigPicFreeList.pushFrontMCSTF(*curFrame); + iterFrame = m_mcstfOrigPicList.first(); + } + } +} + +void OrigPicBuffer::addPictureToFreelist(Frame* inFrame) +{ + m_mcstfOrigPicFreeList.pushBack(*inFrame); +} + +TemporalFilter::TemporalFilter() +{ + m_sourceWidth = 0; + m_sourceHeight = 0, + m_QP = 0; + m_sliceTypeConfig = 3; + m_numRef = 0; + m_useSADinME = 1; + + m_range = 2; + m_chromaFactor = 0.55; + m_sigmaMultiplier = 9.0; + m_sigmaZeroPoint = 10.0; + m_motionVectorFactor = 16; +} + +void TemporalFilter::init(const x265_param* param) +{ + m_param = param; + m_bitDepth = param->internalBitDepth; + m_sourceWidth = param->sourceWidth; + m_sourceHeight = param->sourceHeight; + m_internalCsp = param->internalCsp; + m_numComponents = (m_internalCsp != X265_CSP_I400) ? MAX_NUM_COMPONENT : 1; + + m_metld = new MotionEstimatorTLD; + + predPUYuv.create(FENC_STRIDE, X265_CSP_I400); +} + +int TemporalFilter::createRefPicInfo(TemporalFilterRefPicInfo* refFrame, x265_param* param) +{ + CHECKED_MALLOC_ZERO(refFrame->mvs, MV, sizeof(MV)* ((m_sourceWidth ) / 4) * ((m_sourceHeight ) / 4)); + refFrame->mvsStride = m_sourceWidth / 4; + CHECKED_MALLOC_ZERO(refFrame->mvs0, MV, sizeof(MV)* ((m_sourceWidth ) / 16) * ((m_sourceHeight ) / 16)); + refFrame->mvsStride0 = m_sourceWidth / 16; + CHECKED_MALLOC_ZERO(refFrame->mvs1, MV, sizeof(MV)* ((m_sourceWidth ) / 16) * ((m_sourceHeight ) / 16)); + refFrame->mvsStride1 = m_sourceWidth / 16; + CHECKED_MALLOC_ZERO(refFrame->mvs2, MV, sizeof(MV)* ((m_sourceWidth ) / 16)*((m_sourceHeight ) / 16)); + refFrame->mvsStride2 = m_sourceWidth / 16; + + CHECKED_MALLOC_ZERO(refFrame->noise, int, sizeof(int) * ((m_sourceWidth) / 4) * ((m_sourceHeight) / 4)); + CHECKED_MALLOC_ZERO(refFrame->error, int, sizeof(int) * ((m_sourceWidth) / 4) * ((m_sourceHeight) / 4)); + + refFrame->slicetype = X265_TYPE_AUTO; + + refFrame->compensatedPic = new PicYuv; + refFrame->compensatedPic->create(param, true); + + return 1; +fail: + return 0; +} + +int TemporalFilter::motionErrorLumaSAD( + PicYuv *orig, + PicYuv *buffer, + int x, + int y, + int dx, + int dy, + int bs, + int besterror) +{ + + pixel* origOrigin = orig->m_picOrg0; + intptr_t origStride = orig->m_stride; + pixel *buffOrigin = buffer->m_picOrg0; + intptr_t buffStride = buffer->m_stride; + int error = 0;// dx * 10 + dy * 10; + if (((dx | dy) & 0xF) == 0) + { + dx /= m_motionVectorFactor; + dy /= m_motionVectorFactor; + + const pixel* bufferRowStart = buffOrigin + (y + dy) * buffStride + (x + dx); +#if 0 + const pixel* origRowStart = origOrigin + y *origStride + x; + + for (int y1 = 0; y1 < bs; y1++) + { + for (int x1 = 0; x1 < bs; x1++) + { + int diff = origRowStartx1 - bufferRowStartx1; + error += abs(diff); + } + + origRowStart += origStride; + bufferRowStart += buffStride; + } +#else + int partEnum = partitionFromSizes(bs, bs); + /* copy PU block into cache */ + primitives.pupartEnum.copy_pp(predPUYuv.m_buf0, FENC_STRIDE, bufferRowStart, buffStride); + + error = m_metld->me.bufSAD(predPUYuv.m_buf0, FENC_STRIDE); +#endif + if (error > besterror) + { + return error; + } + } + else + { + const int *xFilter = s_interpolationFilterdx & 0xF; + const int *yFilter = s_interpolationFilterdy & 0xF; + int tempArray64 + 864; + + int iSum, iBase; + for (int y1 = 1; y1 < bs + 7; y1++) + { + const int yOffset = y + y1 + (dy >> 4) - 3; + const pixel *sourceRow = buffOrigin + (yOffset)*buffStride + 0; + for (int x1 = 0; x1 < bs; x1++) + { + iSum = 0; + iBase = x + x1 + (dx >> 4) - 3; + const pixel *rowStart = sourceRow + iBase; + + iSum += 
xFilter1 * rowStart1; + iSum += xFilter2 * rowStart2; + iSum += xFilter3 * rowStart3; + iSum += xFilter4 * rowStart4; + iSum += xFilter5 * rowStart5; + iSum += xFilter6 * rowStart6; + + tempArrayy1x1 = iSum; + } + } + + const pixel maxSampleValue = (1 << m_bitDepth) - 1; + for (int y1 = 0; y1 < bs; y1++) + { + const pixel *origRow = origOrigin + (y + y1)*origStride + 0; + for (int x1 = 0; x1 < bs; x1++) + { + iSum = 0; + iSum += yFilter1 * tempArrayy1 + 1x1; + iSum += yFilter2 * tempArrayy1 + 2x1; + iSum += yFilter3 * tempArrayy1 + 3x1; + iSum += yFilter4 * tempArrayy1 + 4x1; + iSum += yFilter5 * tempArrayy1 + 5x1; + iSum += yFilter6 * tempArrayy1 + 6x1; + + iSum = (iSum + (1 << 11)) >> 12; + iSum = iSum < 0 ? 0 : (iSum > maxSampleValue ? maxSampleValue : iSum); + + error += abs(iSum - origRowx + x1); + } + if (error > besterror) + { + return error; + } + } + } + return error; +} + +int TemporalFilter::motionErrorLumaSSD( + PicYuv *orig, + PicYuv *buffer, + int x, + int y, + int dx, + int dy, + int bs, + int besterror) +{ + + pixel* origOrigin = orig->m_picOrg0; + intptr_t origStride = orig->m_stride; + pixel *buffOrigin = buffer->m_picOrg0; + intptr_t buffStride = buffer->m_stride; + int error = 0;// dx * 10 + dy * 10; + if (((dx | dy) & 0xF) == 0) + { + dx /= m_motionVectorFactor; + dy /= m_motionVectorFactor; + + const pixel* bufferRowStart = buffOrigin + (y + dy) * buffStride + (x + dx); +#if 0 + const pixel* origRowStart = origOrigin + y * origStride + x; + + for (int y1 = 0; y1 < bs; y1++) + { + for (int x1 = 0; x1 < bs; x1++) + { + int diff = origRowStartx1 - bufferRowStartx1; + error += diff * diff; + } + + origRowStart += origStride; + bufferRowStart += buffStride; + } +#else + int partEnum = partitionFromSizes(bs, bs); + /* copy PU block into cache */ + primitives.pupartEnum.copy_pp(predPUYuv.m_buf0, FENC_STRIDE, bufferRowStart, buffStride); + + error = (int)primitives.cupartEnum.sse_pp(m_metld->me.fencPUYuv.m_buf0, FENC_STRIDE, predPUYuv.m_buf0, FENC_STRIDE); + +#endif + if (error > besterror) + { + return error; + } + } + else + { + const int *xFilter = s_interpolationFilterdx & 0xF; + const int *yFilter = s_interpolationFilterdy & 0xF; + int tempArray64 + 864; + + int iSum, iBase; + for (int y1 = 1; y1 < bs + 7; y1++) + { + const int yOffset = y + y1 + (dy >> 4) - 3; + const pixel *sourceRow = buffOrigin + (yOffset)*buffStride + 0; + for (int x1 = 0; x1 < bs; x1++) + { + iSum = 0; + iBase = x + x1 + (dx >> 4) - 3; + const pixel *rowStart = sourceRow + iBase; + + iSum += xFilter1 * rowStart1; + iSum += xFilter2 * rowStart2; + iSum += xFilter3 * rowStart3; + iSum += xFilter4 * rowStart4; + iSum += xFilter5 * rowStart5; + iSum += xFilter6 * rowStart6; + + tempArrayy1x1 = iSum; + } + } + + const pixel maxSampleValue = (1 << m_bitDepth) - 1; + for (int y1 = 0; y1 < bs; y1++) + { + const pixel *origRow = origOrigin + (y + y1)*origStride + 0; + for (int x1 = 0; x1 < bs; x1++) + { + iSum = 0; + iSum += yFilter1 * tempArrayy1 + 1x1; + iSum += yFilter2 * tempArrayy1 + 2x1; + iSum += yFilter3 * tempArrayy1 + 3x1; + iSum += yFilter4 * tempArrayy1 + 4x1; + iSum += yFilter5 * tempArrayy1 + 5x1; + iSum += yFilter6 * tempArrayy1 + 6x1; + + iSum = (iSum + (1 << 11)) >> 12; + iSum = iSum < 0 ? 0 : (iSum > maxSampleValue ? 
maxSampleValue : iSum); + + error += (iSum - origRowx + x1) * (iSum - origRowx + x1); + } + if (error > besterror) + { + return error; + } + } + } + return error; +} + +void TemporalFilter::applyMotion(MV *mvs, uint32_t mvsStride, PicYuv *input, PicYuv *output) +{ + static const int lumaBlockSize = 8; + int srcStride = 0; + int dstStride = 0; + int csx = 0, csy = 0; + for (int c = 0; c < m_numComponents; c++) + { + const pixel maxValue = (1 << X265_DEPTH) - 1; + + const pixel *pSrcImage = input->m_picOrgc; + pixel *pDstImage = output->m_picOrgc; + + if (!c) + { + srcStride = (int)input->m_stride; + dstStride = (int)output->m_stride; + } + else + { + srcStride = (int)input->m_strideC; + dstStride = (int)output->m_strideC; + csx = CHROMA_H_SHIFT(m_internalCsp); + csy = CHROMA_V_SHIFT(m_internalCsp); + } + const int blockSizeX = lumaBlockSize >> csx; + const int blockSizeY = lumaBlockSize >> csy; + const int height = input->m_picHeight >> csy; + const int width = input->m_picWidth >> csx; + + for (int y = 0, blockNumY = 0; y + blockSizeY <= height; y += blockSizeY, blockNumY++) + { + for (int x = 0, blockNumX = 0; x + blockSizeX <= width; x += blockSizeX, blockNumX++) + { + int mvIdx = blockNumY * mvsStride + blockNumX; + const MV &mv = mvsmvIdx; + const int dx = mv.x >> csx; + const int dy = mv.y >> csy; + const int xInt = mv.x >> (4 + csx); + const int yInt = mv.y >> (4 + csy); + + const int *xFilter = s_interpolationFilterdx & 0xf; + const int *yFilter = s_interpolationFilterdy & 0xf; // will add 6 bit. + const int numFilterTaps = 7; + const int centreTapOffset = 3; + + int tempArraylumaBlockSize + numFilterTapslumaBlockSize; + + for (int by = 1; by < blockSizeY + numFilterTaps; by++) + { + const int yOffset = y + by + yInt - centreTapOffset; + const pixel *sourceRow = pSrcImage + yOffset * srcStride; + for (int bx = 0; bx < blockSizeX; bx++) + { + int iBase = x + bx + xInt - centreTapOffset; + const pixel *rowStart = sourceRow + iBase; + + int iSum = 0; + iSum += xFilter1 * rowStart1; + iSum += xFilter2 * rowStart2; + iSum += xFilter3 * rowStart3; + iSum += xFilter4 * rowStart4; + iSum += xFilter5 * rowStart5; + iSum += xFilter6 * rowStart6; + + tempArraybybx = iSum; + } + } + + pixel *pDstRow = pDstImage + y * dstStride; + for (int by = 0; by < blockSizeY; by++, pDstRow += dstStride) + { + pixel *pDstPel = pDstRow + x; + for (int bx = 0; bx < blockSizeX; bx++, pDstPel++) + { + int iSum = 0; + + iSum += yFilter1 * tempArrayby + 1bx; + iSum += yFilter2 * tempArrayby + 2bx; + iSum += yFilter3 * tempArrayby + 3bx; + iSum += yFilter4 * tempArrayby + 4bx; + iSum += yFilter5 * tempArrayby + 5bx; + iSum += yFilter6 * tempArrayby + 6bx; + + iSum = (iSum + (1 << 11)) >> 12; + iSum = iSum < 0 ? 0 : (iSum > maxValue ? 
maxValue : iSum); + *pDstPel = (pixel)iSum; + } + } + } + } + } +} + +void TemporalFilter::bilateralFilter(Frame* frame, + TemporalFilterRefPicInfo* m_mcstfRefList, + double overallStrength) +{ + + const int numRefs = frame->m_mcstf->m_numRef; + + for (int i = 0; i < numRefs; i++) + { + TemporalFilterRefPicInfo *ref = &m_mcstfRefListi; + applyMotion(m_mcstfRefListi.mvs, m_mcstfRefListi.mvsStride, m_mcstfRefListi.picBuffer, ref->compensatedPic); + } + + int refStrengthRow = 2; + if (numRefs == m_range * 2) + { + refStrengthRow = 0; + } + else if (numRefs == m_range) + { + refStrengthRow = 1; + } + + const double lumaSigmaSq = (m_QP - m_sigmaZeroPoint) * (m_QP - m_sigmaZeroPoint) * m_sigmaMultiplier; + const double chromaSigmaSq = 30 * 30; + + PicYuv* orgPic = frame->m_fencPic; + + for (int c = 0; c < m_numComponents; c++) + { + int height, width; + pixel *srcPelRow = NULL; + intptr_t srcStride, correctedPicsStride = 0; + + if (!c) + { + height = orgPic->m_picHeight; + width = orgPic->m_picWidth; + srcPelRow = orgPic->m_picOrgc; + srcStride = orgPic->m_stride; + } + else + { + int csx = CHROMA_H_SHIFT(m_internalCsp); + int csy = CHROMA_V_SHIFT(m_internalCsp); + + height = orgPic->m_picHeight >> csy; + width = orgPic->m_picWidth >> csx; + srcPelRow = orgPic->m_picOrgc; + srcStride = (int)orgPic->m_strideC; + } + + const double sigmaSq = (!c) ? lumaSigmaSq : chromaSigmaSq; + const double weightScaling = overallStrength * ( (!c) ? 0.4 : m_chromaFactor); + + const double maxSampleValue = (1 << m_bitDepth) - 1; + const double bitDepthDiffWeighting = 1024.0 / (maxSampleValue + 1); + + const int blkSize = (!c) ? 8 : 4; + + for (int y = 0; y < height; y++, srcPelRow += srcStride) + { + pixel *srcPel = srcPelRow; + + for (int x = 0; x < width; x++, srcPel++) + { + const int orgVal = (int)*srcPel; + double temporalWeightSum = 1.0; + double newVal = (double)orgVal; + + if ((y % blkSize == 0) && (x % blkSize == 0)) + { + for (int i = 0; i < numRefs; i++) + { + TemporalFilterRefPicInfo *refPicInfo = &m_mcstfRefListi; + + if (!c) + correctedPicsStride = refPicInfo->compensatedPic->m_stride; + else + correctedPicsStride = refPicInfo->compensatedPic->m_strideC; + + double variance = 0, diffsum = 0; + for (int y1 = 0; y1 < blkSize - 1; y1++) + { + for (int x1 = 0; x1 < blkSize - 1; x1++) + { + int pix = *(srcPel + x1); + int pixR = *(srcPel + x1 + 1); + int pixD = *(srcPel + x1 + srcStride); + + int ref = *(refPicInfo->compensatedPic->m_picOrgc + ((y + y1) * correctedPicsStride + x + x1)); + int refR = *(refPicInfo->compensatedPic->m_picOrgc + ((y + y1) * correctedPicsStride + x + x1 + 1)); + int refD = *(refPicInfo->compensatedPic->m_picOrgc + ((y + y1 + 1) * correctedPicsStride + x + x1)); + + int diff = pix - ref; + int diffR = pixR - refR; + int diffD = pixD - refD; + + variance += diff * diff; + diffsum += (diffR - diff) * (diffR - diff); + diffsum += (diffD - diff) * (diffD - diff); + } + } + + refPicInfo->noise(y / blkSize) * refPicInfo->mvsStride + (x / blkSize) = (int)round((300 * variance + 50) / (10 * diffsum + 50)); + } + } + + double minError = 9999999; + for (int i = 0; i < numRefs; i++) + { + TemporalFilterRefPicInfo *refPicInfo = &m_mcstfRefListi; + minError = X265_MIN(minError, (double)refPicInfo->error(y / blkSize) * refPicInfo->mvsStride + (x / blkSize)); + } + + for (int i = 0; i < numRefs; i++) + { + TemporalFilterRefPicInfo *refPicInfo = &m_mcstfRefListi; + + const int error = refPicInfo->error(y / blkSize) * refPicInfo->mvsStride + (x / blkSize); + const int noise = refPicInfo->noise(y 
/ blkSize) * refPicInfo->mvsStride + (x / blkSize); + + const pixel *pCorrectedPelPtr = refPicInfo->compensatedPic->m_picOrgc + (y * correctedPicsStride + x); + const int refVal = (int)*pCorrectedPelPtr; + double diff = (double)(refVal - orgVal); + diff *= bitDepthDiffWeighting; + double diffSq = diff * diff; + + const int index = X265_MIN(3, std::abs(refPicInfo->origOffset) - 1); + double ww = 1, sw = 1; + ww *= (noise < 25) ? 1 : 1.2; + sw *= (noise < 25) ? 1.3 : 0.8; + ww *= (error < 50) ? 1.2 : ((error > 100) ? 0.8 : 1); + sw *= (error < 50) ? 1.3 : 1; + ww *= ((minError + 1) / (error + 1)); + const double weight = weightScaling * s_refStrengthsrefStrengthRowindex * ww * exp(-diffSq / (2 * sw * sigmaSq)); + + newVal += weight * refVal; + temporalWeightSum += weight; + } + newVal /= temporalWeightSum; + double sampleVal = round(newVal); + sampleVal = (sampleVal < 0 ? 0 : (sampleVal > maxSampleValue ? maxSampleValue : sampleVal)); + *srcPel = (pixel)sampleVal; + } + } + } +} + +void TemporalFilter::motionEstimationLuma(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int blockSize, + MV *previous, uint32_t prevMvStride, int factor) +{ + + int range = 5; + + + const int stepSize = blockSize; + + const int origWidth = orig->m_picWidth; + const int origHeight = orig->m_picHeight; + + int error; + + for (int blockY = 0; blockY + blockSize <= origHeight; blockY += stepSize) + { + for (int blockX = 0; blockX + blockSize <= origWidth; blockX += stepSize) + { + const intptr_t pelOffset = blockY * orig->m_stride + blockX; + m_metld->me.setSourcePU(orig->m_picOrg0, orig->m_stride, pelOffset, blockSize, blockSize, X265_HEX_SEARCH, 1); + + + MV best(0, 0); + int leastError = INT_MAX; + + if (previous == NULL) + { + range = 8; + } + else + { + + for (int py = -1; py <= 1; py++) + { + int testy = blockY / (2 * blockSize) + py; + + for (int px = -1; px <= 1; px++) + { + + int testx = blockX / (2 * blockSize) + px; + if ((testx >= 0) && (testx < origWidth / (2 * blockSize)) && (testy >= 0) && (testy < origHeight / (2 * blockSize))) + { + int mvIdx = testy * prevMvStride + testx; + MV old = previousmvIdx; + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError); + + if (error < leastError) + { + best.set(old.x * factor, old.y * factor); + leastError = error; + } + } + } + } + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError); + + if (error < leastError) + { + best.set(0, 0); + leastError = error; + } + + } + + MV prevBest = best; + for (int y2 = prevBest.y / m_motionVectorFactor - range; y2 <= prevBest.y / m_motionVectorFactor + range; y2++) + { + for (int x2 = prevBest.x / m_motionVectorFactor - range; x2 <= prevBest.x / m_motionVectorFactor + range; x2++) + { + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError); + if (error < leastError) + { + best.set(x2 * m_motionVectorFactor, y2 * m_motionVectorFactor); + leastError = error; + } + } + } + + if (blockY > 0) + { + int idx = ((blockY - stepSize) / 
stepSize) * mvStride + (blockX / stepSize); + MV aboveMV = mvsidx; + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError); + + if (error < leastError) + { + best.set(aboveMV.x, aboveMV.y); + leastError = error; + } + } + + if (blockX > 0) + { + int idx = ((blockY / stepSize) * mvStride + (blockX - stepSize) / stepSize); + MV leftMV = mvsidx; + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError); + + if (error < leastError) + { + best.set(leftMV.x, leftMV.y); + leastError = error; + } + } + + // calculate average + double avg = 0.0; + for (int x1 = 0; x1 < blockSize; x1++) + { + for (int y1 = 0; y1 < blockSize; y1++) + { + avg = avg + *(orig->m_picOrg0 + (blockX + x1 + orig->m_stride * (blockY + y1))); + } + } + avg = avg / (blockSize * blockSize); + + // calculate variance + double variance = 0; + for (int x1 = 0; x1 < blockSize; x1++) + { + for (int y1 = 0; y1 < blockSize; y1++) + { + int pix = *(orig->m_picOrg0 + (blockX + x1 + orig->m_stride * (blockY + y1))); + variance = variance + (pix - avg) * (pix - avg); + } + } + + leastError = (int)(20 * ((leastError + 5.0) / (variance + 5.0)) + (leastError / (blockSize * blockSize)) / 50); + + int mvIdx = (blockY / stepSize) * mvStride + (blockX / stepSize); + mvsmvIdx = best; + } + } +} + + +void TemporalFilter::motionEstimationLumaDoubleRes(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int blockSize, + MV *previous, uint32_t prevMvStride, int factor, int* minError) +{ + + int range = 0; + + + const int stepSize = blockSize; + + const int origWidth = orig->m_picWidth; + const int origHeight = orig->m_picHeight; + + int error; + + for (int blockY = 0; blockY + blockSize <= origHeight; blockY += stepSize) + { + for (int blockX = 0; blockX + blockSize <= origWidth; blockX += stepSize) + { + + const intptr_t pelOffset = blockY * orig->m_stride + blockX; + m_metld->me.setSourcePU(orig->m_picOrg0, orig->m_stride, pelOffset, blockSize, blockSize, X265_HEX_SEARCH, 1); + + MV best(0, 0); + int leastError = INT_MAX; + + if (previous == NULL) + { + range = 8; + } + else + { + + for (int py = -1; py <= 1; py++) + { + int testy = blockY / (2 * blockSize) + py; + + for (int px = -1; px <= 1; px++) + { + + int testx = blockX / (2 * blockSize) + px; + if ((testx >= 0) && (testx < origWidth / (2 * blockSize)) && (testy >= 0) && (testy < origHeight / (2 * blockSize))) + { + int mvIdx = testy * prevMvStride + testx; + MV old = previousmvIdx; + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, old.x * factor, old.y * factor, blockSize, leastError); + + if (error < leastError) + { + best.set(old.x * factor, old.y * factor); + leastError = error; + } + } + } + } + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, 0, 0, blockSize, leastError); + + if (error < leastError) + { + best.set(0, 0); + leastError = error; + } + + } + + MV prevBest = best; + for (int y2 = prevBest.y / m_motionVectorFactor - range; y2 <= prevBest.y / 
m_motionVectorFactor + range; y2++) + { + for (int x2 = prevBest.x / m_motionVectorFactor - range; x2 <= prevBest.x / m_motionVectorFactor + range; x2++) + { + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2 * m_motionVectorFactor, y2 * m_motionVectorFactor, blockSize, leastError); + + if (error < leastError) + { + best.set(x2 * m_motionVectorFactor, y2 * m_motionVectorFactor); + leastError = error; + } + } + } + + prevBest = best; + int doubleRange = 3 * 4; + for (int y2 = prevBest.y - doubleRange; y2 <= prevBest.y + doubleRange; y2 += 4) + { + for (int x2 = prevBest.x - doubleRange; x2 <= prevBest.x + doubleRange; x2 += 4) + { + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError); + + if (error < leastError) + { + best.set(x2, y2); + leastError = error; + } + } + } + + prevBest = best; + doubleRange = 3; + for (int y2 = prevBest.y - doubleRange; y2 <= prevBest.y + doubleRange; y2++) + { + for (int x2 = prevBest.x - doubleRange; x2 <= prevBest.x + doubleRange; x2++) + { + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, x2, y2, blockSize, leastError); + + if (error < leastError) + { + best.set(x2, y2); + leastError = error; + } + } + } + + + if (blockY > 0) + { + int idx = ((blockY - stepSize) / stepSize) * mvStride + (blockX / stepSize); + MV aboveMV = mvsidx; + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, aboveMV.x, aboveMV.y, blockSize, leastError); + + if (error < leastError) + { + best.set(aboveMV.x, aboveMV.y); + leastError = error; + } + } + + if (blockX > 0) + { + int idx = ((blockY / stepSize) * mvStride + (blockX - stepSize) / stepSize); + MV leftMV = mvsidx; + + if (m_useSADinME) + error = motionErrorLumaSAD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError); + else + error = motionErrorLumaSSD(orig, buffer, blockX, blockY, leftMV.x, leftMV.y, blockSize, leastError); + + if (error < leastError) + { + best.set(leftMV.x, leftMV.y); + leastError = error; + } + } + + // calculate average + double avg = 0.0; + for (int x1 = 0; x1 < blockSize; x1++) + { + for (int y1 = 0; y1 < blockSize; y1++) + { + avg = avg + *(orig->m_picOrg0 + (blockX + x1 + orig->m_stride * (blockY + y1))); + } + } + avg = avg / (blockSize * blockSize); + + // calculate variance + double variance = 0; + for (int x1 = 0; x1 < blockSize; x1++) + { + for (int y1 = 0; y1 < blockSize; y1++) + { + int pix = *(orig->m_picOrg0 + (blockX + x1 + orig->m_stride * (blockY + y1))); + variance = variance + (pix - avg) * (pix - avg); + } + } + + leastError = (int)(20 * ((leastError + 5.0) / (variance + 5.0)) + (leastError / (blockSize * blockSize)) / 50); + + int mvIdx = (blockY / stepSize) * mvStride + (blockX / stepSize); + mvsmvIdx = best; + minErrormvIdx = leastError; + } + } +} + +void TemporalFilter::destroyRefPicInfo(TemporalFilterRefPicInfo* curFrame) +{ + if (curFrame) + { + if (curFrame->compensatedPic) + { + curFrame->compensatedPic->destroy(); + delete curFrame->compensatedPic; + } + + if 
(curFrame->mvs) + X265_FREE(curFrame->mvs); + if (curFrame->mvs0) + X265_FREE(curFrame->mvs0); + if (curFrame->mvs1) + X265_FREE(curFrame->mvs1); + if (curFrame->mvs2) + X265_FREE(curFrame->mvs2); + if (curFrame->noise) + X265_FREE(curFrame->noise); + if (curFrame->error) + X265_FREE(curFrame->error); + } +}
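The weighting applied in bilateralFilter() above can be read as a standalone formula: each motion-compensated reference sample contributes a weight weightScaling * s_refStrengths[row][index] * ww * exp(-diffSq / (2 * sw * sigmaSq)), where ww and sw are nudged by the per-block noise and matching error. The following is only a minimal C++ sketch of that arithmetic with illustrative constants (the 0.4 luma scaling and the noise/error thresholds are taken from the diff; sigma and strength values here are made up), not the encoder's API:

#include <cmath>
#include <cstdio>

// Sketch of the per-sample temporal weight used by the MCTF bilateral filter.
static double temporalWeight(int orgVal, int refVal, int noise, int error,
                             int minError, double refStrength,
                             double overallStrength, double sigmaSq,
                             double bitDepthDiffWeighting)
{
    double diff   = (refVal - orgVal) * bitDepthDiffWeighting;
    double diffSq = diff * diff;

    double ww = 1.0, sw = 1.0;
    ww *= (noise < 25) ? 1.0 : 1.2;                       // noisier blocks weigh a bit more
    sw *= (noise < 25) ? 1.3 : 0.8;                       // ...but with a narrower kernel
    ww *= (error < 50) ? 1.2 : ((error > 100) ? 0.8 : 1.0);
    sw *= (error < 50) ? 1.3 : 1.0;
    ww *= (minError + 1.0) / (error + 1.0);               // favour the best-matching reference

    double weightScaling = overallStrength * 0.4;         // luma case
    return weightScaling * refStrength * ww * std::exp(-diffSq / (2.0 * sw * sigmaSq));
}

int main()
{
    // One filtered 8-bit sample blended with a single reference sample.
    double sigmaSq = (37 - 45) * (37 - 45) * 0.8;         // illustrative QP-derived sigma
    double w = temporalWeight(120, 123, 20, 40, 40, 0.3, 0.95, sigmaSq, 1024.0 / 256.0);
    double newVal = (120 + w * 123) / (1.0 + w);
    std::printf("weight=%.4f filtered=%.2f\n", w, newVal);
    return 0;
}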
View file
x265_3.6.tar.gz/source/common/temporalfilter.h
Added
@@ -0,0 +1,185 @@ +/***************************************************************************** +* Copyright (C) 2013-2021 MulticoreWare, Inc +* + * Authors: Ashok Kumar Mishra <ashok@multicorewareinc.com> + * +* This program is free software; you can redistribute it and/or modify +* it under the terms of the GNU General Public License as published by +* the Free Software Foundation; either version 2 of the License, or +* (at your option) any later version. +* +* This program is distributed in the hope that it will be useful, +* but WITHOUT ANY WARRANTY; without even the implied warranty of +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +* GNU General Public License for more details. +* +* You should have received a copy of the GNU General Public License +* along with this program; if not, write to the Free Software +* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. +* +* This program is also available under a commercial proprietary license. +* For more information, contact us at license @ x265.com. +*****************************************************************************/ + +#ifndef X265_TEMPORAL_FILTER_H +#define X265_TEMPORAL_FILTER_H + +#include "x265.h" +#include "picyuv.h" +#include "mv.h" +#include "piclist.h" +#include "yuv.h" +#include "motion.h" + +const int s_interpolationFilter168 = +{ + { 0, 0, 0, 64, 0, 0, 0, 0 }, //0 + { 0, 1, -3, 64, 4, -2, 0, 0 }, //1 -->--> + { 0, 1, -6, 62, 9, -3, 1, 0 }, //2 --> + { 0, 2, -8, 60, 14, -5, 1, 0 }, //3 -->--> + { 0, 2, -9, 57, 19, -7, 2, 0 }, //4 + { 0, 3, -10, 53, 24, -8, 2, 0 }, //5 -->--> + { 0, 3, -11, 50, 29, -9, 2, 0 }, //6 --> + { 0, 3, -11, 44, 35, -10, 3, 0 }, //7 -->--> + { 0, 1, -7, 38, 38, -7, 1, 0 }, //8 + { 0, 3, -10, 35, 44, -11, 3, 0 }, //9 -->--> + { 0, 2, -9, 29, 50, -11, 3, 0 }, //10--> + { 0, 2, -8, 24, 53, -10, 3, 0 }, //11-->--> + { 0, 2, -7, 19, 57, -9, 2, 0 }, //12 + { 0, 1, -5, 14, 60, -8, 2, 0 }, //13-->--> + { 0, 1, -3, 9, 62, -6, 1, 0 }, //14--> + { 0, 0, -2, 4, 64, -3, 1, 0 } //15-->--> +}; + +const double s_refStrengths34 = +{ // abs(POC offset) + // 1, 2 3 4 + {0.85, 0.57, 0.41, 0.33}, // m_range * 2 + {1.13, 0.97, 0.81, 0.57}, // m_range + {0.30, 0.30, 0.30, 0.30} // otherwise +}; + +namespace X265_NS { + class OrigPicBuffer + { + public: + PicList m_mcstfPicList; + PicList m_mcstfOrigPicFreeList; + PicList m_mcstfOrigPicList; + + ~OrigPicBuffer(); + void addPicture(Frame*); + void addEncPicture(Frame*); + void setOrigPicList(Frame*, int); + void recycleOrigPicList(); + void addPictureToFreelist(Frame*); + void addEncPictureToPicList(Frame*); + }; + + struct MotionEstimatorTLD + { + MotionEstimate me; + + MotionEstimatorTLD() + { + me.init(X265_CSP_I400); + me.setQP(X265_LOOKAHEAD_QP); + } + + ~MotionEstimatorTLD() {} + }; + + struct TemporalFilterRefPicInfo + { + PicYuv* picBuffer; + PicYuv* picBufferSubSampled2; + PicYuv* picBufferSubSampled4; + MV* mvs; + MV* mvs0; + MV* mvs1; + MV* mvs2; + uint32_t mvsStride; + uint32_t mvsStride0; + uint32_t mvsStride1; + uint32_t mvsStride2; + int* error; + int* noise; + + int16_t origOffset; + bool isFilteredFrame; + PicYuv* compensatedPic; + + int* isSubsampled; + + int slicetype; + }; + + class TemporalFilter + { + public: + TemporalFilter(); + ~TemporalFilter() {} + + void init(const x265_param* param); + + //private: + // Private static member variables + const x265_param *m_param; + int32_t m_bitDepth; + int m_range; + uint8_t m_numRef; + double m_chromaFactor; + double m_sigmaMultiplier; + double 
m_sigmaZeroPoint; + int m_motionVectorFactor; + int m_padding; + + // Private member variables + + int m_sourceWidth; + int m_sourceHeight; + int m_QP; + + int m_internalCsp; + int m_numComponents; + uint8_t m_sliceTypeConfig; + + MotionEstimatorTLD* m_metld; + Yuv predPUYuv; + int m_useSADinME; + + int createRefPicInfo(TemporalFilterRefPicInfo* refFrame, x265_param* param); + + void bilateralFilter(Frame* frame, TemporalFilterRefPicInfo* mctfRefList, double overallStrength); + + void motionEstimationLuma(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int bs, + MV *previous = 0, uint32_t prevmvStride = 0, int factor = 1); + + void motionEstimationLumaDoubleRes(MV *mvs, uint32_t mvStride, PicYuv *orig, PicYuv *buffer, int blockSize, + MV *previous, uint32_t prevMvStride, int factor, int* minError); + + int motionErrorLumaSSD(PicYuv *orig, + PicYuv *buffer, + int x, + int y, + int dx, + int dy, + int bs, + int besterror = 8 * 8 * 1024 * 1024); + + int motionErrorLumaSAD(PicYuv *orig, + PicYuv *buffer, + int x, + int y, + int dx, + int dy, + int bs, + int besterror = 8 * 8 * 1024 * 1024); + + void destroyRefPicInfo(TemporalFilterRefPicInfo* curFrame); + + void applyMotion(MV *mvs, uint32_t mvsStride, PicYuv *input, PicYuv *output); + + }; +} +#endif
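The s_interpolationFilter table above holds 8-tap, 1/16-pel luma filters whose coefficients sum to 64. As a standalone illustration (plain C++, not the encoder's motion-compensation path; clipping and edge handling are simplified), one row of the table interpolates a fractional-position sample like this:

#include <cstdio>
#include <cstdint>
#include <algorithm>

// Apply one 8-tap filter row (coefficients sum to 64) to eight neighbouring
// samples, round, normalize by 64 and clip to the 8-bit range.
static uint8_t interp8tap(const uint8_t samples[8], const int coeff[8])
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += coeff[i] * samples[i];
    int val = (sum + 32) >> 6;              // round and divide by 64
    return (uint8_t)std::min(255, std::max(0, val));
}

int main()
{
    // Half-pel row (index 8 in the table above): { 0, 1, -7, 38, 38, -7, 1, 0 }
    const int     halfPel[8] = { 0, 1, -7, 38, 38, -7, 1, 0 };
    const uint8_t line[8]    = { 90, 95, 100, 110, 120, 130, 135, 140 };
    std::printf("half-pel sample = %u\n", interp8tap(line, halfPel));
    return 0;
}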
View file
x265_3.5.tar.gz/source/common/threading.h -> x265_3.6.tar.gz/source/common/threading.h
Changed
@@ -3,6 +3,7 @@ * * Authors: Steve Borho <steve@borho.org> * Min Chen <chenm003@163.com> + liwei <liwei@multicorewareinc.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -253,6 +254,47 @@ int m_val; }; +class NamedSemaphore +{ +public: + NamedSemaphore() : m_sem(NULL) + { + } + + ~NamedSemaphore() + { + } + + bool create(const char* name, const int initcnt, const int maxcnt) + { + if(!m_sem) + { + m_sem = CreateSemaphoreA(NULL, initcnt, maxcnt, name); + } + return m_sem != NULL; + } + + bool give(const int32_t cnt) + { + return ReleaseSemaphore(m_sem, (LONG)cnt, NULL) != FALSE; + } + + bool take(const uint32_t time_out = INFINITE) + { + int32_t rt = WaitForSingleObject(m_sem, time_out); + return rt != WAIT_TIMEOUT && rt != WAIT_FAILED; + } + + void release() + { + CloseHandle(m_sem); + m_sem = NULL; + } + +private: + HANDLE m_sem; +}; + #else /* POSIX / pthreads */ typedef pthread_t ThreadHandle; @@ -459,6 +501,282 @@ int m_val; }; +#define TIMEOUT_INFINITE 0xFFFFFFFF + +class NamedSemaphore +{ +public: + NamedSemaphore() + : m_sem(NULL) +#ifndef __APPLE__ + , m_name(NULL) +#endif //__APPLE__ + { + } + + ~NamedSemaphore() + { + } + + bool create(const char* name, const int initcnt, const int maxcnt) + { + bool ret = false; + + if (initcnt >= maxcnt) + { + return false; + } + +#ifdef __APPLE__ + do + { + int32_t pshared = name != NULL ? PTHREAD_PROCESS_SHARED : PTHREAD_PROCESS_PRIVATE; + + m_sem = (mac_sem_t *)malloc(sizeof(mac_sem_t)); + if (!m_sem) + { + break; + } + + if (pthread_mutexattr_init(&m_sem->mutexAttr)) + { + break; + } + + if (pthread_mutexattr_setpshared(&m_sem->mutexAttr, pshared)) + { + break; + } + + if (pthread_condattr_init(&m_sem->condAttr)) + { + break; + } + + if (pthread_condattr_setpshared(&m_sem->condAttr, pshared)) + { + break; + } + + if (pthread_mutex_init(&m_sem->mutex, &m_sem->mutexAttr)) + { + break; + } + + if (pthread_cond_init(&m_sem->cond, &m_sem->condAttr)) + { + break; + } + + m_sem->curCnt = initcnt; + m_sem->maxCnt = maxcnt; + + ret = true; + } while (0); + + if (!ret) + { + release(); + } + +#else //__APPLE__ + m_sem = sem_open(name, O_CREAT | O_EXCL, 0666, initcnt); + if (m_sem != SEM_FAILED) + { + m_name = strdup(name); + ret = true; + } + else + { + if (EEXIST == errno) + { + m_sem = sem_open(name, 0); + if (m_sem != SEM_FAILED) + { + m_name = strdup(name); + ret = true; + } + } + } +#endif //__APPLE__ + + return ret; + } + + bool give(const int32_t cnt) + { + if (!m_sem) + { + return false; + } + +#ifdef __APPLE__ + if (pthread_mutex_lock(&m_sem->mutex)) + { + return false; + } + + int oldCnt = m_sem->curCnt; + m_sem->curCnt += cnt; + if (m_sem->curCnt > m_sem->maxCnt) + { + m_sem->curCnt = m_sem->maxCnt; + } + + bool ret = true; + if (!oldCnt) + { + ret = 0 == pthread_cond_broadcast(&m_sem->cond); + } + + if (pthread_mutex_unlock(&m_sem->mutex)) + { + return false; + } + + return ret; +#else //__APPLE__ + int ret = 0; + int32_t curCnt = cnt; + while (curCnt-- && !ret) { + ret = sem_post(m_sem); + } + + return 0 == ret; +#endif //_APPLE__ + } + + bool take(const uint32_t time_out = TIMEOUT_INFINITE) + { + if (!m_sem) + { + return false; + } + +#ifdef __APPLE__ + + if (pthread_mutex_lock(&m_sem->mutex)) + { + return false; + } + + bool ret = true; + if (TIMEOUT_INFINITE == time_out) + { + if (!m_sem->curCnt) + { + if (pthread_cond_wait(&m_sem->cond, &m_sem->mutex)) + { + ret = false; + } + } + + if (m_sem->curCnt && ret) + { + m_sem->curCnt--; + 
} + } + else + { + if (0 == time_out) + { + if (m_sem->curCnt) + { + m_sem->curCnt--; + } + else + { + ret = false; + } + } + else + { + if (!m_sem->curCnt) + { + struct timespec ts; + ts.tv_sec = time_out / 1000L; + ts.tv_nsec = (time_out * 1000000L) - ts.tv_sec * 1000 * 1000 * 1000; + + if (pthread_cond_timedwait(&m_sem->cond, &m_sem->mutex, &ts)) + { + ret = false; + } + } + + if (m_sem->curCnt && ret) + { + m_sem->curCnt--; + } + } + } + + if (pthread_mutex_unlock(&m_sem->mutex)) + { + return false; + } + + return ret; +#else //__APPLE__ + if (TIMEOUT_INFINITE == time_out) + { + return 0 == sem_wait(m_sem); + } + else + { + if (0 == time_out) + { + return 0 == sem_trywait(m_sem); + } + else + { + struct timespec ts; + ts.tv_sec = time_out / 1000L; + ts.tv_nsec = (time_out * 1000000L) - ts.tv_sec * 1000 * 1000 * 1000; + return 0 == sem_timedwait(m_sem, &ts); + } + } +#endif //_APPLE__ + } + + void release() + { + if (m_sem) + { +#ifdef __APPLE__ + pthread_condattr_destroy(&m_sem->condAttr); + pthread_mutexattr_destroy(&m_sem->mutexAttr); + pthread_mutex_destroy(&m_sem->mutex); + pthread_cond_destroy(&m_sem->cond); + free(m_sem); + m_sem = NULL; +#else //__APPLE__ + sem_close(m_sem); + sem_unlink(m_name); + m_sem = NULL; + free(m_name); + m_name = NULL; +#endif //__APPLE__ + } + } + +private: +#ifdef __APPLE__ + typedef struct + { + pthread_mutex_t mutex; + pthread_cond_t cond; + pthread_mutexattr_t mutexAttr; + pthread_condattr_t condAttr; + uint32_t curCnt; + uint32_t maxCnt; + }mac_sem_t; + mac_sem_t *m_sem; +#else // __APPLE__ + sem_t *m_sem; + char *m_name; +#endif // __APPLE_ +}; + #endif // ifdef _WIN32 class ScopedLock
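On non-Apple POSIX systems the NamedSemaphore class added above is a thin wrapper over sem_open/sem_post/sem_timedwait. A minimal standalone sketch of that underlying pattern (illustrative semaphore name, single process, error handling trimmed; this is not x265 code):

#include <fcntl.h>      // O_CREAT, O_EXCL
#include <semaphore.h>
#include <cstdio>
#include <cerrno>

int main()
{
    const char* name = "/demo_named_sem";   // illustrative name, not used by x265

    // Create a named semaphore with count 1, or attach if it already exists.
    sem_t* sem = sem_open(name, O_CREAT | O_EXCL, 0666, 1);
    if (sem == SEM_FAILED && errno == EEXIST)
        sem = sem_open(name, 0);
    if (sem == SEM_FAILED)
    {
        std::perror("sem_open");
        return 1;
    }

    if (sem_wait(sem) == 0)                 // "take"
    {
        std::puts("acquired");
        sem_post(sem);                      // "give"
    }

    sem_close(sem);                         // "release" in the wrapper also unlinks
    sem_unlink(name);
    return 0;
}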
View file
x265_3.5.tar.gz/source/common/threadpool.cpp -> x265_3.6.tar.gz/source/common/threadpool.cpp
Changed
@@ -301,7 +301,7 @@ /* limit threads based on param->numaPools * For windows because threads can't be allocated to live across sockets * changing the default behavior to be per-socket pools -- FIXME */ -#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 || HAVE_LIBNUMA if (!p->numaPools || (strcmp(p->numaPools, "NULL") == 0 || strcmp(p->numaPools, "*") == 0 || strcmp(p->numaPools, "") == 0)) { char poolString[50] = "";
View file
x265_3.5.tar.gz/source/common/version.cpp -> x265_3.6.tar.gz/source/common/version.cpp
Changed
@@ -71,7 +71,7 @@ #define ONOS "Unk-OS" #endif -#if X86_64 +#if defined(_LP64) || defined(_WIN64) #define BITS "64 bit" #else #define BITS "32 bit"
View file
x265_3.5.tar.gz/source/common/x86/asm-primitives.cpp -> x265_3.6.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -1091,6 +1091,7 @@ p.frameInitLowres = PFX(frame_init_lowres_core_sse2); p.frameInitLowerRes = PFX(frame_init_lowres_core_sse2); + p.frameSubSampleLuma = PFX(frame_subsample_luma_sse2); // TODO: the planecopy_sp is really planecopy_SC now, must be fix it //p.planecopy_sp = PFX(downShift_16_sse2); p.planecopy_sp_shl = PFX(upShift_16_sse2); @@ -1121,6 +1122,7 @@ { ASSIGN2(p.scale1D_128to64, scale1D_128to64_ssse3); p.scale2D_64to32 = PFX(scale2D_64to32_ssse3); + p.frameSubSampleLuma = PFX(frame_subsample_luma_ssse3); // p.puLUMA_4x4.satd = p.cuBLOCK_4x4.sa8d = PFX(pixel_satd_4x4_ssse3); this one is broken ALL_LUMA_PU(satd, pixel_satd, ssse3); @@ -1462,6 +1464,7 @@ p.puLUMA_64x48.copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x48_avx); p.puLUMA_64x64.copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x64_avx); p.propagateCost = PFX(mbtree_propagate_cost_avx); + p.frameSubSampleLuma = PFX(frame_subsample_luma_avx); } if (cpuMask & X265_CPU_XOP) { @@ -1473,6 +1476,7 @@ LUMA_VAR(xop); p.frameInitLowres = PFX(frame_init_lowres_core_xop); p.frameInitLowerRes = PFX(frame_init_lowres_core_xop); + p.frameSubSampleLuma = PFX(frame_subsample_luma_xop); } if (cpuMask & X265_CPU_AVX2) { @@ -2301,6 +2305,9 @@ p.frameInitLowres = PFX(frame_init_lowres_core_avx2); p.frameInitLowerRes = PFX(frame_init_lowres_core_avx2); + + p.frameSubSampleLuma = PFX(frame_subsample_luma_avx2); + p.propagateCost = PFX(mbtree_propagate_cost_avx2); p.fix8Unpack = PFX(cutree_fix8_unpack_avx2); p.fix8Pack = PFX(cutree_fix8_pack_avx2); @@ -3300,6 +3307,7 @@ //p.frameInitLowres = PFX(frame_init_lowres_core_mmx2); p.frameInitLowres = PFX(frame_init_lowres_core_sse2); p.frameInitLowerRes = PFX(frame_init_lowres_core_sse2); + p.frameSubSampleLuma = PFX(frame_subsample_luma_sse2); ALL_LUMA_TU(blockfill_sNONALIGNED, blockfill_s, sse2); ALL_LUMA_TU(blockfill_sALIGNED, blockfill_s, sse2); @@ -3424,6 +3432,8 @@ ASSIGN2(p.scale1D_128to64, scale1D_128to64_ssse3); p.scale2D_64to32 = PFX(scale2D_64to32_ssse3); + p.frameSubSampleLuma = PFX(frame_subsample_luma_ssse3); + ASSIGN2(p.puLUMA_8x4.convert_p2s, filterPixelToShort_8x4_ssse3); ASSIGN2(p.puLUMA_8x8.convert_p2s, filterPixelToShort_8x8_ssse3); ASSIGN2(p.puLUMA_8x16.convert_p2s, filterPixelToShort_8x16_ssse3); @@ -3691,6 +3701,7 @@ p.frameInitLowres = PFX(frame_init_lowres_core_avx); p.frameInitLowerRes = PFX(frame_init_lowres_core_avx); p.propagateCost = PFX(mbtree_propagate_cost_avx); + p.frameSubSampleLuma = PFX(frame_subsample_luma_avx); } if (cpuMask & X265_CPU_XOP) { @@ -3702,6 +3713,7 @@ p.cuBLOCK_16x16.sse_pp = PFX(pixel_ssd_16x16_xop); p.frameInitLowres = PFX(frame_init_lowres_core_xop); p.frameInitLowerRes = PFX(frame_init_lowres_core_xop); + p.frameSubSampleLuma = PFX(frame_subsample_luma_xop); } #if X86_64 @@ -4684,6 +4696,8 @@ p.saoCuStatsE2 = PFX(saoCuStatsE2_avx2); p.saoCuStatsE3 = PFX(saoCuStatsE3_avx2); + p.frameSubSampleLuma = PFX(frame_subsample_luma_avx2); + if (cpuMask & X265_CPU_BMI2) { p.scanPosLast = PFX(scanPosLast_avx2_bmi2);
View file
x265_3.5.tar.gz/source/common/x86/const-a.asm -> x265_3.6.tar.gz/source/common/x86/const-a.asm
Changed
@@ -100,7 +100,7 @@ const pw_2000, times 16 dw 0x2000 const pw_8000, times 8 dw 0x8000 const pw_3fff, times 16 dw 0x3fff -const pw_32_0, times 4 dw 32, +const pw_32_0, times 4 dw 32 times 4 dw 0 const pw_pixel_max, times 16 dw ((1 << BIT_DEPTH)-1)
View file
x265_3.5.tar.gz/source/common/x86/h-ipfilter8.asm -> x265_3.6.tar.gz/source/common/x86/h-ipfilter8.asm
Changed
@@ -125,6 +125,9 @@ ALIGN 32 interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12 +ALIGN 32 +const interp_4tap_8x8_horiz_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 + SECTION .text cextern pw_1 @@ -1459,8 +1462,6 @@ RET -ALIGN 32 -const interp_4tap_8x8_horiz_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 %macro FILTER_H4_w6 3 movu %1, [srcq - 1]
View file
x265_3.5.tar.gz/source/common/x86/mc-a2.asm -> x265_3.6.tar.gz/source/common/x86/mc-a2.asm
Changed
@@ -992,6 +992,262 @@ FRAME_INIT_LOWRES %endif +%macro SUBSAMPLEFILT8x4 7 + mova %3, r0+%7 + mova %4, r0+r2+%7 + pavgb %3, %4 + pavgb %4, r0+r2*2+%7 + PALIGNR %1, %3, 1, m6 + PALIGNR %2, %4, 1, m6 +%if cpuflag(xop) + pavgb %1, %3 + pavgb %2, %4 +%else + pavgb %1, %3 + pavgb %2, %4 + psrlw %5, %1, 8 + psrlw %6, %2, 8 + pand %1, m7 + pand %2, m7 +%endif +%endmacro + +%macro SUBSAMPLEFILT32x4U 1 + movu m1, r0+r2 + pavgb m0, m1, r0 + movu m3, r0+r2+1 + pavgb m2, m3, r0+1 + pavgb m1, r0+r2*2 + pavgb m3, r0+r2*2+1 + pavgb m0, m2 + pavgb m1, m3 + + movu m3, r0+r2+mmsize + pavgb m2, m3, r0+mmsize + movu m5, r0+r2+1+mmsize + pavgb m4, m5, r0+1+mmsize + pavgb m2, m4 + + pshufb m0, m7 + pshufb m2, m7 + punpcklqdq m0, m0, m2 + vpermq m0, m0, q3120 + movu %1, m0 +%endmacro + +%macro SUBSAMPLEFILT16x2 3 + mova m3, r0+%3+mmsize + mova m2, r0+%3 + pavgb m3, r0+%3+r2+mmsize + pavgb m2, r0+%3+r2 + PALIGNR %1, m3, 1, m6 + pavgb %1, m3 + PALIGNR m3, m2, 1, m6 + pavgb m3, m2 +%if cpuflag(xop) + vpperm m3, m3, %1, m6 +%else + pand m3, m7 + pand %1, m7 + packuswb m3, %1 +%endif + mova %2, m3 + mova %1, m2 +%endmacro + +%macro SUBSAMPLEFILT8x2U 2 + mova m2, r0+%2 + pavgb m2, r0+%2+r2 + mova m0, r0+%2+1 + pavgb m0, r0+%2+r2+1 + pavgb m1, m3 + pavgb m0, m2 + pand m1, m7 + pand m0, m7 + packuswb m0, m1 + mova %1, m0 +%endmacro + +%macro SUBSAMPLEFILT8xU 2 + mova m3, r0+%2+8 + mova m2, r0+%2 + pavgw m3, r0+%2+r2+8 + pavgw m2, r0+%2+r2 + movu m1, r0+%2+10 + movu m0, r0+%2+2 + pavgw m1, r0+%2+r2+10 + pavgw m0, r0+%2+r2+2 + pavgw m1, m3 + pavgw m0, m2 + psrld m3, m1, 16 + pand m1, m7 + pand m0, m7 + packssdw m0, m1 + movu %1, m0 +%endmacro + +%macro SUBSAMPLEFILT8xA 3 + movu m3, r0+%3+mmsize + movu m2, r0+%3 + pavgw m3, r0+%3+r2+mmsize + pavgw m2, r0+%3+r2 + PALIGNR %1, m3, 2, m6 + pavgw %1, m3 + PALIGNR m3, m2, 2, m6 + pavgw m3, m2 +%if cpuflag(xop) + vpperm m3, m3, %1, m6 +%else + pand m3, m7 + pand %1, m7 + packssdw m3, %1 +%endif +%if cpuflag(avx2) + vpermq m3, m3, q3120 +%endif + movu %2, m3 + movu %1, m2 +%endmacro + +;----------------------------------------------------------------------------- +; void frame_subsample_luma( uint8_t *src0, uint8_t *dst0, +; intptr_t src_stride, intptr_t dst_stride, int width, int height ) +;----------------------------------------------------------------------------- + +%macro FRAME_SUBSAMPLE_LUMA 0 +cglobal frame_subsample_luma, 6,7,(12-4*(BIT_DEPTH/9)) ; 8 for HIGH_BIT_DEPTH, 12 otherwise +%if HIGH_BIT_DEPTH + shl dword r3m, 1 + FIX_STRIDES r2 + shl dword r4m, 1 +%endif +%if mmsize >= 16 + add dword r4m, mmsize-1 + and dword r4m, ~(mmsize-1) +%endif + ; src += 2*(height-1)*stride + 2*width + mov r6d, r5m + dec r6d + imul r6d, r2d + add r6d, r4m + lea r0, r0+r6*2 + ; dst += (height-1)*stride + width + mov r6d, r5m + dec r6d + imul r6d, r3m + add r6d, r4m + add r1, r6 + ; gap = stride - width + mov r6d, r3m + sub r6d, r4m + PUSH r6 + %define dst_gap rsp+gprsize + mov r6d, r2d + sub r6d, r4m + shl r6d, 1 + PUSH r6 + %define src_gap rsp +%if HIGH_BIT_DEPTH +%if cpuflag(xop) + mova m6, deinterleave_shuf32a + mova m7, deinterleave_shuf32b +%else + pcmpeqw m7, m7 + psrld m7, 16 +%endif +.vloop: + mov r6d, r4m +%ifnidn cpuname, mmx2 + movu m0, r0 + movu m1, r0+r2 + pavgw m0, m1 + pavgw m1, r0+r2*2 +%endif +.hloop: + sub r0, mmsize*2 + sub r1, mmsize +%ifidn cpuname, mmx2 + SUBSAMPLEFILT8xU r1, 0 +%else + SUBSAMPLEFILT8xA m0, r1, 0 +%endif + sub r6d, mmsize + jg .hloop +%else ; !HIGH_BIT_DEPTH +%if cpuflag(avx2) + mova m7, deinterleave_shuf +%elif cpuflag(xop) + mova m6, 
deinterleave_shuf32a + mova m7, deinterleave_shuf32b +%else + pcmpeqb m7, m7 + psrlw m7, 8 +%endif +.vloop: + mov r6d, r4m +%ifnidn cpuname, mmx2 +%if mmsize <= 16 + mova m0, r0 + mova m1, r0+r2 + pavgb m0, m1 + pavgb m1, r0+r2*2 +%endif +%endif +.hloop: + sub r0, mmsize*2 + sub r1, mmsize +%if mmsize==32 + SUBSAMPLEFILT32x4U r1 +%elifdef m8 + SUBSAMPLEFILT8x4 m0, m1, m2, m3, m10, m11, mmsize + mova m8, m0 + mova m9, m1 + SUBSAMPLEFILT8x4 m2, m3, m0, m1, m4, m5, 0 +%if cpuflag(xop) + vpperm m4, m2, m8, m7 + vpperm m2, m2, m8, m6 +%else + packuswb m2, m8 +%endif + mova r1, m2 +%elifidn cpuname, mmx2 + SUBSAMPLEFILT8x2U r1, 0 +%else + SUBSAMPLEFILT16x2 m0, r1, 0 +%endif + sub r6d, mmsize + jg .hloop +%endif ; HIGH_BIT_DEPTH +.skip: + mov r3, dst_gap + sub r0, src_gap + sub r1, r3 + dec dword r5m + jg .vloop + ADD rsp, 2*gprsize + emms + RET +%endmacro ; FRAME_SUBSAMPLE_LUMA + +INIT_MMX mmx2 +FRAME_SUBSAMPLE_LUMA +%if ARCH_X86_64 == 0 +INIT_MMX cache32, mmx2 +FRAME_SUBSAMPLE_LUMA +%endif +INIT_XMM sse2 +FRAME_SUBSAMPLE_LUMA +INIT_XMM ssse3 +FRAME_SUBSAMPLE_LUMA +INIT_XMM avx +FRAME_SUBSAMPLE_LUMA +INIT_XMM xop +FRAME_SUBSAMPLE_LUMA +%if ARCH_X86_64 == 1 +INIT_YMM avx2 +FRAME_SUBSAMPLE_LUMA +%endif + ;----------------------------------------------------------------------------- ; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, int32_t *intra_costs, ; uint16_t *inter_costs, int32_t *inv_qscales, double *fps_factor, int len )
View file
x265_3.5.tar.gz/source/common/x86/mc.h -> x265_3.6.tar.gz/source/common/x86/mc.h
Changed
@@ -36,6 +36,17 @@ #undef LOWRES +#define SUBSAMPLELUMA(cpu) \ + void PFX(frame_subsample_luma_ ## cpu)(const pixel* src0, pixel* dst0, intptr_t src_stride, intptr_t dst_stride, int width, int height); +SUBSAMPLELUMA(mmx2) +SUBSAMPLELUMA(sse2) +SUBSAMPLELUMA(ssse3) +SUBSAMPLELUMA(avx) +SUBSAMPLELUMA(avx2) +SUBSAMPLELUMA(xop) + +#undef SUBSAMPLELUMA + #define PROPAGATE_COST(cpu) \ void PFX(mbtree_propagate_cost_ ## cpu)(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, \ const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
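The frame_subsample_luma primitives declared here produce a half-resolution luma plane. A plain scalar approximation (simple 2x2 box average, 8-bit pixels; the pavgb-based assembly in mc-a2.asm may differ in rounding, so this is only a sketch of the intent, not a bit-exact reference):

#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <vector>

typedef uint8_t pixel;   // 8-bit build; HIGH_BIT_DEPTH builds use 16-bit pixels

// Scalar sketch of frame_subsample_luma(): average each 2x2 block of the
// source into one destination sample.
static void frameSubsampleLumaRef(const pixel* src, pixel* dst,
                                  intptr_t srcStride, intptr_t dstStride,
                                  int width, int height)
{
    for (int y = 0; y < height / 2; y++)
    {
        const pixel* s = src + 2 * y * srcStride;
        pixel*       d = dst + y * dstStride;
        for (int x = 0; x < width / 2; x++)
        {
            int sum = s[2 * x] + s[2 * x + 1]
                    + s[2 * x + srcStride] + s[2 * x + 1 + srcStride];
            d[x] = (pixel)((sum + 2) >> 2);   // rounded average of 4 samples
        }
    }
}

int main()
{
    const int w = 8, h = 4;
    std::vector<pixel> src(w * h), dst((w / 2) * (h / 2));
    for (int i = 0; i < w * h; i++)
        src[i] = (pixel)i;
    frameSubsampleLumaRef(src.data(), dst.data(), w, w / 2, w, h);
    std::printf("dst[0]=%d\n", dst[0]);
    return 0;
}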
View file
x265_3.5.tar.gz/source/common/x86/x86inc.asm -> x265_3.6.tar.gz/source/common/x86/x86inc.asm
Changed
@@ -401,16 +401,6 @@ %endif %endmacro -%macro DEFINE_ARGS_INTERNAL 3+ - %ifnum %2 - DEFINE_ARGS %3 - %elif %1 == 4 - DEFINE_ARGS %2 - %elif %1 > 4 - DEFINE_ARGS %2, %3 - %endif -%endmacro - %if WIN64 ; Windows x64 ;================================================= DECLARE_REG 0, rcx @@ -429,7 +419,7 @@ DECLARE_REG 13, R12, 112 DECLARE_REG 14, R13, 120 -%macro PROLOGUE 2-5+ 0 ; #args, #regs, #xmm_regs, stack_size, arg_names... +%macro PROLOGUE 2-5+ 0, 0 ; #args, #regs, #xmm_regs, stack_size, arg_names... %assign num_args %1 %assign regs_used %2 ASSERT regs_used >= num_args @@ -441,7 +431,15 @@ WIN64_SPILL_XMM %3 %endif LOAD_IF_USED 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 - DEFINE_ARGS_INTERNAL %0, %4, %5 + %if %0 > 4 + %ifnum %4 + DEFINE_ARGS %5 + %else + DEFINE_ARGS %4, %5 + %endif + %elifnnum %4 + DEFINE_ARGS %4 + %endif %endmacro %macro WIN64_PUSH_XMM 0 @@ -537,7 +535,7 @@ DECLARE_REG 13, R12, 64 DECLARE_REG 14, R13, 72 -%macro PROLOGUE 2-5+ 0; #args, #regs, #xmm_regs, stack_size, arg_names... +%macro PROLOGUE 2-5+ 0, 0 ; #args, #regs, #xmm_regs, stack_size, arg_names... %assign num_args %1 %assign regs_used %2 %assign xmm_regs_used %3 @@ -547,7 +545,15 @@ PUSH_IF_USED 9, 10, 11, 12, 13, 14 ALLOC_STACK %4 LOAD_IF_USED 6, 7, 8, 9, 10, 11, 12, 13, 14 - DEFINE_ARGS_INTERNAL %0, %4, %5 + %if %0 > 4 + %ifnum %4 + DEFINE_ARGS %5 + %else + DEFINE_ARGS %4, %5 + %endif + %elifnnum %4 + DEFINE_ARGS %4 + %endif %endmacro %define has_epilogue regs_used > 9 || stack_size > 0 || vzeroupper_required @@ -588,7 +594,7 @@ DECLARE_ARG 7, 8, 9, 10, 11, 12, 13, 14 -%macro PROLOGUE 2-5+ ; #args, #regs, #xmm_regs, stack_size, arg_names... +%macro PROLOGUE 2-5+ 0, 0 ; #args, #regs, #xmm_regs, stack_size, arg_names... %assign num_args %1 %assign regs_used %2 ASSERT regs_used >= num_args @@ -603,7 +609,15 @@ PUSH_IF_USED 3, 4, 5, 6 ALLOC_STACK %4 LOAD_IF_USED 0, 1, 2, 3, 4, 5, 6 - DEFINE_ARGS_INTERNAL %0, %4, %5 + %if %0 > 4 + %ifnum %4 + DEFINE_ARGS %5 + %else + DEFINE_ARGS %4, %5 + %endif + %elifnnum %4 + DEFINE_ARGS %4 + %endif %endmacro %define has_epilogue regs_used > 3 || stack_size > 0 || vzeroupper_required
View file
x265_3.5.tar.gz/source/common/x86/x86util.asm -> x265_3.6.tar.gz/source/common/x86/x86util.asm
Changed
@@ -578,8 +578,10 @@ %elif %1==2 %if mmsize==8 SBUTTERFLY dq, %3, %4, %5 - %else + %elif %0==6 TRANS q, ORDER, %3, %4, %5, %6 + %else + TRANS q, ORDER, %3, %4, %5 %endif %elif %1==4 SBUTTERFLY qdq, %3, %4, %5
View file
x265_3.5.tar.gz/source/encoder/analysis.cpp -> x265_3.6.tar.gz/source/encoder/analysis.cpp
Changed
@@ -3645,7 +3645,7 @@ qp += distortionData->offset[ctu.m_cuAddr]; } - if (m_param->analysisLoadReuseLevel == 10 && m_param->rc.cuTree) + if (m_param->analysisLoadReuseLevel >= 2 && m_param->rc.cuTree) { int cuIdx = (ctu.m_cuAddr * ctu.m_numPartitions) + cuGeom.absPartIdx; if (ctu.m_slice->m_sliceType == I_SLICE)
View file
x265_3.5.tar.gz/source/encoder/api.cpp -> x265_3.6.tar.gz/source/encoder/api.cpp
Changed
@@ -208,7 +208,6 @@ memcpy(zoneParam, param, sizeof(x265_param)); for (int i = 0; i < param->rc.zonefileCount; i++) { - param->rc.zones[i].startFrame = -1; encoder->configureZone(zoneParam, param->rc.zones[i].zoneParam); } @@ -608,6 +607,14 @@ if (numEncoded < 0) encoder->m_aborted = true; + if ((!encoder->m_numDelayedPic && !numEncoded) && (encoder->m_param->bEnableEndOfSequence || encoder->m_param->bEnableEndOfBitstream)) + { + Bitstream bs; + encoder->getEndNalUnits(encoder->m_nalList, bs); + *pp_nal = &encoder->m_nalList.m_nal[0]; + if (pi_nal) *pi_nal = encoder->m_nalList.m_numNal; + } + return numEncoded; } @@ -1042,6 +1049,7 @@ &PARAM_NS::x265_param_free, &PARAM_NS::x265_param_default, &PARAM_NS::x265_param_parse, + &PARAM_NS::x265_scenecut_aware_qp_param_parse, &PARAM_NS::x265_param_apply_profile, &PARAM_NS::x265_param_default_preset, &x265_picture_alloc, @@ -1288,6 +1296,8 @@ if (param->csvLogLevel) { fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, Scenecut, "); + if (!!param->bEnableTemporalSubLayers) + fprintf(csvfp, "Temporal Sub Layer ID, "); if (param->csvLogLevel >= 2) fprintf(csvfp, "I/P cost ratio, "); if (param->rc.rateControlMode == X265_RC_CRF) @@ -1401,6 +1411,8 @@ const x265_frame_stats* frameStats = &pic->frameData; fprintf(param->csvfpt, "%d, %c-SLICE, %4d, %2.2lf, %10d, %d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc, frameStats->qp, (int)frameStats->bits, frameStats->bScenecut); + if (!!param->bEnableTemporalSubLayers) + fprintf(param->csvfpt, "%d,", frameStats->tLayer); if (param->csvLogLevel >= 2) fprintf(param->csvfpt, "%.2f,", frameStats->ipCostRatio); if (param->rc.rateControlMode == X265_RC_CRF)
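With this api.cpp change, a trailing x265_encoder_encode() call can still return end-of-sequence / end-of-bitstream NAL units when bEnableEndOfSequence or bEnableEndOfBitstream is set, even though no frame is produced. A hedged caller-side sketch of the flush loop that picks those up (public API as declared in x265.h; encoder setup, error handling and pic_out handling elided):

#include <cstdio>
#include "x265.h"   // public API header shipped with the library

// Drain the encoder after the last input picture. NALs returned by every
// call are written, including the final call that returns 0 encoded frames
// but may still carry EOS/EOB units.
static void flushEncoder(x265_encoder* enc, FILE* out)
{
    x265_nal* nal = NULL;
    uint32_t  nalCount = 0;

    int ret;
    do
    {
        ret = x265_encoder_encode(enc, &nal, &nalCount, NULL, NULL);
        for (uint32_t i = 0; i < nalCount; i++)
            fwrite(nal[i].payload, 1, nal[i].sizeBytes, out);
    }
    while (ret > 0);
}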
View file
x265_3.5.tar.gz/source/encoder/dpb.cpp -> x265_3.6.tar.gz/source/encoder/dpb.cpp
Changed
@@ -70,10 +70,18 @@ { Frame *curFrame = iterFrame; iterFrame = iterFrame->m_next; - if (!curFrame->m_encData->m_bHasReferences && !curFrame->m_countRefEncoders) + bool isMCSTFReferenced = false; + + if (curFrame->m_param->bEnableTemporalFilter) + isMCSTFReferenced =!!(curFrame->m_refPicCnt1); + + if (!curFrame->m_encData->m_bHasReferences && !curFrame->m_countRefEncoders && !isMCSTFReferenced) { curFrame->m_bChromaExtended = false; + if (curFrame->m_param->bEnableTemporalFilter) + *curFrame->m_isSubSampled = false; + // Reset column counter X265_CHECK(curFrame->m_reconRowFlag != NULL, "curFrame->m_reconRowFlag check failure"); X265_CHECK(curFrame->m_reconColCount != NULL, "curFrame->m_reconColCount check failure"); @@ -142,12 +150,13 @@ { newFrame->m_encData->m_bHasReferences = false; + newFrame->m_tempLayer = (newFrame->m_param->bEnableTemporalSubLayers && !m_bTemporalSublayer) ? 1 : newFrame->m_tempLayer; // Adjust NAL type for unreferenced B frames (change from _R "referenced" // to _N "non-referenced" NAL unit type) switch (slice->m_nalUnitType) { case NAL_UNIT_CODED_SLICE_TRAIL_R: - slice->m_nalUnitType = m_bTemporalSublayer ? NAL_UNIT_CODED_SLICE_TSA_N : NAL_UNIT_CODED_SLICE_TRAIL_N; + slice->m_nalUnitType = newFrame->m_param->bEnableTemporalSubLayers ? NAL_UNIT_CODED_SLICE_TSA_N : NAL_UNIT_CODED_SLICE_TRAIL_N; break; case NAL_UNIT_CODED_SLICE_RADL_R: slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RADL_N; @@ -168,13 +177,94 @@ m_picList.pushFront(*newFrame); + if (m_bTemporalSublayer && getTemporalLayerNonReferenceFlag()) + { + switch (slice->m_nalUnitType) + { + case NAL_UNIT_CODED_SLICE_TRAIL_R: + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TRAIL_N; + break; + case NAL_UNIT_CODED_SLICE_RADL_R: + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RADL_N; + break; + case NAL_UNIT_CODED_SLICE_RASL_R: + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RASL_N; + break; + default: + break; + } + } // Do decoding refresh marking if any decodingRefreshMarking(pocCurr, slice->m_nalUnitType); - computeRPS(pocCurr, slice->isIRAP(), &slice->m_rps, slice->m_sps->maxDecPicBuffering); - + computeRPS(pocCurr, newFrame->m_tempLayer, slice->isIRAP(), &slice->m_rps, slice->m_sps->maxDecPicBufferingnewFrame->m_tempLayer); + bool isTSAPic = ((slice->m_nalUnitType == 2) || (slice->m_nalUnitType == 3)) ? 
true : false; // Mark pictures in m_piclist as unreferenced if they are not included in RPS - applyReferencePictureSet(&slice->m_rps, pocCurr); + applyReferencePictureSet(&slice->m_rps, pocCurr, newFrame->m_tempLayer, isTSAPic); + + + if (m_bTemporalSublayer && newFrame->m_tempLayer > 0 + && !(slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RADL_N // Check if not a leading picture + || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RADL_R + || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RASL_N + || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_RASL_R) + ) + { + if (isTemporalLayerSwitchingPoint(pocCurr, newFrame->m_tempLayer) || (slice->m_sps->maxTempSubLayers == 1)) + { + if (getTemporalLayerNonReferenceFlag()) + { + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TSA_N; + } + else + { + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TSA_R; + } + } + else if (isStepwiseTemporalLayerSwitchingPoint(&slice->m_rps, pocCurr, newFrame->m_tempLayer)) + { + bool isSTSA = true; + int id = newFrame->m_gopOffset % x265_gop_ra_lengthnewFrame->m_gopId; + for (int ii = id; (ii < x265_gop_ra_lengthnewFrame->m_gopId && isSTSA == true); ii++) + { + int tempIdRef = x265_gop_ranewFrame->m_gopIdii.layer; + if (tempIdRef == newFrame->m_tempLayer) + { + for (int jj = 0; jj < slice->m_rps.numberOfPositivePictures + slice->m_rps.numberOfNegativePictures; jj++) + { + if (slice->m_rps.bUsedjj) + { + int refPoc = x265_gop_ranewFrame->m_gopIdii.poc_offset + slice->m_rps.deltaPOCjj; + int kk = 0; + for (kk = 0; kk < x265_gop_ra_lengthnewFrame->m_gopId; kk++) + { + if (x265_gop_ranewFrame->m_gopIdkk.poc_offset == refPoc) + { + break; + } + } + if (x265_gop_ranewFrame->m_gopIdkk.layer >= newFrame->m_tempLayer) + { + isSTSA = false; + break; + } + } + } + } + } + if (isSTSA == true) + { + if (getTemporalLayerNonReferenceFlag()) + { + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_STSA_N; + } + else + { + slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_STSA_R; + } + } + } + } if (slice->m_sliceType != I_SLICE) slice->m_numRefIdx0 = x265_clip3(1, newFrame->m_param->maxNumReferences, slice->m_rps.numberOfNegativePictures); @@ -218,7 +308,7 @@ } } -void DPB::computeRPS(int curPoc, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer) +void DPB::computeRPS(int curPoc, int tempId, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer) { unsigned int poci = 0, numNeg = 0, numPos = 0; @@ -228,7 +318,7 @@ { if ((iterPic->m_poc != curPoc) && iterPic->m_encData->m_bHasReferences) { - if ((m_lastIDR >= curPoc) || (m_lastIDR <= iterPic->m_poc)) + if ((!m_bTemporalSublayer || (iterPic->m_tempLayer <= tempId)) && ((m_lastIDR >= curPoc) || (m_lastIDR <= iterPic->m_poc))) { rps->pocpoci = iterPic->m_poc; rps->deltaPOCpoci = rps->pocpoci - curPoc; @@ -247,6 +337,18 @@ rps->sortDeltaPOC(); } +bool DPB::getTemporalLayerNonReferenceFlag() +{ + Frame* curFrame = m_picList.first(); + if (curFrame->m_encData->m_bHasReferences) + { + curFrame->m_sameLayerRefPic = true; + return false; + } + else + return true; +} + /* Marking reference pictures when an IDR/CRA is encountered. 
*/ void DPB::decodingRefreshMarking(int pocCurr, NalUnitType nalUnitType) { @@ -296,7 +398,7 @@ } /** Function for applying picture marking based on the Reference Picture Set */ -void DPB::applyReferencePictureSet(RPS *rps, int curPoc) +void DPB::applyReferencePictureSet(RPS *rps, int curPoc, int tempId, bool isTSAPicture) { // loop through all pictures in the reference picture buffer Frame* iterFrame = m_picList.first(); @@ -317,9 +419,68 @@ } if (!referenced) iterFrame->m_encData->m_bHasReferences = false; + + if (m_bTemporalSublayer) + { + //check that pictures of higher temporal layers are not used + assert(referenced == 0 || iterFrame->m_encData->m_bHasReferences == false || iterFrame->m_tempLayer <= tempId); + + //check that pictures of higher or equal temporal layer are not in the RPS if the current picture is a TSA picture + if (isTSAPicture) + { + assert(referenced == 0 || iterFrame->m_tempLayer < tempId); + } + //check that pictures marked as temporal layer non-reference pictures are not used for reference + if (iterFrame->m_tempLayer == tempId) + { + assert(referenced == 0 || iterFrame->m_sameLayerRefPic == true); + } + } + } + iterFrame = iterFrame->m_next; + } +} + +bool DPB::isTemporalLayerSwitchingPoint(int curPoc, int tempId) +{ + // loop through all pictures in the reference picture buffer + Frame* iterFrame = m_picList.first(); + while (iterFrame) + { + if (iterFrame->m_poc != curPoc && iterFrame->m_encData->m_bHasReferences) + { + if (iterFrame->m_tempLayer >= tempId) + { + return false; + } + } + iterFrame = iterFrame->m_next; + } + return true; +} + +bool DPB::isStepwiseTemporalLayerSwitchingPoint(RPS *rps, int curPoc, int tempId) +{ + // loop through all pictures in the reference picture buffer + Frame* iterFrame = m_picList.first(); + while (iterFrame) + { + if (iterFrame->m_poc != curPoc && iterFrame->m_encData->m_bHasReferences) + { + for (int i = 0; i < rps->numberOfPositivePictures + rps->numberOfNegativePictures; i++) + { + if ((iterFrame->m_poc == curPoc + rps->deltaPOCi) && rps->bUsedi) + { + if (iterFrame->m_tempLayer >= tempId) + { + return false; + } + } + } } iterFrame = iterFrame->m_next; } + return true; } /* deciding the nal_unit_type */ @@ -328,7 +489,7 @@ if (!curPOC) return NAL_UNIT_CODED_SLICE_IDR_N_LP; if (bIsKeyFrame) - return m_bOpenGOP ? NAL_UNIT_CODED_SLICE_CRA : m_bhasLeadingPicture ? NAL_UNIT_CODED_SLICE_IDR_W_RADL : NAL_UNIT_CODED_SLICE_IDR_N_LP; + return (m_bOpenGOP || m_craNal) ? NAL_UNIT_CODED_SLICE_CRA : m_bhasLeadingPicture ? NAL_UNIT_CODED_SLICE_IDR_W_RADL : NAL_UNIT_CODED_SLICE_IDR_N_LP; if (m_pocCRA && curPOC < m_pocCRA) // All leading pictures are being marked as TFD pictures here since // current encoder uses all reference pictures while encoding leading
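The temporal-sublayer marking added above enforces the HEVC rule that a picture at temporal layer T may only reference pictures at layers <= T, and that a TSA picture must not reference its own or a higher layer (the asserts in applyReferencePictureSet express this). A small standalone check stating the same rule (simplified types, not the DPB's data structures):

#include <vector>
#include <cstdio>

struct RefPic { int poc; int tempLayer; };

// Return true if the reference list is legal for a picture at layer tempId.
// For TSA pictures no reference may sit on the same or a higher layer.
static bool refsObeyTemporalLayer(const std::vector<RefPic>& refs,
                                  int tempId, bool isTSA)
{
    for (const RefPic& r : refs)
    {
        if (r.tempLayer > tempId)
            return false;
        if (isTSA && r.tempLayer >= tempId)
            return false;
    }
    return true;
}

int main()
{
    std::vector<RefPic> refs = { { 16, 0 }, { 18, 1 } };
    std::printf("layer-2 trail ok: %d\n", refsObeyTemporalLayer(refs, 2, false));
    std::printf("layer-1 TSA   ok: %d\n", refsObeyTemporalLayer(refs, 1, true));
    return 0;
}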
View file
x265_3.5.tar.gz/source/encoder/dpb.h -> x265_3.6.tar.gz/source/encoder/dpb.h
Changed
@@ -40,6 +40,7 @@ int m_lastIDR; int m_pocCRA; int m_bOpenGOP; + int m_craNal; int m_bhasLeadingPicture; bool m_bRefreshPending; bool m_bTemporalSublayer; @@ -66,7 +67,8 @@ m_bRefreshPending = false; m_frameDataFreeList = NULL; m_bOpenGOP = param->bOpenGOP; - m_bTemporalSublayer = !!param->bEnableTemporalSubLayers; + m_craNal = param->craNal; + m_bTemporalSublayer = (param->bEnableTemporalSubLayers > 2); } ~DPB(); @@ -77,10 +79,13 @@ protected: - void computeRPS(int curPoc, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer); + void computeRPS(int curPoc,int tempId, bool isRAP, RPS * rps, unsigned int maxDecPicBuffer); - void applyReferencePictureSet(RPS *rps, int curPoc); + void applyReferencePictureSet(RPS *rps, int curPoc, int tempId, bool isTSAPicture); + bool getTemporalLayerNonReferenceFlag(); void decodingRefreshMarking(int pocCurr, NalUnitType nalUnitType); + bool isTemporalLayerSwitchingPoint(int curPoc, int tempId); + bool isStepwiseTemporalLayerSwitchingPoint(RPS *rps, int curPoc, int tempId); NalUnitType getNalUnitType(int curPoc, bool bIsKeyFrame); };
View file
x265_3.5.tar.gz/source/encoder/encoder.cpp -> x265_3.6.tar.gz/source/encoder/encoder.cpp
Changed
@@ -72,7 +72,40 @@ { { 1, 1, 1, 1, 1, 5, 1, 2, 2, 2, 50 }, { 1, 1, 1, 1, 1, 5, 0, 16, 9, 9, 81 }, - { 1, 1, 1, 1, 1, 5, 0, 1, 1, 1, 82 } + { 1, 1, 1, 1, 1, 5, 0, 1, 1, 1, 82 }, + { 1, 1, 1, 1, 1, 5, 0, 18, 9, 9, 84 } +}; + +typedef struct +{ + int bEnableVideoSignalTypePresentFlag; + int bEnableColorDescriptionPresentFlag; + int bEnableChromaLocInfoPresentFlag; + int colorPrimaries; + int transferCharacteristics; + int matrixCoeffs; + int bEnableVideoFullRangeFlag; + int chromaSampleLocTypeTopField; + int chromaSampleLocTypeBottomField; + const char* systemId; +}VideoSignalTypePresets; + +VideoSignalTypePresets vstPresets = +{ + {1, 1, 1, 6, 6, 6, 0, 0, 0, "BT601_525"}, + {1, 1, 1, 5, 6, 5, 0, 0, 0, "BT601_626"}, + {1, 1, 1, 1, 1, 1, 0, 0, 0, "BT709_YCC"}, + {1, 1, 0, 1, 1, 0, 0, 0, 0, "BT709_RGB"}, + {1, 1, 1, 9, 14, 1, 0, 2, 2, "BT2020_YCC_NCL"}, + {1, 1, 0, 9, 16, 9, 0, 0, 0, "BT2020_RGB"}, + {1, 1, 1, 9, 16, 9, 0, 2, 2, "BT2100_PQ_YCC"}, + {1, 1, 1, 9, 16, 14, 0, 2, 2, "BT2100_PQ_ICTCP"}, + {1, 1, 0, 9, 16, 0, 0, 0, 0, "BT2100_PQ_RGB"}, + {1, 1, 1, 9, 18, 9, 0, 2, 2, "BT2100_HLG_YCC"}, + {1, 1, 0, 9, 18, 0, 0, 0, 0, "BT2100_HLG_RGB"}, + {1, 1, 0, 1, 1, 0, 1, 0, 0, "FR709_RGB"}, + {1, 1, 0, 9, 14, 0, 1, 0, 0, "FR2020_RGB"}, + {1, 1, 1, 12, 1, 6, 1, 1, 1, "FRP3D65_YCC"} }; } @@ -109,6 +142,7 @@ m_threadPool = NULL; m_analysisFileIn = NULL; m_analysisFileOut = NULL; + m_filmGrainIn = NULL; m_naluFile = NULL; m_offsetEmergency = NULL; m_iFrameNum = 0; @@ -134,12 +168,8 @@ m_prevTonemapPayload.payload = NULL; m_startPoint = 0; m_saveCTUSize = 0; - m_edgePic = NULL; - m_edgeHistThreshold = 0; - m_chromaHistThreshold = 0.0; - m_scaledEdgeThreshold = 0.0; - m_scaledChromaThreshold = 0.0; m_zoneIndex = 0; + m_origPicBuffer = 0; } inline char *strcatFilename(const char *input, const char *suffix) @@ -216,34 +246,6 @@ } } - if (m_param->bHistBasedSceneCut) - { - m_planeSizes0 = (m_param->sourceWidth >> x265_cli_cspsp->internalCsp.width0) * (m_param->sourceHeight >> x265_cli_cspsm_param->internalCsp.height0); - uint32_t pixelbytes = m_param->internalBitDepth > 8 ? 
2 : 1; - m_edgePic = X265_MALLOC(pixel, m_planeSizes0 * pixelbytes); - m_edgeHistThreshold = m_param->edgeTransitionThreshold; - m_chromaHistThreshold = x265_min(m_edgeHistThreshold * 10.0, MAX_SCENECUT_THRESHOLD); - m_scaledEdgeThreshold = x265_min(m_edgeHistThreshold * SCENECUT_STRENGTH_FACTOR, MAX_SCENECUT_THRESHOLD); - m_scaledChromaThreshold = x265_min(m_chromaHistThreshold * SCENECUT_STRENGTH_FACTOR, MAX_SCENECUT_THRESHOLD); - if (m_param->sourceBitDepth != m_param->internalBitDepth) - { - int size = m_param->sourceWidth * m_param->sourceHeight; - int hshift = CHROMA_H_SHIFT(m_param->internalCsp); - int vshift = CHROMA_V_SHIFT(m_param->internalCsp); - int widthC = m_param->sourceWidth >> hshift; - int heightC = m_param->sourceHeight >> vshift; - - m_inputPic0 = X265_MALLOC(pixel, size); - if (m_param->internalCsp != X265_CSP_I400) - { - for (int j = 1; j < 3; j++) - { - m_inputPicj = X265_MALLOC(pixel, widthC * heightC); - } - } - } - } - // Do not allow WPP if only one row or fewer than 3 columns, it is pointless and unstable if (rows == 1 || cols < 3) { @@ -357,6 +359,10 @@ lookAheadThreadPooli.start(); m_lookahead->m_numPools = pools; m_dpb = new DPB(m_param); + + if (m_param->bEnableTemporalFilter) + m_origPicBuffer = new OrigPicBuffer(); + m_rateControl = new RateControl(*m_param, this); if (!m_param->bResetZoneConfig) { @@ -518,6 +524,15 @@ } } } + if (m_param->filmGrain) + { + m_filmGrainIn = x265_fopen(m_param->filmGrain, "rb"); + if (!m_filmGrainIn) + { + x265_log_file(NULL, X265_LOG_ERROR, "Failed to open film grain characteristics binary file %s\n", m_param->filmGrain); + } + } + m_bZeroLatency = !m_param->bframes && !m_param->lookaheadDepth && m_param->frameNumThreads == 1 && m_param->maxSlices == 1; m_aborted |= parseLambdaFile(m_param); @@ -879,26 +894,6 @@ } } - if (m_param->bHistBasedSceneCut) - { - if (m_edgePic != NULL) - { - X265_FREE_ZERO(m_edgePic); - } - - if (m_param->sourceBitDepth != m_param->internalBitDepth) - { - X265_FREE_ZERO(m_inputPic0); - if (m_param->internalCsp != X265_CSP_I400) - { - for (int i = 1; i < 3; i++) - { - X265_FREE_ZERO(m_inputPici); - } - } - } - } - for (int i = 0; i < m_param->frameNumThreads; i++) { if (m_frameEncoderi) @@ -924,6 +919,10 @@ delete zoneReadCount; delete zoneWriteCount; } + + if (m_param->bEnableTemporalFilter) + delete m_origPicBuffer; + if (m_rateControl) { m_rateControl->destroy(); @@ -963,6 +962,8 @@ } if (m_naluFile) fclose(m_naluFile); + if (m_filmGrainIn) + x265_fclose(m_filmGrainIn); #ifdef SVT_HEVC X265_FREE(m_svtAppData); @@ -974,6 +975,7 @@ /* release string arguments that were strdup'd */ free((char*)m_param->rc.lambdaFileName); free((char*)m_param->rc.statFileName); + free((char*)m_param->rc.sharedMemName); free((char*)m_param->analysisReuseFileName); free((char*)m_param->scalingLists); free((char*)m_param->csvfn); @@ -982,6 +984,7 @@ free((char*)m_param->toneMapFile); free((char*)m_param->analysisSave); free((char*)m_param->analysisLoad); + free((char*)m_param->videoSignalTypePreset); PARAM_NS::x265_param_free(m_param); } } @@ -1358,215 +1361,90 @@ dest->planes2 = (char*)dest->planes1 + src->stride1 * (src->height >> x265_cli_cspssrc->colorSpace.height1); } -bool Encoder::computeHistograms(x265_picture *pic) +bool Encoder::isFilterThisframe(uint8_t sliceTypeConfig, int curSliceType) { - pixel *src = NULL, *planeV = NULL, *planeU = NULL; - uint32_t widthC, heightC; - int hshift, vshift; - - hshift = CHROMA_H_SHIFT(pic->colorSpace); - vshift = CHROMA_V_SHIFT(pic->colorSpace); - widthC = pic->width >> 
hshift; - heightC = pic->height >> vshift; - - if (pic->bitDepth == X265_DEPTH) + uint8_t newSliceType = 0; + switch (curSliceType) { - src = (pixel*)pic->planes0; - if (m_param->internalCsp != X265_CSP_I400) - { - planeU = (pixel*)pic->planes1; - planeV = (pixel*)pic->planes2; - } - } - else if (pic->bitDepth == 8 && X265_DEPTH > 8) - { - int shift = (X265_DEPTH - 8); - uint8_t *yChar, *uChar, *vChar; - - yChar = (uint8_t*)pic->planes0; - primitives.planecopy_cp(yChar, pic->stride0 / sizeof(*yChar), m_inputPic0, pic->stride0 / sizeof(*yChar), pic->width, pic->height, shift); - src = m_inputPic0; - if (m_param->internalCsp != X265_CSP_I400) - { - uChar = (uint8_t*)pic->planes1; - vChar = (uint8_t*)pic->planes2; - primitives.planecopy_cp(uChar, pic->stride1 / sizeof(*uChar), m_inputPic1, pic->stride1 / sizeof(*uChar), widthC, heightC, shift); - primitives.planecopy_cp(vChar, pic->stride2 / sizeof(*vChar), m_inputPic2, pic->stride2 / sizeof(*vChar), widthC, heightC, shift); - planeU = m_inputPic1; - planeV = m_inputPic2; - } - } - else - { - uint16_t *yShort, *uShort, *vShort; - /* mask off bits that are supposed to be zero */ - uint16_t mask = (1 << X265_DEPTH) - 1; - int shift = abs(pic->bitDepth - X265_DEPTH); - - yShort = (uint16_t*)pic->planes0; - uShort = (uint16_t*)pic->planes1; - vShort = (uint16_t*)pic->planes2; - - if (pic->bitDepth > X265_DEPTH) - { - /* shift right and mask pixels to final size */ - primitives.planecopy_sp(yShort, pic->stride0 / sizeof(*yShort), m_inputPic0, pic->stride0 / sizeof(*yShort), pic->width, pic->height, shift, mask); - if (m_param->internalCsp != X265_CSP_I400) - { - primitives.planecopy_sp(uShort, pic->stride1 / sizeof(*uShort), m_inputPic1, pic->stride1 / sizeof(*uShort), widthC, heightC, shift, mask); - primitives.planecopy_sp(vShort, pic->stride2 / sizeof(*vShort), m_inputPic2, pic->stride2 / sizeof(*vShort), widthC, heightC, shift, mask); - } - } - else /* Case for (pic.bitDepth < X265_DEPTH) */ - { - /* shift left and mask pixels to final size */ - primitives.planecopy_sp_shl(yShort, pic->stride0 / sizeof(*yShort), m_inputPic0, pic->stride0 / sizeof(*yShort), pic->width, pic->height, shift, mask); - if (m_param->internalCsp != X265_CSP_I400) - { - primitives.planecopy_sp_shl(uShort, pic->stride1 / sizeof(*uShort), m_inputPic1, pic->stride1 / sizeof(*uShort), widthC, heightC, shift, mask); - primitives.planecopy_sp_shl(vShort, pic->stride2 / sizeof(*vShort), m_inputPic2, pic->stride2 / sizeof(*vShort), widthC, heightC, shift, mask); - } - } - - src = m_inputPic0; - planeU = m_inputPic1; - planeV = m_inputPic2; - } - - size_t bufSize = sizeof(pixel) * m_planeSizes0; - memset(m_edgePic, 0, bufSize); - - if (!computeEdge(m_edgePic, src, NULL, pic->width, pic->height, pic->width, false, 1)) - { - x265_log(m_param, X265_LOG_ERROR, "Failed to compute edge!"); - return false; - } - - pixel pixelVal; - int32_t *edgeHist = m_curEdgeHist; - memset(edgeHist, 0, EDGE_BINS * sizeof(int32_t)); - for (uint32_t i = 0; i < m_planeSizes0; i++) - { - if (m_edgePici) - edgeHist1++; - else - edgeHist0++; - } - - /* Y Histogram Calculation */ - int32_t *yHist = m_curYUVHist0; - memset(yHist, 0, HISTOGRAM_BINS * sizeof(int32_t)); - for (uint32_t i = 0; i < m_planeSizes0; i++) - { - pixelVal = srci; - yHistpixelVal++; + case 1: newSliceType |= 1 << 0; + break; + case 2: newSliceType |= 1 << 0; + break; + case 3: newSliceType |= 1 << 1; + break; + case 4: newSliceType |= 1 << 2; + break; + case 5: newSliceType |= 1 << 3; + break; + default: return 0; } + return 
((sliceTypeConfig & newSliceType) != 0); +} - if (pic->colorSpace != X265_CSP_I400) - { - /* U Histogram Calculation */ - int32_t *uHist = m_curYUVHist1; - memset(uHist, 0, sizeof(m_curYUVHist1)); - for (uint32_t i = 0; i < m_planeSizes1; i++) - { - pixelVal = planeUi; - uHistpixelVal++; - } +inline int enqueueRefFrame(FrameEncoder* curframeEncoder, Frame* iterFrame, Frame* curFrame, bool isPreFiltered, int16_t i) +{ + TemporalFilterRefPicInfo* dest = &curframeEncoder->m_mcstfRefListcurFrame->m_mcstf->m_numRef; + dest->picBuffer = iterFrame->m_fencPic; + dest->picBufferSubSampled2 = iterFrame->m_fencPicSubsampled2; + dest->picBufferSubSampled4 = iterFrame->m_fencPicSubsampled4; + dest->isFilteredFrame = isPreFiltered; + dest->isSubsampled = iterFrame->m_isSubSampled; + dest->origOffset = i; + curFrame->m_mcstf->m_numRef++; - /* V Histogram Calculation */ - pixelVal = 0; - int32_t *vHist = m_curYUVHist2; - memset(vHist, 0, sizeof(m_curYUVHist2)); - for (uint32_t i = 0; i < m_planeSizes2; i++) - { - pixelVal = planeVi; - vHistpixelVal++; - } - } - return true; + return 1; } -void Encoder::computeHistogramSAD(double *normalizedMaxUVSad, double *normalizedEdgeSad, int curPoc) +bool Encoder::generateMcstfRef(Frame* frameEnc, FrameEncoder* currEncoder) { + frameEnc->m_mcstf->m_numRef = 0; - if (curPoc == 0) - { /* first frame is scenecut by default no sad computation for the same. */ - *normalizedMaxUVSad = 0.0; - *normalizedEdgeSad = 0.0; - } - else + for (int iterPOC = (frameEnc->m_poc - frameEnc->m_mcstf->m_range); + iterPOC <= (frameEnc->m_poc + frameEnc->m_mcstf->m_range); iterPOC++) { - /* compute sum of absolute differences of histogram bins of chroma and luma edge response between the current and prev pictures. */ - int32_t edgeHistSad = 0; - int32_t uHistSad = 0; - int32_t vHistSad = 0; - double normalizedUSad = 0.0; - double normalizedVSad = 0.0; - - for (int j = 0; j < HISTOGRAM_BINS; j++) + bool isFound = false; + if (iterPOC != frameEnc->m_poc) { - if (j < 2) + //search for the reference frame in the Original Picture Buffer + if (!isFound) { - edgeHistSad += abs(m_curEdgeHistj - m_prevEdgeHistj); - } - uHistSad += abs(m_curYUVHist1j - m_prevYUVHist1j); - vHistSad += abs(m_curYUVHist2j - m_prevYUVHist2j); - } - *normalizedEdgeSad = normalizeRange(edgeHistSad, 0, 2 * m_planeSizes0, 0.0, 1.0); - normalizedUSad = normalizeRange(uHistSad, 0, 2 * m_planeSizes1, 0.0, 1.0); - normalizedVSad = normalizeRange(vHistSad, 0, 2 * m_planeSizes2, 0.0, 1.0); - *normalizedMaxUVSad = x265_max(normalizedUSad, normalizedVSad); - } - - /* store histograms of previous frame for reference */ - memcpy(m_prevEdgeHist, m_curEdgeHist, sizeof(m_curEdgeHist)); - memcpy(m_prevYUVHist, m_curYUVHist, sizeof(m_curYUVHist)); -} + for (int j = 0; j < (2 * frameEnc->m_mcstf->m_range); j++) + { + if (iterPOC < 0) + continue; + if (iterPOC >= m_pocLast) + { -double Encoder::normalizeRange(int32_t value, int32_t minValue, int32_t maxValue, double rangeStart, double rangeEnd) -{ - return (double)(value - minValue) * (rangeEnd - rangeStart) / (maxValue - minValue) + rangeStart; -} + TemporalFilter* mcstf = frameEnc->m_mcstf; + while (mcstf->m_numRef) + { + memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs0, 0, sizeof(MV) * ((mcstf->m_sourceWidth / 16) * (mcstf->m_sourceHeight / 16))); + memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs1, 0, sizeof(MV) * ((mcstf->m_sourceWidth / 16) * (mcstf->m_sourceHeight / 16))); + memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs2, 0, sizeof(MV) * ((mcstf->m_sourceWidth / 16) 
* (mcstf->m_sourceHeight / 16))); + memset(currEncoder->m_mcstfRefListmcstf->m_numRef.mvs, 0, sizeof(MV) * ((mcstf->m_sourceWidth / 4) * (mcstf->m_sourceHeight / 4))); + memset(currEncoder->m_mcstfRefListmcstf->m_numRef.noise, 0, sizeof(int) * ((mcstf->m_sourceWidth / 4) * (mcstf->m_sourceHeight / 4))); + memset(currEncoder->m_mcstfRefListmcstf->m_numRef.error, 0, sizeof(int) * ((mcstf->m_sourceWidth / 4) * (mcstf->m_sourceHeight / 4))); -void Encoder::findSceneCuts(x265_picture *pic, bool& bDup, double maxUVSad, double edgeSad, bool& isMaxThres, bool& isHardSC) -{ - double minEdgeT = m_edgeHistThreshold * MIN_EDGE_FACTOR; - double minChromaT = minEdgeT * SCENECUT_CHROMA_FACTOR; - double maxEdgeT = m_edgeHistThreshold * MAX_EDGE_FACTOR; - double maxChromaT = maxEdgeT * SCENECUT_CHROMA_FACTOR; - pic->frameData.bScenecut = false; + mcstf->m_numRef--; + } - if (pic->poc == 0) - { - /* for first frame */ - pic->frameData.bScenecut = false; - bDup = false; - } - else - { - if (edgeSad == 0.0 && maxUVSad == 0.0) - { - bDup = true; - } - else if (edgeSad < minEdgeT && maxUVSad < minChromaT) - { - pic->frameData.bScenecut = false; - } - else if (edgeSad > maxEdgeT && maxUVSad > maxChromaT) - { - pic->frameData.bScenecut = true; - isMaxThres = true; - isHardSC = true; - } - else if (edgeSad > m_scaledEdgeThreshold || maxUVSad >= m_scaledChromaThreshold - || (edgeSad > m_edgeHistThreshold && maxUVSad >= m_chromaHistThreshold)) - { - pic->frameData.bScenecut = true; - bDup = false; - if (edgeSad > m_scaledEdgeThreshold || maxUVSad >= m_scaledChromaThreshold) - isHardSC = true; + break; + } + Frame* iterFrame = frameEnc->m_encData->m_slice->m_mcstfRefFrameList1j; + if (iterFrame->m_poc == iterPOC) + { + if (!enqueueRefFrame(currEncoder, iterFrame, frameEnc, false, (int16_t)(iterPOC - frameEnc->m_poc))) + { + return false; + }; + break; + } + } + } } } + + return true; } /** @@ -1595,40 +1473,24 @@ const x265_picture* inputPic = NULL; static int written = 0, read = 0; bool dontRead = false; - bool bdropFrame = false; bool dropflag = false; - bool isMaxThres = false; - bool isHardSC = false; if (m_exportedPic) { if (!m_param->bUseAnalysisFile && m_param->analysisSave) x265_free_analysis_data(m_param, &m_exportedPic->m_analysisData); + ATOMIC_DEC(&m_exportedPic->m_countRefEncoders); + m_exportedPic = NULL; m_dpb->recycleUnreferenced(); + + if (m_param->bEnableTemporalFilter) + m_origPicBuffer->recycleOrigPicList(); } + if ((pic_in && (!m_param->chunkEnd || (m_encodedFrameNum < m_param->chunkEnd))) || (m_param->bEnableFrameDuplication && !pic_in && (read < written))) { - if (m_param->bHistBasedSceneCut && pic_in) - { - x265_picture *pic = (x265_picture *) pic_in; - - if (pic->poc == 0) - { - /* for entire encode compute the chroma plane sizes only once */ - for (int i = 1; i < x265_cli_cspsm_param->internalCsp.planes; i++) - m_planeSizesi = (pic->width >> x265_cli_cspsm_param->internalCsp.widthi) * (pic->height >> x265_cli_cspsm_param->internalCsp.heighti); - } - - if (computeHistograms(pic)) - { - double maxUVSad = 0.0, edgeSad = 0.0; - computeHistogramSAD(&maxUVSad, &edgeSad, pic_in->poc); - findSceneCuts(pic, bdropFrame, maxUVSad, edgeSad, isMaxThres, isHardSC); - } - } - if ((m_param->bEnableFrameDuplication && !pic_in && (read < written))) dontRead = true; else @@ -1672,20 +1534,7 @@ written++; } - if (m_param->bEnableFrameDuplication && m_param->bHistBasedSceneCut) - { - if (!bdropFrame && m_dupBuffer1->dupPic->frameData.bScenecut == false) - { - psnrWeight = ComputePSNR(m_dupBuffer0->dupPic, 
m_dupBuffer1->dupPic, m_param); - if (psnrWeight >= m_param->dupThreshold) - dropflag = true; - } - else - { - dropflag = true; - } - } - else if (m_param->bEnableFrameDuplication) + if (m_param->bEnableFrameDuplication) { psnrWeight = ComputePSNR(m_dupBuffer0->dupPic, m_dupBuffer1->dupPic, m_param); if (psnrWeight >= m_param->dupThreshold) @@ -1768,12 +1617,6 @@ } } } - if (m_param->recursionSkipMode == EDGE_BASED_RSKIP && m_param->bHistBasedSceneCut) - { - pixel* src = m_edgePic; - primitives.planecopy_pp_shr(src, inFrame->m_fencPic->m_picWidth, inFrame->m_edgeBitPic, inFrame->m_fencPic->m_stride, - inFrame->m_fencPic->m_picWidth, inFrame->m_fencPic->m_picHeight, 0); - } } else { @@ -1794,6 +1637,8 @@ inFrame->m_lowres.satdCost = (int64_t)-1; inFrame->m_lowresInit = false; inFrame->m_isInsideWindow = 0; + inFrame->m_tempLayer = 0; + inFrame->m_sameLayerRefPic = 0; } /* Copy input picture into a Frame and PicYuv, send to lookahead */ @@ -1802,13 +1647,6 @@ inFrame->m_poc = ++m_pocLast; inFrame->m_userData = inputPic->userData; inFrame->m_pts = inputPic->pts; - if (m_param->bHistBasedSceneCut) - { - inFrame->m_lowres.bScenecut = (inputPic->frameData.bScenecut == 1) ? true : false; - inFrame->m_lowres.m_bIsMaxThres = isMaxThres; - if (m_param->radl && m_param->keyframeMax != m_param->keyframeMin) - inFrame->m_lowres.m_bIsHardScenecut = isHardSC; - } if ((m_param->bEnableSceneCutAwareQp & BACKWARD) && m_param->rc.bStatRead) { @@ -1816,7 +1654,7 @@ rcEntry = &(m_rateControl->m_rce2PassinFrame->m_poc); if(rcEntry->scenecut) { - int backwardWindow = X265_MIN(int((m_param->bwdScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom)), p->lookaheadDepth); + int backwardWindow = X265_MIN(int((m_param->bwdMaxScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom)), p->lookaheadDepth); for (int i = 1; i <= backwardWindow; i++) { int frameNum = inFrame->m_poc - i; @@ -1826,16 +1664,7 @@ } } } - if (m_param->bHistBasedSceneCut && m_param->analysisSave) - { - memcpy(inFrame->m_analysisData.edgeHist, m_curEdgeHist, EDGE_BINS * sizeof(int32_t)); - memcpy(inFrame->m_analysisData.yuvHist0, m_curYUVHist0, HISTOGRAM_BINS *sizeof(int32_t)); - if (inputPic->colorSpace != X265_CSP_I400) - { - memcpy(inFrame->m_analysisData.yuvHist1, m_curYUVHist1, HISTOGRAM_BINS * sizeof(int32_t)); - memcpy(inFrame->m_analysisData.yuvHist2, m_curYUVHist2, HISTOGRAM_BINS * sizeof(int32_t)); - } - } + inFrame->m_forceqp = inputPic->forceqp; inFrame->m_param = (m_reconfigure || m_reconfigureRc) ? m_latestParam : m_param; inFrame->m_picStruct = inputPic->picStruct; @@ -1881,7 +1710,8 @@ } /* Use the frame types from the first pass, if available */ - int sliceType = (m_param->rc.bStatRead) ? m_rateControl->rateControlSliceType(inFrame->m_poc) : inputPic->sliceType; + int sliceType = (m_param->rc.bStatRead) ? 
m_rateControl->rateControlSliceType(inFrame->m_poc) : X265_TYPE_AUTO; + inFrame->m_lowres.sliceTypeReq = inputPic->sliceType; /* In analysisSave mode, x265_analysis_data is allocated in inputPic and inFrame points to this */ /* Load analysis data before lookahead->addPicture, since sliceType has been decided */ @@ -1977,6 +1807,59 @@ if (m_reconfigureRc) inFrame->m_reconfigureRc = true; + if (m_param->bEnableTemporalFilter) + { + if (!m_pocLast) + { + /*One shot allocation of frames in OriginalPictureBuffer*/ + int numFramesinOPB = X265_MAX(m_param->bframes, (inFrame->m_mcstf->m_range << 1)) + 1; + for (int i = 0; i < numFramesinOPB; i++) + { + Frame* dupFrame = new Frame; + if (!(dupFrame->create(m_param, pic_in->quantOffsets))) + { + m_aborted = true; + x265_log(m_param, X265_LOG_ERROR, "Memory allocation failure, aborting encode\n"); + fflush(stderr); + dupFrame->destroy(); + delete dupFrame; + return -1; + } + else + { + if (m_sps.cuOffsetY) + { + dupFrame->m_fencPic->m_cuOffsetC = m_sps.cuOffsetC; + dupFrame->m_fencPic->m_buOffsetC = m_sps.buOffsetC; + dupFrame->m_fencPic->m_cuOffsetY = m_sps.cuOffsetY; + dupFrame->m_fencPic->m_buOffsetY = m_sps.buOffsetY; + if (m_param->internalCsp != X265_CSP_I400) + { + dupFrame->m_fencPic->m_cuOffsetC = m_sps.cuOffsetC; + dupFrame->m_fencPic->m_buOffsetC = m_sps.buOffsetC; + } + m_origPicBuffer->addEncPicture(dupFrame); + } + } + } + } + + inFrame->m_refPicCnt1 = 2 * inFrame->m_mcstf->m_range + 1; + if (inFrame->m_poc < inFrame->m_mcstf->m_range) + inFrame->m_refPicCnt1 -= (uint8_t)(inFrame->m_mcstf->m_range - inFrame->m_poc); + if (m_param->totalFrames && (inFrame->m_poc >= (m_param->totalFrames - inFrame->m_mcstf->m_range))) + inFrame->m_refPicCnt1 -= (uint8_t)(inFrame->m_poc + inFrame->m_mcstf->m_range - m_param->totalFrames + 1); + + //Extend full-res original picture border + PicYuv *orig = inFrame->m_fencPic; + extendPicBorder(orig->m_picOrg0, orig->m_stride, orig->m_picWidth, orig->m_picHeight, orig->m_lumaMarginX, orig->m_lumaMarginY); + extendPicBorder(orig->m_picOrg1, orig->m_strideC, orig->m_picWidth >> orig->m_hChromaShift, orig->m_picHeight >> orig->m_vChromaShift, orig->m_chromaMarginX, orig->m_chromaMarginY); + extendPicBorder(orig->m_picOrg2, orig->m_strideC, orig->m_picWidth >> orig->m_hChromaShift, orig->m_picHeight >> orig->m_vChromaShift, orig->m_chromaMarginX, orig->m_chromaMarginY); + + //TODO: Add subsampling here if required + m_origPicBuffer->addPicture(inFrame); + } + m_lookahead->addPicture(*inFrame, sliceType); m_numDelayedPic++; } @@ -2019,6 +1902,7 @@ pic_out->bitDepth = X265_DEPTH; pic_out->userData = outFrame->m_userData; pic_out->colorSpace = m_param->internalCsp; + pic_out->frameData.tLayer = outFrame->m_tempLayer; frameData = &(pic_out->frameData); pic_out->pts = outFrame->m_pts; @@ -2041,16 +1925,6 @@ pic_out->analysisData.poc = pic_out->poc; pic_out->analysisData.sliceType = pic_out->sliceType; pic_out->analysisData.bScenecut = outFrame->m_lowres.bScenecut; - if (m_param->bHistBasedSceneCut) - { - memcpy(pic_out->analysisData.edgeHist, outFrame->m_analysisData.edgeHist, EDGE_BINS * sizeof(int32_t)); - memcpy(pic_out->analysisData.yuvHist0, outFrame->m_analysisData.yuvHist0, HISTOGRAM_BINS * sizeof(int32_t)); - if (pic_out->colorSpace != X265_CSP_I400) - { - memcpy(pic_out->analysisData.yuvHist1, outFrame->m_analysisData.yuvHist1, HISTOGRAM_BINS * sizeof(int32_t)); - memcpy(pic_out->analysisData.yuvHist2, outFrame->m_analysisData.yuvHist2, HISTOGRAM_BINS * sizeof(int32_t)); - } - } pic_out->analysisData.satdCost 
= outFrame->m_lowres.satdCost; pic_out->analysisData.numCUsInFrame = outFrame->m_analysisData.numCUsInFrame; pic_out->analysisData.numPartitions = outFrame->m_analysisData.numPartitions; @@ -2198,7 +2072,7 @@ if (m_rateControl->writeRateControlFrameStats(outFrame, &curEncoder->m_rce)) m_aborted = true; if (pic_out) - { + { /* m_rcData is allocated for every frame */ pic_out->rcData = outFrame->m_rcData; outFrame->m_rcData->qpaRc = outFrame->m_encData->m_avgQpRc; @@ -2216,6 +2090,18 @@ outFrame->m_rcData->iCuCount = outFrame->m_encData->m_frameStats.percent8x8Intra * m_rateControl->m_ncu; outFrame->m_rcData->pCuCount = outFrame->m_encData->m_frameStats.percent8x8Inter * m_rateControl->m_ncu; outFrame->m_rcData->skipCuCount = outFrame->m_encData->m_frameStats.percent8x8Skip * m_rateControl->m_ncu; + outFrame->m_rcData->currentSatd = curEncoder->m_rce.coeffBits; + } + + if (m_param->bEnableTemporalFilter) + { + Frame *curFrame = m_origPicBuffer->m_mcstfPicList.getPOCMCSTF(outFrame->m_poc); + X265_CHECK(curFrame, "Outframe not found in DPB's mcstfPicList"); + curFrame->m_refPicCnt0--; + curFrame->m_refPicCnt1--; + curFrame = m_origPicBuffer->m_mcstfOrigPicList.getPOCMCSTF(outFrame->m_poc); + X265_CHECK(curFrame, "Outframe not found in OPB's mcstfOrigPicList"); + curFrame->m_refPicCnt1--; } /* Allow this frame to be recycled if no frame encoders are using it for reference */ @@ -2223,6 +2109,8 @@ { ATOMIC_DEC(&outFrame->m_countRefEncoders); m_dpb->recycleUnreferenced(); + if (m_param->bEnableTemporalFilter) + m_origPicBuffer->recycleOrigPicList(); } else m_exportedPic = outFrame; @@ -2253,7 +2141,7 @@ m_rateControl->m_lastScenecut = frameEnc->m_poc; else { - int maxWindowSize = int((m_param->fwdScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5); + int maxWindowSize = int((m_param->fwdMaxScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5); if (frameEnc->m_poc > (m_rateControl->m_lastScenecut + maxWindowSize)) m_rateControl->m_lastScenecut = frameEnc->m_poc; } @@ -2422,8 +2310,36 @@ analysis->numPartitions = m_param->num4x4Partitions; x265_alloc_analysis_data(m_param, analysis); } + if (m_param->bEnableTemporalSubLayers > 2) + { + //Re-assign temporalid if the current frame is at the end of encode or when I slice is encountered + if ((frameEnc->m_poc == (m_param->totalFrames - 1)) || (frameEnc->m_lowres.sliceType == X265_TYPE_I) || (frameEnc->m_lowres.sliceType == X265_TYPE_IDR)) + { + frameEnc->m_tempLayer = (int8_t)0; + } + } /* determine references, setup RPS, etc */ m_dpb->prepareEncode(frameEnc); + + if (m_param->bEnableTemporalFilter) + { + X265_CHECK(!m_origPicBuffer->m_mcstfOrigPicFreeList.empty(), "Frames not available in Encoded OPB"); + + Frame *dupFrame = m_origPicBuffer->m_mcstfOrigPicFreeList.popBackMCSTF(); + dupFrame->m_fencPic->copyFromFrame(frameEnc->m_fencPic); + dupFrame->m_poc = frameEnc->m_poc; + dupFrame->m_encodeOrder = frameEnc->m_encodeOrder; + dupFrame->m_refPicCnt1 = 2 * dupFrame->m_mcstf->m_range + 1; + + if (dupFrame->m_poc < dupFrame->m_mcstf->m_range) + dupFrame->m_refPicCnt1 -= (uint8_t)(dupFrame->m_mcstf->m_range - dupFrame->m_poc); + if (m_param->totalFrames && (dupFrame->m_poc >= (m_param->totalFrames - dupFrame->m_mcstf->m_range))) + dupFrame->m_refPicCnt1 -= (uint8_t)(dupFrame->m_poc + dupFrame->m_mcstf->m_range - m_param->totalFrames + 1); + + m_origPicBuffer->addEncPictureToPicList(dupFrame); + m_origPicBuffer->setOrigPicList(frameEnc, m_pocLast); + } + if (!!m_param->selectiveSAO) { Slice* slice = 
frameEnc->m_encData->m_slice; @@ -2449,9 +2365,72 @@ if (m_param->rc.rateControlMode != X265_RC_CQP) m_lookahead->getEstimatedPictureCost(frameEnc); + if (m_param->bIntraRefresh) calcRefreshInterval(frameEnc); + // Generate MCSTF References and perform HME + if (m_param->bEnableTemporalFilter && isFilterThisframe(frameEnc->m_mcstf->m_sliceTypeConfig, frameEnc->m_lowres.sliceType)) + { + + if (!generateMcstfRef(frameEnc, curEncoder)) + { + m_aborted = true; + x265_log(m_param, X265_LOG_ERROR, "Failed to initialize MCSTFReferencePicInfo at POC %d\n", frameEnc->m_poc); + fflush(stderr); + return -1; + } + + + if (!*frameEnc->m_isSubSampled) + { + primitives.frameSubSampleLuma((const pixel *)frameEnc->m_fencPic->m_picOrg0,frameEnc->m_fencPicSubsampled2->m_picOrg0, frameEnc->m_fencPic->m_stride, frameEnc->m_fencPicSubsampled2->m_stride, frameEnc->m_fencPicSubsampled2->m_picWidth, frameEnc->m_fencPicSubsampled2->m_picHeight); + extendPicBorder(frameEnc->m_fencPicSubsampled2->m_picOrg0, frameEnc->m_fencPicSubsampled2->m_stride, frameEnc->m_fencPicSubsampled2->m_picWidth, frameEnc->m_fencPicSubsampled2->m_picHeight, frameEnc->m_fencPicSubsampled2->m_lumaMarginX, frameEnc->m_fencPicSubsampled2->m_lumaMarginY); + primitives.frameSubSampleLuma((const pixel *)frameEnc->m_fencPicSubsampled2->m_picOrg0,frameEnc->m_fencPicSubsampled4->m_picOrg0, frameEnc->m_fencPicSubsampled2->m_stride, frameEnc->m_fencPicSubsampled4->m_stride, frameEnc->m_fencPicSubsampled4->m_picWidth, frameEnc->m_fencPicSubsampled4->m_picHeight); + extendPicBorder(frameEnc->m_fencPicSubsampled4->m_picOrg0, frameEnc->m_fencPicSubsampled4->m_stride, frameEnc->m_fencPicSubsampled4->m_picWidth, frameEnc->m_fencPicSubsampled4->m_picHeight, frameEnc->m_fencPicSubsampled4->m_lumaMarginX, frameEnc->m_fencPicSubsampled4->m_lumaMarginY); + *frameEnc->m_isSubSampled = true; + } + + for (uint8_t i = 1; i <= frameEnc->m_mcstf->m_numRef; i++) + { + TemporalFilterRefPicInfo *ref = &curEncoder->m_mcstfRefListi - 1; + if (!*ref->isSubsampled) + { + primitives.frameSubSampleLuma((const pixel *)ref->picBuffer->m_picOrg0, ref->picBufferSubSampled2->m_picOrg0, ref->picBuffer->m_stride, ref->picBufferSubSampled2->m_stride, ref->picBufferSubSampled2->m_picWidth, ref->picBufferSubSampled2->m_picHeight); + extendPicBorder(ref->picBufferSubSampled2->m_picOrg0, ref->picBufferSubSampled2->m_stride, ref->picBufferSubSampled2->m_picWidth, ref->picBufferSubSampled2->m_picHeight, ref->picBufferSubSampled2->m_lumaMarginX, ref->picBufferSubSampled2->m_lumaMarginY); + primitives.frameSubSampleLuma((const pixel *)ref->picBufferSubSampled2->m_picOrg0,ref->picBufferSubSampled4->m_picOrg0, ref->picBufferSubSampled2->m_stride, ref->picBufferSubSampled4->m_stride, ref->picBufferSubSampled4->m_picWidth, ref->picBufferSubSampled4->m_picHeight); + extendPicBorder(ref->picBufferSubSampled4->m_picOrg0, ref->picBufferSubSampled4->m_stride, ref->picBufferSubSampled4->m_picWidth, ref->picBufferSubSampled4->m_picHeight, ref->picBufferSubSampled4->m_lumaMarginX, ref->picBufferSubSampled4->m_lumaMarginY); + *ref->isSubsampled = true; + } + } + + for (uint8_t i = 1; i <= frameEnc->m_mcstf->m_numRef; i++) + { + TemporalFilterRefPicInfo *ref = &curEncoder->m_mcstfRefListi - 1; + + curEncoder->m_frameEncTF->motionEstimationLuma(ref->mvs0, ref->mvsStride0, frameEnc->m_fencPicSubsampled4, ref->picBufferSubSampled4, 16); + curEncoder->m_frameEncTF->motionEstimationLuma(ref->mvs1, ref->mvsStride1, frameEnc->m_fencPicSubsampled2, ref->picBufferSubSampled2, 16, ref->mvs0, ref->mvsStride0, 2); + 
curEncoder->m_frameEncTF->motionEstimationLuma(ref->mvs2, ref->mvsStride2, frameEnc->m_fencPic, ref->picBuffer, 16, ref->mvs1, ref->mvsStride1, 2); + curEncoder->m_frameEncTF->motionEstimationLumaDoubleRes(ref->mvs, ref->mvsStride, frameEnc->m_fencPic, ref->picBuffer, 8, ref->mvs2, ref->mvsStride2, 1, ref->error); + } + + for (int i = 0; i < frameEnc->m_mcstf->m_numRef; i++) + { + TemporalFilterRefPicInfo *ref = &curEncoder->m_mcstfRefListi; + ref->slicetype = m_lookahead->findSliceType(frameEnc->m_poc + ref->origOffset); + Frame* dpbframePtr = m_dpb->m_picList.getPOC(frameEnc->m_poc + ref->origOffset); + if (dpbframePtr != NULL) + { + if (dpbframePtr->m_encData->m_slice->m_sliceType == B_SLICE) + ref->slicetype = X265_TYPE_B; + else if (dpbframePtr->m_encData->m_slice->m_sliceType == P_SLICE) + ref->slicetype = X265_TYPE_P; + else + ref->slicetype = X265_TYPE_I; + } + } + } + /* Allow FrameEncoder::compressFrame() to start in the frame encoder thread */ if (!curEncoder->startCompressFrame(frameEnc)) m_aborted = true; @@ -2523,7 +2502,11 @@ encParam->dynamicRd = param->dynamicRd; encParam->bEnableTransformSkip = param->bEnableTransformSkip; encParam->bEnableAMP = param->bEnableAMP; - + if (param->confWinBottomOffset == 0 && param->confWinRightOffset == 0) + { + encParam->confWinBottomOffset = param->confWinBottomOffset; + encParam->confWinRightOffset = param->confWinRightOffset; + } /* Resignal changes in params in Parameter Sets */ m_sps.maxAMPDepth = (m_sps.bUseAMP = param->bEnableAMP && param->bEnableAMP) ? param->maxCUDepth : 0; m_pps.bTransformSkipEnabled = param->bEnableTransformSkip ? 1 : 0; @@ -2729,18 +2712,7 @@ (float)100.0 * m_numLumaWPBiFrames / m_analyzeB.m_numPics, (float)100.0 * m_numChromaWPBiFrames / m_analyzeB.m_numPics); } - int pWithB = 0; - for (int i = 0; i <= m_param->bframes; i++) - pWithB += m_lookahead->m_histogrami; - if (pWithB) - { - int p = 0; - for (int i = 0; i <= m_param->bframes; i++) - p += sprintf(buffer + p, "%.1f%% ", 100. * m_lookahead->m_histogrami / pWithB); - - x265_log(m_param, X265_LOG_INFO, "consecutive B-frames: %s\n", buffer); - } if (m_param->bLossless) { float frameSize = (float)(m_param->sourceWidth - m_sps.conformanceWindow.rightOffset) * @@ -3341,6 +3313,19 @@ } } +void Encoder::getEndNalUnits(NALList& list, Bitstream& bs) +{ + NALList nalList; + bs.resetBits(); + + if (m_param->bEnableEndOfSequence) + nalList.serialize(NAL_UNIT_EOS, bs); + if (m_param->bEnableEndOfBitstream) + nalList.serialize(NAL_UNIT_EOB, bs); + + list.takeContents(nalList); +} + void Encoder::initVPS(VPS *vps) { /* Note that much of the VPS is initialized by determineLevel() */ @@ -3375,10 +3360,14 @@ sps->bUseAMP = m_param->bEnableAMP; sps->maxAMPDepth = m_param->bEnableAMP ? m_param->maxCUDepth : 0; - sps->maxTempSubLayers = m_param->bEnableTemporalSubLayers ? 
2 : 1; - sps->maxDecPicBuffering = m_vps.maxDecPicBuffering; - sps->numReorderPics = m_vps.numReorderPics; - sps->maxLatencyIncrease = m_vps.maxLatencyIncrease = m_param->bframes; + sps->maxTempSubLayers = m_vps.maxTempSubLayers;// Getting the value from the user + + for(uint8_t i = 0; i < sps->maxTempSubLayers; i++) + { + sps->maxDecPicBufferingi = m_vps.maxDecPicBufferingi; + sps->numReorderPicsi = m_vps.numReorderPicsi; + sps->maxLatencyIncreasei = m_vps.maxLatencyIncreasei = m_param->bframes; + } sps->bUseStrongIntraSmoothing = m_param->bEnableStrongIntraSmoothing; sps->bTemporalMVPEnabled = m_param->bEnableTemporalMvp; @@ -3518,6 +3507,11 @@ p->rc.aqMode = X265_AQ_NONE; p->rc.hevcAq = 0; } + if (p->rc.aqMode == 0 && p->rc.cuTree) + { + p->rc.aqMode = X265_AQ_VARIANCE; + p->rc.aqStrength = 0; + } p->radl = zone->radl; } memcpy(zone, p, sizeof(x265_param)); @@ -3548,6 +3542,65 @@ p->crQpOffset = 3; } +void Encoder::configureVideoSignalTypePreset(x265_param* p) +{ + char systemId20 = {}; + char colorVolume20 = {}; + sscanf(p->videoSignalTypePreset, "%^::%s", systemId, colorVolume); + uint32_t sysId = 0; + while (strcmp(vstPresetssysId.systemId, systemId)) + { + if (sysId + 1 == sizeof(vstPresets) / sizeof(vstPresets0)) + { + x265_log(NULL, X265_LOG_ERROR, "Incorrect system-id, aborting\n"); + m_aborted = true; + break; + } + sysId++; + } + + p->vui.bEnableVideoSignalTypePresentFlag = vstPresetssysId.bEnableVideoSignalTypePresentFlag; + p->vui.bEnableColorDescriptionPresentFlag = vstPresetssysId.bEnableColorDescriptionPresentFlag; + p->vui.bEnableChromaLocInfoPresentFlag = vstPresetssysId.bEnableChromaLocInfoPresentFlag; + p->vui.colorPrimaries = vstPresetssysId.colorPrimaries; + p->vui.transferCharacteristics = vstPresetssysId.transferCharacteristics; + p->vui.matrixCoeffs = vstPresetssysId.matrixCoeffs; + p->vui.bEnableVideoFullRangeFlag = vstPresetssysId.bEnableVideoFullRangeFlag; + p->vui.chromaSampleLocTypeTopField = vstPresetssysId.chromaSampleLocTypeTopField; + p->vui.chromaSampleLocTypeBottomField = vstPresetssysId.chromaSampleLocTypeBottomField; + + if (colorVolume0 != '\0') + { + if (!strcmp(systemId, "BT2100_PQ_YCC") || !strcmp(systemId, "BT2100_PQ_ICTCP") || !strcmp(systemId, "BT2100_PQ_RGB")) + { + p->bEmitHDR10SEI = 1; + if (!strcmp(colorVolume, "P3D65x1000n0005")) + { + p->masteringDisplayColorVolume = strdup("G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,5)"); + } + else if (!strcmp(colorVolume, "P3D65x4000n005")) + { + p->masteringDisplayColorVolume = strdup("G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(40000000,50)"); + } + else if (!strcmp(colorVolume, "BT2100x108n0005")) + { + p->masteringDisplayColorVolume = strdup("G(8500,39850)B(6550,2300)R(34000,146000)WP(15635,16450)L(10000000,1)"); + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Incorrect color-volume, aborting\n"); + m_aborted = true; + } + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Color-volume is not supported with the given system-id, aborting\n"); + m_aborted = true; + } + } + +} + void Encoder::configure(x265_param *p) { this->m_param = p; @@ -3610,6 +3663,12 @@ if (!p->rdoqLevel) p->psyRdoq = 0; + if (p->craNal && p->keyframeMax > 1) + { + x265_log_file(NULL, X265_LOG_ERROR, " --cra-nal works only with keyint 1, but given keyint = %s\n", p->keyframeMax); + m_aborted = true; + } + /* Disable features which are not supported by the current RD level */ if (p->rdLevel < 3) { @@ -3848,12 +3907,37 @@ p->limitReferences = 0; } - if (p->bEnableTemporalSubLayers && !p->bframes) 
+ if ((p->bEnableTemporalSubLayers > 2) && !p->bframes) { x265_log(p, X265_LOG_WARNING, "B frames not enabled, temporal sublayer disabled\n"); p->bEnableTemporalSubLayers = 0; } + if (!!p->bEnableTemporalSubLayers && p->bEnableTemporalSubLayers < 2) + { + p->bEnableTemporalSubLayers = 0; + x265_log(p, X265_LOG_WARNING, "No support for temporal sublayers less than 2; Disabling temporal layers\n"); + } + + if (p->bEnableTemporalSubLayers > 5) + { + p->bEnableTemporalSubLayers = 5; + x265_log(p, X265_LOG_WARNING, "No support for temporal sublayers more than 5; Reducing the temporal sublayers to 5\n"); + } + + // Assign number of B frames for temporal layers + if (p->bEnableTemporalSubLayers > 2) + p->bframes = x265_temporal_layer_bframesp->bEnableTemporalSubLayers - 1; + + if (p->bEnableTemporalSubLayers > 2) + { + if (p->bFrameAdaptive) + { + x265_log(p, X265_LOG_WARNING, "Disabling adaptive B-frame placement to support temporal sub-layers\n"); + p->bFrameAdaptive = 0; + } + } + m_bframeDelay = p->bframes ? (p->bBPyramid ? 2 : 1) : 0; p->bFrameBias = X265_MIN(X265_MAX(-90, p->bFrameBias), 100); @@ -3907,6 +3991,16 @@ p->rc.bStatRead = 0; } + if ((p->rc.bStatWrite || p->rc.bStatRead) && p->rc.dataShareMode != X265_SHARE_MODE_FILE && p->rc.dataShareMode != X265_SHARE_MODE_SHAREDMEM) + { + p->rc.dataShareMode = X265_SHARE_MODE_FILE; + } + + if (!p->rc.bStatRead || p->rc.rateControlMode != X265_RC_CRF) + { + p->rc.bEncFocusedFramesOnly = 0; + } + /* some options make no sense if others are disabled */ p->bSaoNonDeblocked &= p->bEnableSAO; p->bEnableTSkipFast &= p->bEnableTransformSkip; @@ -4243,6 +4337,9 @@ } } + if (p->videoSignalTypePreset) // Default disabled. + configureVideoSignalTypePreset(p); + if (m_param->toneMapFile || p->bHDR10Opt || p->bEmitHDR10SEI) { if (!p->bRepeatHeaders) @@ -4313,12 +4410,26 @@ m_param->searchRange = m_param->hmeRange2; } - if (p->bHistBasedSceneCut && !p->edgeTransitionThreshold) - { - p->edgeTransitionThreshold = 0.03; - x265_log(p, X265_LOG_WARNING, "using default threshold %.2lf for scene cut detection\n", p->edgeTransitionThreshold); - } + if (p->bEnableSBRC && (p->rc.rateControlMode != X265_RC_CRF || (p->rc.vbvBufferSize == 0 || p->rc.vbvMaxBitrate == 0))) + { + x265_log(p, X265_LOG_WARNING, "SBRC can be enabled only with CRF+VBV mode. Disabling SBRC\n"); + p->bEnableSBRC = 0; + } + if (p->bEnableSBRC) + { + p->rc.ipFactor = p->rc.ipFactor * X265_IPRATIO_STRENGTH; + if (p->bOpenGOP) + { + x265_log(p, X265_LOG_WARNING, "Segment based RateControl requires closed gop structure. Enabling closed GOP.\n"); + p->bOpenGOP = 0; + } + if (p->keyframeMax != p->keyframeMin) + { + x265_log(p, X265_LOG_WARNING, "Segment based RateControl requires fixed gop length. 
Force set min-keyint equal to keyint.\n"); + p->keyframeMin = p->keyframeMax; + } + } } void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, const x265_picture* picIn, int paramBytes) @@ -4379,16 +4490,6 @@ analysis->frameRecordSize = frameRecordSize; X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFileIn, &(picData->sliceType)); X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFileIn, &(picData->bScenecut)); - if (m_param->bHistBasedSceneCut) - { - X265_FREAD(&analysis->edgeHist, sizeof(int32_t), EDGE_BINS, m_analysisFileIn, &m_curEdgeHist); - X265_FREAD(&analysis->yuvHist0, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist0); - if (m_param->internalCsp != X265_CSP_I400) - { - X265_FREAD(&analysis->yuvHist1, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist1); - X265_FREAD(&analysis->yuvHist2, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist2); - } - } X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileIn, &(picData->satdCost)); X265_FREAD(&numCUsLoad, sizeof(int), 1, m_analysisFileIn, &(picData->numCUsInFrame)); X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFileIn, &(picData->numPartitions)); @@ -4711,16 +4812,6 @@ analysis->frameRecordSize = frameRecordSize; X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFileIn, &(picData->sliceType)); X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFileIn, &(picData->bScenecut)); - if (m_param->bHistBasedSceneCut) - { - X265_FREAD(&analysis->edgeHist, sizeof(int32_t), EDGE_BINS, m_analysisFileIn, &m_curEdgeHist); - X265_FREAD(&analysis->yuvHist0, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist0); - if (m_param->internalCsp != X265_CSP_I400) - { - X265_FREAD(&analysis->yuvHist1, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist1); - X265_FREAD(&analysis->yuvHist2, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileIn, &m_curYUVHist2); - } - } X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileIn, &(picData->satdCost)); X265_FREAD(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileIn, &(picData->numCUsInFrame)); X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFileIn, &(picData->numPartitions)); @@ -4810,8 +4901,14 @@ if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { - if (m_param->analysisLoadReuseLevel < 2) - return; + if (m_param->analysisLoadReuseLevel < 2) + { + /* Restore to the current encode's numPartitions and numCUsInFrame */ + analysis->numPartitions = m_param->num4x4Partitions; + analysis->numCUsInFrame = cuLoc.heightInCU * cuLoc.widthInCU; + analysis->numCuInHeight = cuLoc.heightInCU; + return; + } uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSizes = NULL; int8_t *cuQPBuf = NULL; @@ -4879,8 +4976,14 @@ uint32_t numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 
1 : 3; X265_FREAD((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFileIn, (picIn->analysisData.wt)); - if (m_param->analysisLoadReuseLevel < 2) - return; + if (m_param->analysisLoadReuseLevel < 2) + { + /* Restore to the current encode's numPartitions and numCUsInFrame */ + analysis->numPartitions = m_param->num4x4Partitions; + analysis->numCUsInFrame = cuLoc.heightInCU * cuLoc.widthInCU; + analysis->numCuInHeight = cuLoc.heightInCU; + return; + } uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSize = NULL, *mergeFlag = NULL; uint8_t *interDir = NULL, *chromaDir = NULL, *mvpIdx2; @@ -5167,7 +5270,7 @@ int bcutree; X265_FREAD(&bcutree, sizeof(int), 1, m_analysisFileIn, &(saveParam->cuTree)); - if (loadLevel == 10 && m_param->rc.cuTree && (!bcutree || saveLevel < 2)) + if (loadLevel >= 2 && m_param->rc.cuTree && (!bcutree || saveLevel < 2)) { x265_log(NULL, X265_LOG_ERROR, "Error reading cu-tree info. Disabling cutree offsets. \n"); m_param->rc.cuTree = 0; @@ -5337,6 +5440,7 @@ distortionData->highDistortionCtuCount++; } } + void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, int sliceType) { @@ -5486,17 +5590,6 @@ /* calculate frameRecordSize */ analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(depthBytes) + sizeof(analysis->poc) + sizeof(analysis->sliceType) + sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions) + sizeof(analysis->bScenecut) + sizeof(analysis->satdCost); - if (m_param->bHistBasedSceneCut) - { - analysis->frameRecordSize += sizeof(analysis->edgeHist); - analysis->frameRecordSize += sizeof(int32_t) * HISTOGRAM_BINS; - if (m_param->internalCsp != X265_CSP_I400) - { - analysis->frameRecordSize += sizeof(int32_t) * HISTOGRAM_BINS; - analysis->frameRecordSize += sizeof(int32_t) * HISTOGRAM_BINS; - } - } - if (analysis->sliceType > X265_TYPE_I) { numDir = (analysis->sliceType == X265_TYPE_P) ? 1 : 2; @@ -5641,17 +5734,6 @@ X265_FWRITE(&analysis->poc, sizeof(int), 1, m_analysisFileOut); X265_FWRITE(&analysis->sliceType, sizeof(int), 1, m_analysisFileOut); X265_FWRITE(&analysis->bScenecut, sizeof(int), 1, m_analysisFileOut); - if (m_param->bHistBasedSceneCut) - { - X265_FWRITE(&analysis->edgeHist, sizeof(int32_t), EDGE_BINS, m_analysisFileOut); - X265_FWRITE(&analysis->yuvHist0, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileOut); - if (m_param->internalCsp != X265_CSP_I400) - { - X265_FWRITE(&analysis->yuvHist1, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileOut); - X265_FWRITE(&analysis->yuvHist2, sizeof(int32_t), HISTOGRAM_BINS, m_analysisFileOut); - } - } - X265_FWRITE(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileOut); X265_FWRITE(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileOut); X265_FWRITE(&analysis->numPartitions, sizeof(int), 1, m_analysisFileOut);
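Note on the new configureVideoSignalTypePreset() shown above: the videoSignalTypePreset string is split at the first ':' into a system-id part and an optional colour-volume part (e.g. "BT2100_PQ_YCC:P3D65x1000n0005"). A minimal stand-alone sketch of that split; the helper name parsePreset and the 20-character buffers are illustrative assumptions, not part of the x265 API:

#include <cstdio>

// Hypothetical helper mirroring the "<system-id>[:<colour-volume>]" split used above,
// e.g. "BT2100_PQ_YCC:P3D65x1000n0005" -> "BT2100_PQ_YCC" and "P3D65x1000n0005".
static bool parsePreset(const char* preset, char* systemId, char* colourVolume)
{
    systemId[0] = colourVolume[0] = '\0';
    // %[^:] reads everything before the first ':'; the optional colour-volume follows it
    return sscanf(preset, "%19[^:]:%19s", systemId, colourVolume) >= 1;
}

int main()
{
    char systemId[20], colourVolume[20];
    if (parsePreset("BT2100_PQ_YCC:P3D65x1000n0005", systemId, colourVolume))
        printf("system-id=%s colour-volume=%s\n", systemId, colourVolume);
    return 0;
}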
View file
x265_3.5.tar.gz/source/encoder/encoder.h -> x265_3.6.tar.gz/source/encoder/encoder.h
Changed
@@ -32,6 +32,7 @@
 #include "nal.h"
 #include "framedata.h"
 #include "svt.h"
+#include "temporalfilter.h"
 #ifdef ENABLE_HDR10_PLUS
 #include "dynamicHDR10/hdr10plus.h"
 #endif
@@ -256,19 +257,6 @@
     int m_bToneMap; // Enables tone-mapping
     int m_enableNal;
 
-    /* For histogram based scene-cut detection */
-    pixel* m_edgePic;
-    pixel* m_inputPic[3];
-    int32_t m_curYUVHist[3][HISTOGRAM_BINS];
-    int32_t m_prevYUVHist[3][HISTOGRAM_BINS];
-    int32_t m_curEdgeHist[2];
-    int32_t m_prevEdgeHist[2];
-    uint32_t m_planeSizes[3];
-    double m_edgeHistThreshold;
-    double m_chromaHistThreshold;
-    double m_scaledEdgeThreshold;
-    double m_scaledChromaThreshold;
-
 #ifdef ENABLE_HDR10_PLUS
     const hdr10plus_api *m_hdr10plus_api;
     uint8_t **m_cim;
@@ -295,6 +283,9 @@
     ThreadSafeInteger* zoneReadCount;
     ThreadSafeInteger* zoneWriteCount;
+    /* Film grain model file */
+    FILE* m_filmGrainIn;
+    OrigPicBuffer* m_origPicBuffer;
     Encoder();
     ~Encoder()
@@ -327,6 +318,8 @@
     void getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs);
+    void getEndNalUnits(NALList& list, Bitstream& bs);
+
     void fetchStats(x265_stats* stats, size_t statsSizeBytes);
     void printSummary();
@@ -373,11 +366,6 @@
     void copyPicture(x265_picture *dest, const x265_picture *src);
-    bool computeHistograms(x265_picture *pic);
-    void computeHistogramSAD(double *maxUVNormalizedSAD, double *edgeNormalizedSAD, int curPoc);
-    double normalizeRange(int32_t value, int32_t minValue, int32_t maxValue, double rangeStart, double rangeEnd);
-    void findSceneCuts(x265_picture *pic, bool& bDup, double m_maxUVSADVal, double m_edgeSADVal, bool& isMaxThres, bool& isHardSC);
-
     void initRefIdx();
     void analyseRefIdx(int *numRefIdx);
     void updateRefIdx();
@@ -387,6 +375,11 @@
     void configureDolbyVisionParams(x265_param* p);
+    void configureVideoSignalTypePreset(x265_param* p);
+
+    bool isFilterThisframe(uint8_t sliceTypeConfig, int curSliceType);
+    bool generateMcstfRef(Frame* frameEnc, FrameEncoder* currEncoder);
+
 protected:
     void initVPS(VPS *vps);
View file
x265_3.5.tar.gz/source/encoder/entropy.cpp -> x265_3.6.tar.gz/source/encoder/entropy.cpp
Changed
@@ -245,9 +245,9 @@
     for (uint32_t i = 0; i < vps.maxTempSubLayers; i++)
     {
-        WRITE_UVLC(vps.maxDecPicBuffering - 1, "vps_max_dec_pic_buffering_minus1[i]");
-        WRITE_UVLC(vps.numReorderPics, "vps_num_reorder_pics[i]");
-        WRITE_UVLC(vps.maxLatencyIncrease + 1, "vps_max_latency_increase_plus1[i]");
+        WRITE_UVLC(vps.maxDecPicBuffering[i] - 1, "vps_max_dec_pic_buffering_minus1[i]");
+        WRITE_UVLC(vps.numReorderPics[i], "vps_num_reorder_pics[i]");
+        WRITE_UVLC(vps.maxLatencyIncrease[i] + 1, "vps_max_latency_increase_plus1[i]");
     }
 
     WRITE_CODE(0, 6, "vps_max_nuh_reserved_zero_layer_id");
@@ -291,9 +291,9 @@
     for (uint32_t i = 0; i < sps.maxTempSubLayers; i++)
     {
-        WRITE_UVLC(sps.maxDecPicBuffering - 1, "sps_max_dec_pic_buffering_minus1[i]");
-        WRITE_UVLC(sps.numReorderPics, "sps_num_reorder_pics[i]");
-        WRITE_UVLC(sps.maxLatencyIncrease + 1, "sps_max_latency_increase_plus1[i]");
+        WRITE_UVLC(sps.maxDecPicBuffering[i] - 1, "sps_max_dec_pic_buffering_minus1[i]");
+        WRITE_UVLC(sps.numReorderPics[i], "sps_num_reorder_pics[i]");
+        WRITE_UVLC(sps.maxLatencyIncrease[i] + 1, "sps_max_latency_increase_plus1[i]");
     }
 
     WRITE_UVLC(sps.log2MinCodingBlockSize - 3, "log2_min_coding_block_size_minus3");
@@ -418,8 +418,11 @@
     if (maxTempSubLayers > 1)
     {
-        WRITE_FLAG(0, "sub_layer_profile_present_flag[i]");
-        WRITE_FLAG(0, "sub_layer_level_present_flag[i]");
+        for(int i = 0; i < maxTempSubLayers - 1; i++)
+        {
+            WRITE_FLAG(0, "sub_layer_profile_present_flag[i]");
+            WRITE_FLAG(0, "sub_layer_level_present_flag[i]");
+        }
         for (int i = maxTempSubLayers - 1; i < 8 ; i++)
             WRITE_CODE(0, 2, "reserved_zero_2bits");
     }
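A side note on the profile_tier_level hunk above: writing one sub_layer_profile/level_present_flag pair per sub-layer keeps this part of the header at a fixed 16 bits for any layer count, since 2*(maxTempSubLayers - 1) flag bits plus 2*(8 - (maxTempSubLayers - 1)) reserved bits always sum to 16. A small stand-alone check of that arithmetic (the layer counts are just examples):

#include <cstdio>

int main()
{
    for (int maxTempSubLayers = 2; maxTempSubLayers <= 5; maxTempSubLayers++)
    {
        int flagBits     = 2 * (maxTempSubLayers - 1);       // sub_layer_profile/level_present_flag pairs
        int reservedBits = 2 * (8 - (maxTempSubLayers - 1)); // reserved_zero_2bits entries
        printf("maxTempSubLayers=%d -> %d bits\n", maxTempSubLayers, flagBits + reservedBits); // always 16
    }
    return 0;
}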
View file
x265_3.5.tar.gz/source/encoder/frameencoder.cpp -> x265_3.6.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -34,6 +34,7 @@ #include "common.h" #include "slicetype.h" #include "nal.h" +#include "temporalfilter.h" namespace X265_NS { void weightAnalyse(Slice& slice, Frame& frame, x265_param& param); @@ -101,6 +102,16 @@ delete m_rce.picTimingSEI; delete m_rce.hrdTiming; } + + if (m_param->bEnableTemporalFilter) + { + delete m_frameEncTF->m_metld; + + for (int i = 0; i < (m_frameEncTF->m_range << 1); i++) + m_frameEncTF->destroyRefPicInfo(&m_mcstfRefListi); + + delete m_frameEncTF; + } } bool FrameEncoder::init(Encoder *top, int numRows, int numCols) @@ -195,6 +206,16 @@ m_sliceAddrBits = (uint16_t)(tmp + 1); } + if (m_param->bEnableTemporalFilter) + { + m_frameEncTF = new TemporalFilter(); + if (m_frameEncTF) + m_frameEncTF->init(m_param); + + for (int i = 0; i < (m_frameEncTF->m_range << 1); i++) + ok &= !!m_frameEncTF->createRefPicInfo(&m_mcstfRefListi, m_param); + } + return ok; } @@ -450,7 +471,7 @@ m_ssimCnt = 0; memset(&(m_frame->m_encData->m_frameStats), 0, sizeof(m_frame->m_encData->m_frameStats)); - if (!m_param->bHistBasedSceneCut && m_param->rc.aqMode != X265_AQ_EDGE && m_param->recursionSkipMode == EDGE_BASED_RSKIP) + if (m_param->rc.aqMode != X265_AQ_EDGE && m_param->recursionSkipMode == EDGE_BASED_RSKIP) { int height = m_frame->m_fencPic->m_picHeight; int width = m_frame->m_fencPic->m_picWidth; @@ -467,6 +488,12 @@ * unit) */ Slice* slice = m_frame->m_encData->m_slice; + if (m_param->bEnableEndOfSequence && m_frame->m_lowres.sliceType == X265_TYPE_IDR && m_frame->m_poc) + { + m_bs.resetBits(); + m_nalList.serialize(NAL_UNIT_EOS, m_bs); + } + if (m_param->bEnableAccessUnitDelimiters && (m_frame->m_poc || m_param->bRepeatHeaders)) { m_bs.resetBits(); @@ -573,6 +600,12 @@ int qp = m_top->m_rateControl->rateControlStart(m_frame, &m_rce, m_top); m_rce.newQp = qp; + if (m_param->bEnableTemporalFilter) + { + m_frameEncTF->m_QP = qp; + m_frameEncTF->bilateralFilter(m_frame, m_mcstfRefList, m_param->temporalFilterStrength); + } + if (m_nr) { if (qp > QP_MAX_SPEC && m_frame->m_param->rc.vbvBufferSize) @@ -744,7 +777,7 @@ // wait after removal of the access unit with the most recent // buffering period SEI message sei->m_auCpbRemovalDelay = X265_MIN(X265_MAX(1, m_rce.encodeOrder - prevBPSEI), (1 << hrd->cpbRemovalDelayLength)); - sei->m_picDpbOutputDelay = slice->m_sps->numReorderPics + poc - m_rce.encodeOrder; + sei->m_picDpbOutputDelay = slice->m_sps->numReorderPicsm_frame->m_tempLayer + poc - m_rce.encodeOrder; } sei->writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); @@ -756,7 +789,14 @@ m_seiAlternativeTC.m_preferredTransferCharacteristics = m_param->preferredTransferCharacteristics; m_seiAlternativeTC.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); } - + /* Write Film grain characteristics if present */ + if (this->m_top->m_filmGrainIn) + { + FilmGrainCharacteristics m_filmGrain; + /* Read the Film grain model file */ + readModel(&m_filmGrain, this->m_top->m_filmGrainIn); + m_filmGrain.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); + } /* Write user SEI */ for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++) { @@ -933,6 +973,23 @@ if (m_param->bDynamicRefine && m_top->m_startPoint <= m_frame->m_encodeOrder) //Avoid collecting data that will not be used by future frames. 
collectDynDataFrame(); + if (m_param->bEnableTemporalFilter && m_top->isFilterThisframe(m_frame->m_mcstf->m_sliceTypeConfig, m_frame->m_lowres.sliceType)) + { + //Reset the MCSTF context in Frame Encoder and Frame + for (int i = 0; i < (m_frameEncTF->m_range << 1); i++) + { + memset(m_mcstfRefListi.mvs0, 0, sizeof(MV) * ((m_param->sourceWidth / 16) * (m_param->sourceHeight / 16))); + memset(m_mcstfRefListi.mvs1, 0, sizeof(MV) * ((m_param->sourceWidth / 16) * (m_param->sourceHeight / 16))); + memset(m_mcstfRefListi.mvs2, 0, sizeof(MV) * ((m_param->sourceWidth / 16) * (m_param->sourceHeight / 16))); + memset(m_mcstfRefListi.mvs, 0, sizeof(MV) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4))); + memset(m_mcstfRefListi.noise, 0, sizeof(int) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4))); + memset(m_mcstfRefListi.error, 0, sizeof(int) * ((m_param->sourceWidth / 4) * (m_param->sourceHeight / 4))); + + m_frame->m_mcstf->m_numRef = 0; + } + } + + if (m_param->rc.bStatWrite) { int totalI = 0, totalP = 0, totalSkip = 0; @@ -1041,7 +1098,7 @@ m_bs.writeByteAlignment(); - m_nalList.serialize(slice->m_nalUnitType, m_bs); + m_nalList.serialize(slice->m_nalUnitType, m_bs, (!!m_param->bEnableTemporalSubLayers ? m_frame->m_tempLayer + 1 : (1 + (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N)))); } } else @@ -1062,7 +1119,7 @@ m_entropyCoder.codeSliceHeaderWPPEntryPoints(m_substreamSizes, (slice->m_sps->numCuInHeight - 1), maxStreamSize); m_bs.writeByteAlignment(); - m_nalList.serialize(slice->m_nalUnitType, m_bs); + m_nalList.serialize(slice->m_nalUnitType, m_bs, (!!m_param->bEnableTemporalSubLayers ? m_frame->m_tempLayer + 1 : (1 + (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N)))); } if (m_param->decodedPictureHashSEI) @@ -2127,6 +2184,54 @@ m_nr->nrOffsetDenoisecat0 = 0; } } + +void FrameEncoder::readModel(FilmGrainCharacteristics* m_filmGrain, FILE* filmgrain) +{ + char const* errorMessage = "Error reading FilmGrain characteristics\n"; + FilmGrain m_fg; + x265_fread((char* )&m_fg, sizeof(bool) * 3 + sizeof(uint8_t), 1, filmgrain, errorMessage); + m_filmGrain->m_filmGrainCharacteristicsCancelFlag = m_fg.m_filmGrainCharacteristicsCancelFlag; + m_filmGrain->m_filmGrainCharacteristicsPersistenceFlag = m_fg.m_filmGrainCharacteristicsPersistenceFlag; + m_filmGrain->m_filmGrainModelId = m_fg.m_filmGrainModelId; + m_filmGrain->m_separateColourDescriptionPresentFlag = m_fg.m_separateColourDescriptionPresentFlag; + if (m_filmGrain->m_separateColourDescriptionPresentFlag) + { + ColourDescription m_clr; + x265_fread((char* )&m_clr, sizeof(bool) + sizeof(uint8_t) * 5, 1, filmgrain, errorMessage); + m_filmGrain->m_filmGrainBitDepthLumaMinus8 = m_clr.m_filmGrainBitDepthLumaMinus8; + m_filmGrain->m_filmGrainBitDepthChromaMinus8 = m_clr.m_filmGrainBitDepthChromaMinus8; + m_filmGrain->m_filmGrainFullRangeFlag = m_clr.m_filmGrainFullRangeFlag; + m_filmGrain->m_filmGrainColourPrimaries = m_clr.m_filmGrainColourPrimaries; + m_filmGrain->m_filmGrainTransferCharacteristics = m_clr.m_filmGrainTransferCharacteristics; + m_filmGrain->m_filmGrainMatrixCoeffs = m_clr.m_filmGrainMatrixCoeffs; + } + FGPresent m_present; + x265_fread((char* )&m_present, sizeof(bool) * 3 + sizeof(uint8_t) * 2, 1, filmgrain, errorMessage); + m_filmGrain->m_blendingModeId = m_present.m_blendingModeId; + m_filmGrain->m_log2ScaleFactor = m_present.m_log2ScaleFactor; + m_filmGrain->m_compModel0.bPresentFlag = m_present.m_presentFlag0; + m_filmGrain->m_compModel1.bPresentFlag = m_present.m_presentFlag1; + 
m_filmGrain->m_compModel2.bPresentFlag = m_present.m_presentFlag2; + for (int i = 0; i < MAX_NUM_COMPONENT; i++) + { + if (m_filmGrain->m_compModeli.bPresentFlag) + { + x265_fread((char* )(&m_filmGrain->m_compModeli.m_filmGrainNumIntensityIntervalMinus1), sizeof(uint8_t), 1, filmgrain, errorMessage); + x265_fread((char* )(&m_filmGrain->m_compModeli.numModelValues), sizeof(uint8_t), 1, filmgrain, errorMessage); + m_filmGrain->m_compModeli.intensityValues = (FilmGrainCharacteristics::CompModelIntensityValues* ) malloc(sizeof(FilmGrainCharacteristics::CompModelIntensityValues) * (m_filmGrain->m_compModeli.m_filmGrainNumIntensityIntervalMinus1+1)) ; + for (int j = 0; j <= m_filmGrain->m_compModeli.m_filmGrainNumIntensityIntervalMinus1; j++) + { + x265_fread((char* )(&m_filmGrain->m_compModeli.intensityValuesj.intensityIntervalLowerBound), sizeof(uint8_t), 1, filmgrain, errorMessage); + x265_fread((char* )(&m_filmGrain->m_compModeli.intensityValuesj.intensityIntervalUpperBound), sizeof(uint8_t), 1, filmgrain, errorMessage); + m_filmGrain->m_compModeli.intensityValuesj.compModelValue = (int* ) malloc(sizeof(int) * (m_filmGrain->m_compModeli.numModelValues)); + for (int k = 0; k < m_filmGrain->m_compModeli.numModelValues; k++) + { + x265_fread((char* )(&m_filmGrain->m_compModeli.intensityValuesj.compModelValuek), sizeof(int), 1, filmgrain, errorMessage); + } + } + } + } +} #if ENABLE_LIBVMAF void FrameEncoder::vmafFrameLevelScore() {
View file
x265_3.5.tar.gz/source/encoder/frameencoder.h -> x265_3.6.tar.gz/source/encoder/frameencoder.h
Changed
@@ -40,6 +40,7 @@
 #include "ratecontrol.h"
 #include "reference.h"
 #include "nal.h"
+#include "temporalfilter.h"
 
 namespace X265_NS {
 // private x265 namespace
@@ -113,6 +114,34 @@
     }
 };
 
+/*Film grain characteristics*/
+struct FilmGrain
+{
+    bool    m_filmGrainCharacteristicsCancelFlag;
+    bool    m_filmGrainCharacteristicsPersistenceFlag;
+    bool    m_separateColourDescriptionPresentFlag;
+    uint8_t m_filmGrainModelId;
+    uint8_t m_blendingModeId;
+    uint8_t m_log2ScaleFactor;
+};
+
+struct ColourDescription
+{
+    bool    m_filmGrainFullRangeFlag;
+    uint8_t m_filmGrainBitDepthLumaMinus8;
+    uint8_t m_filmGrainBitDepthChromaMinus8;
+    uint8_t m_filmGrainColourPrimaries;
+    uint8_t m_filmGrainTransferCharacteristics;
+    uint8_t m_filmGrainMatrixCoeffs;
+};
+
+struct FGPresent
+{
+    uint8_t m_blendingModeId;
+    uint8_t m_log2ScaleFactor;
+    bool    m_presentFlag[3];
+};
+
 // Manages the wave-front processing of a single encoding frame
 class FrameEncoder : public WaveFront, public Thread
 {
@@ -205,6 +234,10 @@
     FrameFilter              m_frameFilter;
     NALList                  m_nalList;
 
+    // initialization for mcstf
+    TemporalFilter*          m_frameEncTF;
+    TemporalFilterRefPicInfo m_mcstfRefList[MAX_MCSTF_TEMPORAL_WINDOW_LENGTH];
+
     class WeightAnalysis : public BondedTaskGroup
     {
     public:
@@ -250,6 +283,7 @@
     void collectDynDataFrame();
     void computeAvgTrainingData();
     void collectDynDataRow(CUData& ctu, FrameStats* rowStats);
+    void readModel(FilmGrainCharacteristics* m_filmGrain, FILE* filmgrain);
 };
 }
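The m_mcstfRefList array added above holds the MCSTF reference candidates, which generateMcstfRef()/enqueueRefFrame() in encoder.cpp draw from a window of m_range pictures on either side of the current frame, clamped at the sequence boundaries (see the m_refPicCnt bookkeeping there). A stand-alone sketch of that clamping; the range and frame-count values are illustrative only:

#include <cstdio>

// Mirrors the reference-count clamping from encoder.cpp: a frame can use up to
// 2 * range neighbouring pictures (plus itself), fewer near the start or end.
static int mcstfWindowSize(int poc, int range, int totalFrames)
{
    int cnt = 2 * range + 1;
    if (poc < range)                               // clamp at the start of the sequence
        cnt -= range - poc;
    if (totalFrames && poc >= totalFrames - range) // clamp at the end of the sequence
        cnt -= poc + range - totalFrames + 1;
    return cnt;
}

int main()
{
    const int range = 2, totalFrames = 100; // illustrative values
    const int pocs[] = { 0, 1, 50, 98, 99 };
    for (int i = 0; i < 5; i++)
        printf("poc %2d -> window of %d pictures\n", pocs[i], mcstfWindowSize(pocs[i], range, totalFrames));
    return 0;
}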
View file
x265_3.5.tar.gz/source/encoder/level.cpp -> x265_3.6.tar.gz/source/encoder/level.cpp
Changed
@@ -72,7 +72,7 @@
      * for intra-only profiles (vps.ptl.intraConstraintFlag) */
     vps.ptl.lowerBitRateConstraintFlag = true;
 
-    vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1;
+    vps.maxTempSubLayers = !!param.bEnableTemporalSubLayers ? param.bEnableTemporalSubLayers : 1;
 
     if (param.internalCsp == X265_CSP_I420 && param.internalBitDepth <= 10)
     {
@@ -167,7 +167,7 @@
         /* The value of sps_max_dec_pic_buffering_minus1[ HighestTid ] + 1 shall be less than
          * or equal to MaxDpbSize */
-        if (vps.maxDecPicBuffering > maxDpbSize)
+        if (vps.maxDecPicBuffering[vps.maxTempSubLayers - 1] > maxDpbSize)
             continue;
 
         /* For level 5 and higher levels, the value of CtbSizeY shall be equal to 32 or 64 */
@@ -182,8 +182,8 @@
         }
 
         /* The value of NumPocTotalCurr shall be less than or equal to 8 */
-        int numPocTotalCurr = param.maxNumReferences + vps.numReorderPics;
-        if (numPocTotalCurr > 8)
+        int numPocTotalCurr = param.maxNumReferences + vps.numReorderPics[vps.maxTempSubLayers - 1];
+        if (numPocTotalCurr > 10)
         {
             x265_log(&param, X265_LOG_WARNING, "level %s detected, but NumPocTotalCurr (total references) is non-compliant\n", levels[i].name);
             vps.ptl.profileIdc = Profile::NONE;
@@ -289,9 +289,40 @@
  * circumstances it will be quite noisy */
 bool enforceLevel(x265_param& param, VPS& vps)
 {
-    vps.numReorderPics = (param.bBPyramid && param.bframes > 1) ? 2 : !!param.bframes;
-    vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 2, (uint32_t)param.maxNumReferences) + 1);
+    vps.maxTempSubLayers = !!param.bEnableTemporalSubLayers ? param.bEnableTemporalSubLayers : 1;
+    for (uint32_t i = 0; i < vps.maxTempSubLayers; i++)
+    {
+        vps.numReorderPics[i] = (i == 0) ? ((param.bBPyramid && param.bframes > 1) ? 2 : !!param.bframes) : i;
+        vps.maxDecPicBuffering[i] = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics[i] + 2, (uint32_t)param.maxNumReferences) + 1);
+    }
+    if (!!param.bEnableTemporalSubLayers)
+    {
+        for (int i = 0; i < MAX_T_LAYERS - 1; i++)
+        {
+            // a lower layer can not have higher value of numReorderPics than a higher layer
+            if (vps.numReorderPics[i + 1] < vps.numReorderPics[i])
+            {
+                vps.numReorderPics[i + 1] = vps.numReorderPics[i];
+            }
+            // the value of numReorderPics[i] shall be in the range of 0 to maxDecPicBuffering[i] - 1, inclusive
+            if (vps.numReorderPics[i] > vps.maxDecPicBuffering[i] - 1)
+            {
+                vps.maxDecPicBuffering[i] = vps.numReorderPics[i] + 1;
+            }
+            // a lower layer can not have higher value of maxDecPicBuffering than a higher layer
+            if (vps.maxDecPicBuffering[i + 1] < vps.maxDecPicBuffering[i])
+            {
+                vps.maxDecPicBuffering[i + 1] = vps.maxDecPicBuffering[i];
+            }
+        }
+
+        // the value of numReorderPics[i] shall be in the range of 0 to maxDecPicBuffering[i] - 1, inclusive
+        if (vps.numReorderPics[MAX_T_LAYERS - 1] > vps.maxDecPicBuffering[MAX_T_LAYERS - 1] - 1)
+        {
+            vps.maxDecPicBuffering[MAX_T_LAYERS - 1] = vps.numReorderPics[MAX_T_LAYERS - 1] + 1;
+        }
+    }
 
     /* no level specified by user, just auto-detect from the configuration */
     if (param.levelIdc <= 0)
         return true;
@@ -391,10 +422,10 @@
     }
 
     int savedRefCount = param.maxNumReferences;
-    while (vps.maxDecPicBuffering > maxDpbSize && param.maxNumReferences > 1)
+    while (vps.maxDecPicBuffering[vps.maxTempSubLayers - 1] > maxDpbSize && param.maxNumReferences > 1)
     {
         param.maxNumReferences--;
-        vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 1, (uint32_t)param.maxNumReferences) + 1);
+        vps.maxDecPicBuffering[vps.maxTempSubLayers - 1] = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics[vps.maxTempSubLayers - 1] + 1, (uint32_t)param.maxNumReferences) + 1);
     }
     if (param.maxNumReferences != savedRefCount)
         x265_log(&param, X265_LOG_WARNING, "Lowering max references to %d to meet level requirement\n", param.maxNumReferences);
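For context on the enforceLevel() changes above, a stand-alone sketch of the per-sub-layer initialization step with made-up parameters (3 sub-layers, B-pyramid, 3 references); the monotonic fix-ups shown in the hunk would then run on these arrays:

#include <cstdio>
#include <cstdint>
#include <algorithm>

int main()
{
    const uint32_t maxTempSubLayers = 3, maxNumReferences = 3, MAX_NUM_REF = 16;
    const bool bPyramid = true;
    const int bframes = 4;

    uint32_t numReorderPics[8] = {}, maxDecPicBuffering[8] = {};
    for (uint32_t i = 0; i < maxTempSubLayers; i++)
    {
        numReorderPics[i] = (i == 0) ? ((bPyramid && bframes > 1) ? 2u : (uint32_t)!!bframes) : i;
        maxDecPicBuffering[i] = std::min(MAX_NUM_REF, std::max(numReorderPics[i] + 2, maxNumReferences) + 1);
    }
    for (uint32_t i = 0; i < maxTempSubLayers; i++)
        printf("layer %u: numReorderPics=%u maxDecPicBuffering=%u\n", i, numReorderPics[i], maxDecPicBuffering[i]);
    return 0;
}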
View file
x265_3.5.tar.gz/source/encoder/motion.cpp -> x265_3.6.tar.gz/source/encoder/motion.cpp
Changed
@@ -190,6 +190,31 @@
     X265_CHECK(!bChromaSATD, "chroma distortion measurements impossible in this code path\n");
 }
 
+/* Called by lookahead, luma only, no use of PicYuv */
+void MotionEstimate::setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int method, const int refine)
+{
+    partEnum = partitionFromSizes(pwidth, pheight);
+    X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n");
+    sad = primitives.pu[partEnum].sad;
+    ads = primitives.pu[partEnum].ads;
+    satd = primitives.pu[partEnum].satd;
+    sad_x3 = primitives.pu[partEnum].sad_x3;
+    sad_x4 = primitives.pu[partEnum].sad_x4;
+
+
+    blockwidth = pwidth;
+    blockOffset = offset;
+    absPartIdx = ctuAddr = -1;
+
+    /* Search params */
+    searchMethod = method;
+    subpelRefine = refine;
+
+    /* copy PU block into cache */
+    primitives.pu[partEnum].copy_pp(fencPUYuv.m_buf[0], FENC_STRIDE, fencY + offset, stride);
+    X265_CHECK(!bChromaSATD, "chroma distortion measurements impossible in this code path\n");
+}
+
 /* Called by Search::predInterSearch() or --pme equivalent, chroma residual might be considered */
 void MotionEstimate::setSourcePU(const Yuv& srcFencYuv, int _ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight, const int method, const int refine, bool bChroma)
 {
View file
x265_3.5.tar.gz/source/encoder/motion.h -> x265_3.6.tar.gz/source/encoder/motion.h
Changed
@@ -77,7 +77,7 @@
     void init(int csp);
 
     /* Methods called at slice setup */
-
+    void setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int searchMethod, const int subpelRefine);
     void setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int searchMethod, const int searchL0, const int searchL1, const int subpelRefine);
     void setSourcePU(const Yuv& srcFencYuv, int ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight, const int searchMethod, const int subpelRefine, bool bChroma);
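The new overload declared above lets the lookahead hand the motion estimator a bare luma plane rather than a Yuv object; as the motion.cpp hunk shows, the PU pixels are copied into the estimator's fixed-stride cache. A stand-alone sketch of that kind of strided block copy (names and the generic memcpy are illustrative; x265 itself uses its optimized copy_pp primitives and FENC_STRIDE):

#include <cstdint>
#include <cstring>

typedef uint8_t pixel; // 8-bit pixels assumed for this sketch

// Copy a pwidth x pheight block starting at 'offset' in a source plane of stride
// 'stride' into a cache buffer that uses a fixed stride of its own.
static void copyBlockToCache(pixel* cache, intptr_t cacheStride,
                             const pixel* plane, intptr_t stride, intptr_t offset,
                             int pwidth, int pheight)
{
    const pixel* src = plane + offset;
    for (int y = 0; y < pheight; y++)
        memcpy(cache + y * cacheStride, src + y * stride, pwidth * sizeof(pixel));
}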
View file
x265_3.5.tar.gz/source/encoder/nal.cpp -> x265_3.6.tar.gz/source/encoder/nal.cpp
Changed
@@ -57,7 +57,7 @@
     other.m_buffer = X265_MALLOC(uint8_t, m_allocSize);
 }
 
-void NALList::serialize(NalUnitType nalUnitType, const Bitstream& bs)
+void NALList::serialize(NalUnitType nalUnitType, const Bitstream& bs, uint8_t temporalID)
 {
     static const char startCodePrefix[] = { 0, 0, 0, 1 };
@@ -114,7 +114,7 @@
      * nuh_reserved_zero_6bits  6-bits
      * nuh_temporal_id_plus1    3-bits */
     out[bytes++] = (uint8_t)nalUnitType << 1;
-    out[bytes++] = 1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N);
+    out[bytes++] = temporalID;
 
     /* 7.4.1 ...
      * Within the NAL unit, the following three-byte sequences shall not occur at
View file
x265_3.5.tar.gz/source/encoder/nal.h -> x265_3.6.tar.gz/source/encoder/nal.h
Changed
@@ -56,7 +56,7 @@
     void takeContents(NALList& other);
 
-    void serialize(NalUnitType nalUnitType, const Bitstream& bs);
+    void serialize(NalUnitType nalUnitType, const Bitstream& bs, uint8_t temporalID = 1);
 
     uint32_t serializeSubstreams(uint32_t* streamSizeBytes, uint32_t streamCount, const Bitstream* streams);
 };
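Together with the nal.cpp hunk above, this default argument means the second NAL header byte now carries the caller-supplied nuh_temporal_id_plus1 (1, i.e. temporal layer 0, unless a sub-layer encode passes something higher). A stand-alone sketch of how the two header bytes are composed when nuh_reserved_zero_6bits is 0; the example nal_unit_type and temporal id are arbitrary:

#include <cstdio>
#include <cstdint>

// HEVC NAL unit header: forbidden_zero_bit(1) | nal_unit_type(6) | nuh_reserved_zero_6bits(6) | nuh_temporal_id_plus1(3)
static void nalHeader(uint8_t nalUnitType, uint8_t temporalIdPlus1, uint8_t out[2])
{
    out[0] = (uint8_t)(nalUnitType << 1); // forbidden_zero_bit = 0, top bit of the reserved field = 0
    out[1] = temporalIdPlus1;             // remaining reserved bits = 0, low 3 bits = temporal id + 1
}

int main()
{
    uint8_t hdr[2];
    nalHeader(1 /* TRAIL_R */, 2 /* temporal layer 1 */, hdr);
    printf("NAL header bytes: 0x%02x 0x%02x\n", hdr[0], hdr[1]);
    return 0;
}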
View file
x265_3.5.tar.gz/source/encoder/ratecontrol.cpp -> x265_3.6.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -41,6 +41,10 @@ #define BR_SHIFT 6 #define CPB_SHIFT 4 +#define SHARED_DATA_ALIGNMENT 4 ///< 4btye, 32bit +#define CUTREE_SHARED_MEM_NAME "cutree" +#define GOP_CNT_CU_TREE 3 + using namespace X265_NS; /* Amortize the partial cost of I frames over the next N frames */ @@ -104,6 +108,37 @@ return output; } +typedef struct CUTreeSharedDataItem +{ + uint8_t *type; + uint16_t *stats; +}CUTreeSharedDataItem; + +void static ReadSharedCUTreeData(void *dst, void *src, int32_t size) +{ + CUTreeSharedDataItem *statsDst = reinterpret_cast<CUTreeSharedDataItem *>(dst); + uint8_t *typeSrc = reinterpret_cast<uint8_t *>(src); + *statsDst->type = *typeSrc; + + ///< for memory alignment, the type will take 32bit in the shared memory + int32_t offset = (sizeof(*statsDst->type) + SHARED_DATA_ALIGNMENT - 1) & ~(SHARED_DATA_ALIGNMENT - 1); + uint16_t *statsSrc = reinterpret_cast<uint16_t *>(typeSrc + offset); + memcpy(statsDst->stats, statsSrc, size - offset); +} + +void static WriteSharedCUTreeData(void *dst, void *src, int32_t size) +{ + CUTreeSharedDataItem *statsSrc = reinterpret_cast<CUTreeSharedDataItem *>(src); + uint8_t *typeDst = reinterpret_cast<uint8_t *>(dst); + *typeDst = *statsSrc->type; + + ///< for memory alignment, the type will take 32bit in the shared memory + int32_t offset = (sizeof(*statsSrc->type) + SHARED_DATA_ALIGNMENT - 1) & ~(SHARED_DATA_ALIGNMENT - 1); + uint16_t *statsDst = reinterpret_cast<uint16_t *>(typeDst + offset); + memcpy(statsDst, statsSrc->stats, size - offset); +} + + inline double qScale2bits(RateControlEntry *rce, double qScale) { if (qScale < 0.1) @@ -209,6 +244,7 @@ m_lastAbrResetPoc = -1; m_statFileOut = NULL; m_cutreeStatFileOut = m_cutreeStatFileIn = NULL; + m_cutreeShrMem = NULL; m_rce2Pass = NULL; m_encOrder = NULL; m_lastBsliceSatdCost = 0; @@ -224,6 +260,8 @@ m_initVbv = false; m_singleFrameVbv = 0; m_rateTolerance = 1.0; + m_encodedSegmentBits = 0; + m_segDur = 0; if (m_param->rc.vbvBufferSize) { @@ -320,47 +358,86 @@ m_cuTreeStats.qpBufferi = NULL; } -bool RateControl::init(const SPS& sps) +bool RateControl::initCUTreeSharedMem() { - if (m_isVbv && !m_initVbv) - { - /* We don't support changing the ABR bitrate right now, - * so if the stream starts as CBR, keep it CBR. 
*/ - if (m_param->rc.vbvBufferSize < (int)(m_param->rc.vbvMaxBitrate / m_fps)) + if (!m_cutreeShrMem) { + m_cutreeShrMem = new RingMem(); + if (!m_cutreeShrMem) { - m_param->rc.vbvBufferSize = (int)(m_param->rc.vbvMaxBitrate / m_fps); - x265_log(m_param, X265_LOG_WARNING, "VBV buffer size cannot be smaller than one frame, using %d kbit\n", - m_param->rc.vbvBufferSize); + return false; } - int vbvBufferSize = m_param->rc.vbvBufferSize * 1000; - int vbvMaxBitrate = m_param->rc.vbvMaxBitrate * 1000; - if (m_param->bEmitHRDSEI && !m_param->decoderVbvMaxRate) + ///< now cutree data form at most 3 gops would be stored in the shared memory at the same time + int32_t itemSize = (sizeof(uint8_t) + SHARED_DATA_ALIGNMENT - 1) & ~(SHARED_DATA_ALIGNMENT - 1); + if (m_param->rc.qgSize == 8) { - const HRDInfo* hrd = &sps.vuiParameters.hrdParameters; - vbvBufferSize = hrd->cpbSizeValue << (hrd->cpbSizeScale + CPB_SHIFT); - vbvMaxBitrate = hrd->bitRateValue << (hrd->bitRateScale + BR_SHIFT); + itemSize += sizeof(uint16_t) * m_ncu * 4; } - m_bufferRate = vbvMaxBitrate / m_fps; - m_vbvMaxRate = vbvMaxBitrate; - m_bufferSize = vbvBufferSize; - m_singleFrameVbv = m_bufferRate * 1.1 > m_bufferSize; + else + { + itemSize += sizeof(uint16_t) * m_ncu; + } + + int32_t itemCnt = X265_MIN(m_param->keyframeMax, (int)(m_fps + 0.5)); + itemCnt *= GOP_CNT_CU_TREE; - if (m_param->rc.vbvBufferInit > 1.) - m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, m_param->rc.vbvBufferInit / m_param->rc.vbvBufferSize); - if (m_param->vbvBufferEnd > 1.) - m_param->vbvBufferEnd = x265_clip3(0.0, 1.0, m_param->vbvBufferEnd / m_param->rc.vbvBufferSize); - if (m_param->vbvEndFrameAdjust > 1.) - m_param->vbvEndFrameAdjust = x265_clip3(0.0, 1.0, m_param->vbvEndFrameAdjust); - m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, X265_MAX(m_param->rc.vbvBufferInit, m_bufferRate / m_bufferSize)); - m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit; - m_bufferFillActual = m_bufferFillFinal; - m_bufferExcess = 0; - m_minBufferFill = m_param->minVbvFullness / 100; - m_maxBufferFill = 1 - (m_param->maxVbvFullness / 100); - m_initVbv = true; + char shrnameMAX_SHR_NAME_LEN = { 0 }; + strcpy(shrname, m_param->rc.sharedMemName); + strcat(shrname, CUTREE_SHARED_MEM_NAME); + + if (!m_cutreeShrMem->init(itemSize, itemCnt, shrname)) + { + return false; + } } + return true; +} + +void RateControl::initVBV(const SPS& sps) +{ + /* We don't support changing the ABR bitrate right now, + * so if the stream starts as CBR, keep it CBR. */ + if (m_param->rc.vbvBufferSize < (int)(m_param->rc.vbvMaxBitrate / m_fps)) + { + m_param->rc.vbvBufferSize = (int)(m_param->rc.vbvMaxBitrate / m_fps); + x265_log(m_param, X265_LOG_WARNING, "VBV buffer size cannot be smaller than one frame, using %d kbit\n", + m_param->rc.vbvBufferSize); + } + int vbvBufferSize = m_param->rc.vbvBufferSize * 1000; + int vbvMaxBitrate = m_param->rc.vbvMaxBitrate * 1000; + + if (m_param->bEmitHRDSEI && !m_param->decoderVbvMaxRate) + { + const HRDInfo* hrd = &sps.vuiParameters.hrdParameters; + vbvBufferSize = hrd->cpbSizeValue << (hrd->cpbSizeScale + CPB_SHIFT); + vbvMaxBitrate = hrd->bitRateValue << (hrd->bitRateScale + BR_SHIFT); + } + m_bufferRate = vbvMaxBitrate / m_fps; + m_vbvMaxRate = vbvMaxBitrate; + m_bufferSize = vbvBufferSize; + m_singleFrameVbv = m_bufferRate * 1.1 > m_bufferSize; + + if (m_param->rc.vbvBufferInit > 1.) + m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, m_param->rc.vbvBufferInit / m_param->rc.vbvBufferSize); + if (m_param->vbvBufferEnd > 1.) 
+ m_param->vbvBufferEnd = x265_clip3(0.0, 1.0, m_param->vbvBufferEnd / m_param->rc.vbvBufferSize); + if (m_param->vbvEndFrameAdjust > 1.) + m_param->vbvEndFrameAdjust = x265_clip3(0.0, 1.0, m_param->vbvEndFrameAdjust); + m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, X265_MAX(m_param->rc.vbvBufferInit, m_bufferRate / m_bufferSize)); + m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit; + m_bufferFillActual = m_bufferFillFinal; + m_bufferExcess = 0; + m_minBufferFill = m_param->minVbvFullness / 100; + m_maxBufferFill = 1 - (m_param->maxVbvFullness / 100); + m_initVbv = true; +} + +bool RateControl::init(const SPS& sps) +{ + if (m_isVbv && !m_initVbv) + initVBV(sps); + if (!m_param->bResetZoneConfig && (m_relativeComplexity == NULL)) { m_relativeComplexity = X265_MALLOC(double, m_param->reconfigWindowSize); @@ -373,7 +450,9 @@ m_totalBits = 0; m_encodedBits = 0; + m_encodedSegmentBits = 0; m_framesDone = 0; + m_segDur = 0; m_residualCost = 0; m_partialResidualCost = 0; m_amortizeFraction = 0.85; @@ -421,244 +500,257 @@ /* Load stat file and init 2pass algo */ if (m_param->rc.bStatRead) { - m_expectedBitsSum = 0; - char *p, *statsIn, *statsBuf; - /* read 1st pass stats */ - statsIn = statsBuf = x265_slurp_file(fileName); - if (!statsBuf) - return false; - if (m_param->rc.cuTree) + if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode) { - char *tmpFile = strcatFilename(fileName, ".cutree"); - if (!tmpFile) + m_expectedBitsSum = 0; + char *p, *statsIn, *statsBuf; + /* read 1st pass stats */ + statsIn = statsBuf = x265_slurp_file(fileName); + if (!statsBuf) return false; - m_cutreeStatFileIn = x265_fopen(tmpFile, "rb"); - X265_FREE(tmpFile); - if (!m_cutreeStatFileIn) + if (m_param->rc.cuTree) { - x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.cutree\n", fileName); - return false; + char *tmpFile = strcatFilename(fileName, ".cutree"); + if (!tmpFile) + return false; + m_cutreeStatFileIn = x265_fopen(tmpFile, "rb"); + X265_FREE(tmpFile); + if (!m_cutreeStatFileIn) + { + x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.cutree\n", fileName); + return false; + } } - } - /* check whether 1st pass options were compatible with current options */ - if (strncmp(statsBuf, "#options:", 9)) - { - x265_log(m_param, X265_LOG_ERROR,"options list in stats file not valid\n"); - return false; - } - { - int i, j, m; - uint32_t k , l; - bool bErr = false; - char *opts = statsBuf; - statsIn = strchr(statsBuf, '\n'); - if (!statsIn) - { - x265_log(m_param, X265_LOG_ERROR, "Malformed stats file\n"); - return false; - } - *statsIn = '\0'; - statsIn++; - if ((p = strstr(opts, " input-res=")) == 0 || sscanf(p, " input-res=%dx%d", &i, &j) != 2) - { - x265_log(m_param, X265_LOG_ERROR, "Resolution specified in stats file not valid\n"); - return false; - } - if ((p = strstr(opts, " fps=")) == 0 || sscanf(p, " fps=%u/%u", &k, &l) != 2) - { - x265_log(m_param, X265_LOG_ERROR, "fps specified in stats file not valid\n"); - return false; - } - if (((p = strstr(opts, " vbv-maxrate=")) == 0 || sscanf(p, " vbv-maxrate=%d", &m) != 1) && m_param->rc.rateControlMode == X265_RC_CRF) - { - x265_log(m_param, X265_LOG_ERROR, "Constant rate-factor is incompatible with 2pass without vbv-maxrate in the previous pass\n"); - return false; - } - if (k != m_param->fpsNum || l != m_param->fpsDenom) + /* check whether 1st pass options were compatible with current options */ + if (strncmp(statsBuf, "#options:", 9)) { - x265_log(m_param, X265_LOG_ERROR, "fps mismatch with 1st pass (%u/%u vs %u/%u)\n", - 
m_param->fpsNum, m_param->fpsDenom, k, l); + x265_log(m_param, X265_LOG_ERROR, "options list in stats file not valid\n"); return false; } - if (m_param->analysisMultiPassRefine) { - p = strstr(opts, "ref="); - sscanf(p, "ref=%d", &i); - if (i > m_param->maxNumReferences) + int i, j, m; + uint32_t k, l; + bool bErr = false; + char *opts = statsBuf; + statsIn = strchr(statsBuf, '\n'); + if (!statsIn) { - x265_log(m_param, X265_LOG_ERROR, "maxNumReferences cannot be less than 1st pass (%d vs %d)\n", - i, m_param->maxNumReferences); + x265_log(m_param, X265_LOG_ERROR, "Malformed stats file\n"); return false; } - } - if (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion) - { - p = strstr(opts, "ctu="); - sscanf(p, "ctu=%u", &k); - if (k != m_param->maxCUSize) + *statsIn = '\0'; + statsIn++; + if ((p = strstr(opts, " input-res=")) == 0 || sscanf(p, " input-res=%dx%d", &i, &j) != 2) { - x265_log(m_param, X265_LOG_ERROR, "maxCUSize mismatch with 1st pass (%u vs %u)\n", - k, m_param->maxCUSize); + x265_log(m_param, X265_LOG_ERROR, "Resolution specified in stats file not valid\n"); return false; } + if ((p = strstr(opts, " fps=")) == 0 || sscanf(p, " fps=%u/%u", &k, &l) != 2) + { + x265_log(m_param, X265_LOG_ERROR, "fps specified in stats file not valid\n"); + return false; + } + if (((p = strstr(opts, " vbv-maxrate=")) == 0 || sscanf(p, " vbv-maxrate=%d", &m) != 1) && m_param->rc.rateControlMode == X265_RC_CRF) + { + x265_log(m_param, X265_LOG_ERROR, "Constant rate-factor is incompatible with 2pass without vbv-maxrate in the previous pass\n"); + return false; + } + if (k != m_param->fpsNum || l != m_param->fpsDenom) + { + x265_log(m_param, X265_LOG_ERROR, "fps mismatch with 1st pass (%u/%u vs %u/%u)\n", + m_param->fpsNum, m_param->fpsDenom, k, l); + return false; + } + if (m_param->analysisMultiPassRefine) + { + p = strstr(opts, "ref="); + sscanf(p, "ref=%d", &i); + if (i > m_param->maxNumReferences) + { + x265_log(m_param, X265_LOG_ERROR, "maxNumReferences cannot be less than 1st pass (%d vs %d)\n", + i, m_param->maxNumReferences); + return false; + } + } + if (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion) + { + p = strstr(opts, "ctu="); + sscanf(p, "ctu=%u", &k); + if (k != m_param->maxCUSize) + { + x265_log(m_param, X265_LOG_ERROR, "maxCUSize mismatch with 1st pass (%u vs %u)\n", + k, m_param->maxCUSize); + return false; + } + } + CMP_OPT_FIRST_PASS("bitdepth", m_param->internalBitDepth); + CMP_OPT_FIRST_PASS("weightp", m_param->bEnableWeightedPred); + CMP_OPT_FIRST_PASS("bframes", m_param->bframes); + CMP_OPT_FIRST_PASS("b-pyramid", m_param->bBPyramid); + CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP); + CMP_OPT_FIRST_PASS(" keyint", m_param->keyframeMax); + CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold); + CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh); + CMP_OPT_FIRST_PASS("frame-dup", m_param->bEnableFrameDuplication); + if (m_param->bMultiPassOptRPS) + { + CMP_OPT_FIRST_PASS("multi-pass-opt-rps", m_param->bMultiPassOptRPS); + CMP_OPT_FIRST_PASS("repeat-headers", m_param->bRepeatHeaders); + CMP_OPT_FIRST_PASS("min-keyint", m_param->keyframeMin); + } + + if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS) + { + m_param->bFrameAdaptive = i; + } + else if (m_param->bframes) + { + x265_log(m_param, X265_LOG_ERROR, "b-adapt method specified in stats file not valid\n"); + return false; + } + + if ((p = strstr(opts, "rc-lookahead=")) != 0 && 
sscanf(p, "rc-lookahead=%d", &i)) + m_param->lookaheadDepth = i; } - CMP_OPT_FIRST_PASS("bitdepth", m_param->internalBitDepth); - CMP_OPT_FIRST_PASS("weightp", m_param->bEnableWeightedPred); - CMP_OPT_FIRST_PASS("bframes", m_param->bframes); - CMP_OPT_FIRST_PASS("b-pyramid", m_param->bBPyramid); - CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP); - CMP_OPT_FIRST_PASS(" keyint", m_param->keyframeMax); - CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold); - CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh); - CMP_OPT_FIRST_PASS("frame-dup", m_param->bEnableFrameDuplication); - if (m_param->bMultiPassOptRPS) + /* find number of pics */ + p = statsIn; + int numEntries; + for (numEntries = -1; p; numEntries++) + p = strchr(p + 1, ';'); + if (!numEntries) { - CMP_OPT_FIRST_PASS("multi-pass-opt-rps", m_param->bMultiPassOptRPS); - CMP_OPT_FIRST_PASS("repeat-headers", m_param->bRepeatHeaders); - CMP_OPT_FIRST_PASS("min-keyint", m_param->keyframeMin); + x265_log(m_param, X265_LOG_ERROR, "empty stats file\n"); + return false; } + m_numEntries = numEntries; - if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS) + if (m_param->totalFrames < m_numEntries && m_param->totalFrames > 0) { - m_param->bFrameAdaptive = i; + x265_log(m_param, X265_LOG_WARNING, "2nd pass has fewer frames than 1st pass (%d vs %d)\n", + m_param->totalFrames, m_numEntries); } - else if (m_param->bframes) + if (m_param->totalFrames > m_numEntries && !m_param->bEnableFrameDuplication) { - x265_log(m_param, X265_LOG_ERROR, "b-adapt method specified in stats file not valid\n"); + x265_log(m_param, X265_LOG_ERROR, "2nd pass has more frames than 1st pass (%d vs %d)\n", + m_param->totalFrames, m_numEntries); return false; } - if ((p = strstr(opts, "rc-lookahead=")) != 0 && sscanf(p, "rc-lookahead=%d", &i)) - m_param->lookaheadDepth = i; - } - /* find number of pics */ - p = statsIn; - int numEntries; - for (numEntries = -1; p; numEntries++) - p = strchr(p + 1, ';'); - if (!numEntries) - { - x265_log(m_param, X265_LOG_ERROR, "empty stats file\n"); - return false; - } - m_numEntries = numEntries; - - if (m_param->totalFrames < m_numEntries && m_param->totalFrames > 0) - { - x265_log(m_param, X265_LOG_WARNING, "2nd pass has fewer frames than 1st pass (%d vs %d)\n", - m_param->totalFrames, m_numEntries); - } - if (m_param->totalFrames > m_numEntries && !m_param->bEnableFrameDuplication) - { - x265_log(m_param, X265_LOG_ERROR, "2nd pass has more frames than 1st pass (%d vs %d)\n", - m_param->totalFrames, m_numEntries); - return false; - } - - m_rce2Pass = X265_MALLOC(RateControlEntry, m_numEntries); - if (!m_rce2Pass) - { - x265_log(m_param, X265_LOG_ERROR, "Rce Entries for 2 pass cannot be allocated\n"); - return false; - } - m_encOrder = X265_MALLOC(int, m_numEntries); - if (!m_encOrder) - { - x265_log(m_param, X265_LOG_ERROR, "Encode order for 2 pass cannot be allocated\n"); - return false; - } - /* init all to skipped p frames */ - for (int i = 0; i < m_numEntries; i++) - { - RateControlEntry *rce = &m_rce2Passi; - rce->sliceType = P_SLICE; - rce->qScale = rce->newQScale = x265_qp2qScale(20); - rce->miscBits = m_ncu + 10; - rce->newQp = 0; - } - /* read stats */ - p = statsIn; - double totalQpAq = 0; - for (int i = 0; i < m_numEntries; i++) - { - RateControlEntry *rce, *rcePocOrder; - int frameNumber; - int encodeOrder; - char picType; - int e; - char *next; - double qpRc, qpAq, qNoVbv, qRceq; - next = strstr(p, ";"); - if (next) - *next++ = 0; - e 
= sscanf(p, " in:%d out:%d", &frameNumber, &encodeOrder); - if (frameNumber < 0 || frameNumber >= m_numEntries) + m_rce2Pass = X265_MALLOC(RateControlEntry, m_numEntries); + if (!m_rce2Pass) { - x265_log(m_param, X265_LOG_ERROR, "bad frame number (%d) at stats line %d\n", frameNumber, i); + x265_log(m_param, X265_LOG_ERROR, "Rce Entries for 2 pass cannot be allocated\n"); return false; } - rce = &m_rce2PassencodeOrder; - rcePocOrder = &m_rce2PassframeNumber; - m_encOrderframeNumber = encodeOrder; - if (!m_param->bMultiPassOptRPS) - { - int scenecut = 0; - e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf sc:%d", - &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, - &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, - &rce->skipCuCount, &scenecut); - rcePocOrder->scenecut = scenecut != 0; + m_encOrder = X265_MALLOC(int, m_numEntries); + if (!m_encOrder) + { + x265_log(m_param, X265_LOG_ERROR, "Encode order for 2 pass cannot be allocated\n"); + return false; } - else + /* init all to skipped p frames */ + for (int i = 0; i < m_numEntries; i++) { - char deltaPOC128; - char bUsed40; - memset(deltaPOC, 0, sizeof(deltaPOC)); - memset(bUsed, 0, sizeof(bUsed)); - e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf nump:%d numnegp:%d numposp:%d deltapoc:%s bused:%s", - &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, - &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, - &rce->skipCuCount, &rce->rpsData.numberOfPictures, &rce->rpsData.numberOfNegativePictures, &rce->rpsData.numberOfPositivePictures, deltaPOC, bUsed); - splitdeltaPOC(deltaPOC, rce); - splitbUsed(bUsed, rce); - rce->rpsIdx = -1; - } - rce->keptAsRef = true; - rce->isIdr = false; - if (picType == 'b' || picType == 'p') - rce->keptAsRef = false; - if (picType == 'I') - rce->isIdr = true; - if (picType == 'I' || picType == 'i') - rce->sliceType = I_SLICE; - else if (picType == 'P' || picType == 'p') + RateControlEntry *rce = &m_rce2Passi; rce->sliceType = P_SLICE; - else if (picType == 'B' || picType == 'b') - rce->sliceType = B_SLICE; - else - e = -1; - if (e < 10) + rce->qScale = rce->newQScale = x265_qp2qScale(20); + rce->miscBits = m_ncu + 10; + rce->newQp = 0; + } + /* read stats */ + p = statsIn; + double totalQpAq = 0; + for (int i = 0; i < m_numEntries; i++) + { + RateControlEntry *rce, *rcePocOrder; + int frameNumber; + int encodeOrder; + char picType; + int e; + char *next; + double qpRc, qpAq, qNoVbv, qRceq; + next = strstr(p, ";"); + if (next) + *next++ = 0; + e = sscanf(p, " in:%d out:%d", &frameNumber, &encodeOrder); + if (frameNumber < 0 || frameNumber >= m_numEntries) + { + x265_log(m_param, X265_LOG_ERROR, "bad frame number (%d) at stats line %d\n", frameNumber, i); + return false; + } + rce = &m_rce2PassencodeOrder; + rcePocOrder = &m_rce2PassframeNumber; + m_encOrderframeNumber = encodeOrder; + if (!m_param->bMultiPassOptRPS) + { + int scenecut = 0; + e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf sc:%d", + &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, + &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, + &rce->skipCuCount, &scenecut); + rcePocOrder->scenecut = scenecut != 0; + } + else + { + char deltaPOC128; + char bUsed40; + memset(deltaPOC, 0, sizeof(deltaPOC)); + memset(bUsed, 0, sizeof(bUsed)); + e += sscanf(p, " in:%*d out:%*d 
type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf nump:%d numnegp:%d numposp:%d deltapoc:%s bused:%s", + &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, + &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, + &rce->skipCuCount, &rce->rpsData.numberOfPictures, &rce->rpsData.numberOfNegativePictures, &rce->rpsData.numberOfPositivePictures, deltaPOC, bUsed); + splitdeltaPOC(deltaPOC, rce); + splitbUsed(bUsed, rce); + rce->rpsIdx = -1; + } + rce->keptAsRef = true; + rce->isIdr = false; + if (picType == 'b' || picType == 'p') + rce->keptAsRef = false; + if (picType == 'I') + rce->isIdr = true; + if (picType == 'I' || picType == 'i') + rce->sliceType = I_SLICE; + else if (picType == 'P' || picType == 'p') + rce->sliceType = P_SLICE; + else if (picType == 'B' || picType == 'b') + rce->sliceType = B_SLICE; + else + e = -1; + if (e < 10) + { + x265_log(m_param, X265_LOG_ERROR, "statistics are damaged at line %d, parser out=%d\n", i, e); + return false; + } + rce->qScale = rce->newQScale = x265_qp2qScale(qpRc); + totalQpAq += qpAq; + rce->qpNoVbv = qNoVbv; + rce->qpaRc = qpRc; + rce->qpAq = qpAq; + rce->qRceq = qRceq; + p = next; + } + X265_FREE(statsBuf); + if (m_param->rc.rateControlMode != X265_RC_CQP) + { + m_start = 0; + m_isQpModified = true; + if (!initPass2()) + return false; + } /* else we're using constant quant, so no need to run the bitrate allocation */ + } + else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode + { + if (m_param->rc.cuTree) { - x265_log(m_param, X265_LOG_ERROR, "statistics are damaged at line %d, parser out=%d\n", i, e); - return false; + if (!initCUTreeSharedMem()) + { + return false; + } } - rce->qScale = rce->newQScale = x265_qp2qScale(qpRc); - totalQpAq += qpAq; - rce->qpNoVbv = qNoVbv; - rce->qpaRc = qpRc; - rce->qpAq = qpAq; - rce->qRceq = qRceq; - p = next; - } - X265_FREE(statsBuf); - if (m_param->rc.rateControlMode != X265_RC_CQP) - { - m_start = 0; - m_isQpModified = true; - if (!initPass2()) - return false; - } /* else we're using constant quant, so no need to run the bitrate allocation */ + } } /* Open output file */ /* If input and output files are the same, output to a temp file @@ -682,19 +774,29 @@ X265_FREE(p); if (m_param->rc.cuTree && !m_param->rc.bStatRead) { - statFileTmpname = strcatFilename(fileName, ".cutree.temp"); - if (!statFileTmpname) - return false; - m_cutreeStatFileOut = x265_fopen(statFileTmpname, "wb"); - X265_FREE(statFileTmpname); - if (!m_cutreeStatFileOut) + if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode) { - x265_log_file(m_param, X265_LOG_ERROR, "can't open mbtree stats file %s.cutree.temp\n", fileName); - return false; + statFileTmpname = strcatFilename(fileName, ".cutree.temp"); + if (!statFileTmpname) + return false; + m_cutreeStatFileOut = x265_fopen(statFileTmpname, "wb"); + X265_FREE(statFileTmpname); + if (!m_cutreeStatFileOut) + { + x265_log_file(m_param, X265_LOG_ERROR, "can't open mbtree stats file %s.cutree.temp\n", fileName); + return false; + } + } + else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode + { + if (!initCUTreeSharedMem()) + { + return false; + } } } } - if (m_param->rc.cuTree) + if (m_param->rc.cuTree && !m_cuTreeStats.qpBuffer0) { if (m_param->rc.qgSize == 8) { @@ -714,6 +816,10 @@ return true; } +void RateControl::skipCUTreeSharedMemRead(int32_t cnt) +{ + m_cutreeShrMem->skipRead(cnt); +} void RateControl::reconfigureRC() { if (m_isVbv) @@ -806,7 +912,7 @@ TimingInfo *time = &sps.vuiParameters.timingInfo; int 
maxCpbOutputDelay = (int)(X265_MIN(m_param->keyframeMax * MAX_DURATION * time->timeScale / time->numUnitsInTick, INT_MAX)); - int maxDpbOutputDelay = (int)(sps.maxDecPicBuffering * MAX_DURATION * time->timeScale / time->numUnitsInTick); + int maxDpbOutputDelay = (int)(sps.maxDecPicBufferingsps.maxTempSubLayers - 1 * MAX_DURATION * time->timeScale / time->numUnitsInTick); int maxDelay = (int)(90000.0 * cpbSizeUnscale / bitRateUnscale + 0.5); hrd->initialCpbRemovalDelayLength = 2 + x265_clip3(4, 22, 32 - calcLength(maxDelay)); @@ -1000,125 +1106,103 @@ { uint64_t allConstBits = 0, allCodedBits = 0; uint64_t allAvailableBits = uint64_t(m_param->rc.bitrate * 1000. * m_numEntries * m_frameDuration); - int startIndex, framesCount, endIndex; + int startIndex, endIndex; int fps = X265_MIN(m_param->keyframeMax, (int)(m_fps + 0.5)); - startIndex = endIndex = framesCount = 0; - int diffQp = 0; + int distance = fps << 1; + distance = distance > m_param->keyframeMax ? (m_param->keyframeMax << 1) : m_param->keyframeMax; + startIndex = endIndex = 0; double targetBits = 0; double expectedBits = 0; - for (startIndex = m_start, endIndex = m_start; endIndex < m_numEntries; endIndex++) + double targetBits2 = 0; + double expectedBits2 = 0; + double cpxSum = 0; + double cpxSum2 = 0; + + if (m_param->rc.rateControlMode == X265_RC_ABR) { - allConstBits += m_rce2PassendIndex.miscBits; - allCodedBits += m_rce2PassendIndex.coeffBits + m_rce2PassendIndex.mvBits; - if (m_param->rc.rateControlMode == X265_RC_CRF) + for (endIndex = m_start; endIndex < m_numEntries; endIndex++) { - framesCount = endIndex - startIndex + 1; - diffQp += int (m_rce2PassendIndex.qpaRc - m_rce2PassendIndex.qpNoVbv); - if (framesCount > fps) - diffQp -= int (m_rce2PassendIndex - fps.qpaRc - m_rce2PassendIndex - fps.qpNoVbv); - if (framesCount >= fps) - { - if (diffQp >= 1) - { - if (!m_isQpModified && endIndex > fps) - { - double factor = 2; - double step = 0; - if (endIndex + fps >= m_numEntries) - { - m_start = endIndex - (endIndex % fps); - return true; - } - for (int start = endIndex + 1; start <= endIndex + fps && start < m_numEntries; start++) - { - RateControlEntry *rce = &m_rce2Passstart; - targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv)); - expectedBits += qScale2bits(rce, rce->qScale); - } - if (expectedBits < 0.95 * targetBits) - { - m_isQpModified = true; - m_isGopReEncoded = true; - while (endIndex + fps < m_numEntries) - { - step = pow(2, factor / 6.0); - expectedBits = 0; - for (int start = endIndex + 1; start <= endIndex + fps; start++) - { - RateControlEntry *rce = &m_rce2Passstart; - rce->newQScale = rce->qScale / step; - X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n"); - expectedBits += qScale2bits(rce, rce->newQScale); - rce->newQp = x265_qScale2qp(rce->newQScale); - } - if (expectedBits >= targetBits && step > 1) - factor *= 0.90; - else - break; - } - - if (m_isVbv && endIndex + fps < m_numEntries) - if (!vbv2Pass((uint64_t)targetBits, endIndex + fps, endIndex + 1)) - return false; - - targetBits = 0; - expectedBits = 0; - - for (int start = endIndex - fps + 1; start <= endIndex; start++) - { - RateControlEntry *rce = &m_rce2Passstart; - targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv)); - } - while (1) - { - step = pow(2, factor / 6.0); - expectedBits = 0; - for (int start = endIndex - fps + 1; start <= endIndex; start++) - { - RateControlEntry *rce = &m_rce2Passstart; - rce->newQScale = rce->qScale * step; - X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n"); - expectedBits += 
qScale2bits(rce, rce->newQScale); - rce->newQp = x265_qScale2qp(rce->newQScale); - } - if (expectedBits > targetBits && step > 1) - factor *= 1.1; - else - break; - } - if (m_isVbv) - if (!vbv2Pass((uint64_t)targetBits, endIndex, endIndex - fps + 1)) - return false; - diffQp = 0; - m_reencode = endIndex - fps + 1; - endIndex = endIndex + fps; - startIndex = endIndex + 1; - m_start = startIndex; - targetBits = expectedBits = 0; - } - else - targetBits = expectedBits = 0; - } - } - else - m_isQpModified = false; - } + allConstBits += m_rce2PassendIndex.miscBits; + allCodedBits += m_rce2PassendIndex.coeffBits + m_rce2PassendIndex.mvBits; } - } - if (m_param->rc.rateControlMode == X265_RC_ABR) - { if (allAvailableBits < allConstBits) { x265_log(m_param, X265_LOG_ERROR, "requested bitrate is too low. estimated minimum is %d kbps\n", - (int)(allConstBits * m_fps / framesCount * 1000.)); + (int)(allConstBits * m_fps / (m_numEntries - m_start) * 1000.)); return false; } if (!analyseABR2Pass(allAvailableBits)) return false; + + return true; + } + + if (m_isQpModified) + { + return true; + } + + if (m_start + (fps << 1) > m_numEntries) + { + return true; + } + + for (startIndex = m_start, endIndex = m_numEntries - 1; startIndex < endIndex; startIndex++, endIndex--) + { + cpxSum += m_rce2PassstartIndex.qScale / m_rce2PassstartIndex.coeffBits; + cpxSum2 += m_rce2PassendIndex.qScale / m_rce2PassendIndex.coeffBits; + + RateControlEntry *rce = &m_rce2PassstartIndex; + targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv)); + expectedBits += qScale2bits(rce, rce->qScale); + + rce = &m_rce2PassendIndex; + targetBits2 += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv)); + expectedBits2 += qScale2bits(rce, rce->qScale); } - m_start = X265_MAX(m_start, endIndex - fps); + if (expectedBits < 0.95 * targetBits || expectedBits2 < 0.95 * targetBits2) + { + if (cpxSum / cpxSum2 < 0.95 || cpxSum2 / cpxSum < 0.95) + { + m_isQpModified = true; + m_isGopReEncoded = true; + + m_shortTermCplxSum = 0; + m_shortTermCplxCount = 0; + m_framesDone = m_start; + + for (startIndex = m_start; startIndex < m_numEntries; startIndex++) + { + m_shortTermCplxSum *= 0.5; + m_shortTermCplxCount *= 0.5; + m_shortTermCplxSum += m_rce2PassstartIndex.currentSatd / (CLIP_DURATION(m_frameDuration) / BASE_FRAME_DURATION); + m_shortTermCplxCount++; + } + + m_bufferFill = m_rce2Passm_start - 1.bufferFill; + m_bufferFillFinal = m_rce2Passm_start - 1.bufferFillFinal; + m_bufferFillActual = m_rce2Passm_start - 1.bufferFillActual; + + m_reencode = m_start; + m_start = m_numEntries; + } + else + { + + m_isQpModified = false; + m_isGopReEncoded = false; + } + } + else + { + + m_isQpModified = false; + m_isGopReEncoded = false; + } + + m_start = X265_MAX(m_start, m_numEntries - distance + m_param->keyframeMax); return true; } @@ -1271,6 +1355,16 @@ m_predType = getPredictorType(curFrame->m_lowres.sliceType, m_sliceType); rce->poc = m_curSlice->m_poc; + if (m_param->bEnableSBRC) + { + if (rce->poc == 0 || (m_framesDone % m_param->keyframeMax == 0)) + { + //Reset SBRC buffer + m_encodedSegmentBits = 0; + m_segDur = 0; + } + } + if (!m_param->bResetZoneConfig && (rce->encodeOrder % m_param->reconfigWindowSize == 0)) { int index = m_zoneBufferIdx % m_param->rc.zonefileCount; @@ -1304,7 +1398,8 @@ { m_param = m_param->rc.zonesi.zoneParam; reconfigureRC(); - init(*m_curSlice->m_sps); + if (!m_param->bNoResetZoneConfig) + init(*m_curSlice->m_sps); } } } @@ -1391,15 +1486,57 @@ rce->frameSizeMaximum *= m_param->maxAUSizeFactor; } } + + ///< regenerate the 
qp if (!m_isAbr && m_2pass && m_param->rc.rateControlMode == X265_RC_CRF) { - rce->qpPrev = x265_qScale2qp(rce->qScale); - rce->qScale = rce->newQScale; - rce->qpaRc = curEncData.m_avgQpRc = curEncData.m_avgQpAq = x265_qScale2qp(rce->newQScale); - m_qp = int(rce->qpaRc + 0.5); - rce->frameSizePlanned = qScale2bits(rce, rce->qScale); - m_framesDone++; - return m_qp; + if (!m_param->rc.bEncFocusedFramesOnly) + { + rce->qpPrev = x265_qScale2qp(rce->qScale); + if (m_param->bEnableSceneCutAwareQp) + { + double lqmin = m_lminm_sliceType; + double lqmax = m_lmaxm_sliceType; + if (m_param->bEnableSceneCutAwareQp & FORWARD) + rce->newQScale = forwardMasking(curFrame, rce->newQScale); + if (m_param->bEnableSceneCutAwareQp & BACKWARD) + rce->newQScale = backwardMasking(curFrame, rce->newQScale); + rce->newQScale = x265_clip3(lqmin, lqmax, rce->newQScale); + } + rce->qScale = rce->newQScale; + rce->qpaRc = curEncData.m_avgQpRc = curEncData.m_avgQpAq = x265_qScale2qp(rce->newQScale); + m_qp = int(rce->qpaRc + 0.5); + rce->frameSizePlanned = qScale2bits(rce, rce->qScale); + m_framesDone++; + return m_qp; + } + else + { + int index = m_encOrderrce->poc; + index++; + double totalDuration = m_frameDuration; + for (int j = 0; totalDuration < 1.0 && index < m_numEntries; j++) + { + switch (m_rce2Passindex.sliceType) + { + case B_SLICE: + curFrame->m_lowres.plannedTypej = m_rce2Passindex.keptAsRef ? X265_TYPE_BREF : X265_TYPE_B; + break; + case P_SLICE: + curFrame->m_lowres.plannedTypej = X265_TYPE_P; + break; + case I_SLICE: + curFrame->m_lowres.plannedTypej = m_param->bOpenGOP ? X265_TYPE_I : X265_TYPE_IDR; + break; + default: + break; + } + + curFrame->m_lowres.plannedSatdj = m_rce2Passindex.currentSatd; + totalDuration += m_frameDuration; + index++; + } + } } if (m_isAbr || m_2pass) // ABR,CRF @@ -1655,10 +1792,25 @@ { m_cuTreeStats.qpBufPos++; - if (!fread(&type, 1, 1, m_cutreeStatFileIn)) - goto fail; - if (fread(m_cuTreeStats.qpBufferm_cuTreeStats.qpBufPos, sizeof(uint16_t), ncu, m_cutreeStatFileIn) != (size_t)ncu) - goto fail; + if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode) + { + if (!fread(&type, 1, 1, m_cutreeStatFileIn)) + goto fail; + if (fread(m_cuTreeStats.qpBufferm_cuTreeStats.qpBufPos, sizeof(uint16_t), ncu, m_cutreeStatFileIn) != (size_t)ncu) + goto fail; + } + else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode + { + if (!m_cutreeShrMem) + { + goto fail; + } + + CUTreeSharedDataItem shrItem; + shrItem.type = &type; + shrItem.stats = m_cuTreeStats.qpBufferm_cuTreeStats.qpBufPos; + m_cutreeShrMem->readNext(&shrItem, ReadSharedCUTreeData); + } if (type != sliceTypeActual && m_cuTreeStats.qpBufPos == 1) { @@ -1785,7 +1937,7 @@ m_sliderPos++; } - if (m_sliceType == B_SLICE) + if((!m_param->bEnableSBRC && m_sliceType == B_SLICE) || (m_param->bEnableSBRC && !IS_REFERENCED(curFrame))) { /* B-frames don't have independent rate control, but rather get the * average QP of the two adjacent P-frames + an offset */ @@ -1836,8 +1988,16 @@ double minScenecutQscale =x265_qp2qScale(ABR_SCENECUT_INIT_QP_MIN); m_lastQScaleForP_SLICE = X265_MAX(minScenecutQscale, m_lastQScaleForP_SLICE); } + double qScale = x265_qp2qScale(q); rce->qpNoVbv = q; + + if (m_param->bEnableSBRC) + { + qScale = tuneQscaleForSBRC(curFrame, qScale); + rce->qpNoVbv = x265_qScale2qp(qScale); + } + double lmin = 0, lmax = 0; if (m_isGrainEnabled && m_isFirstMiniGop) { @@ -1890,7 +2050,7 @@ qScale = x265_clip3(lqmin, lqmax, qScale); } - if (!m_2pass || m_param->bliveVBV2pass) + if (!m_2pass || m_param->bliveVBV2pass || 
(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF && m_param->rc.bEncFocusedFramesOnly)) { /* clip qp to permissible range after vbv-lookahead estimation to avoid possible * mispredictions by initial frame size predictors */ @@ -1927,7 +2087,7 @@ else { double abrBuffer = 2 * m_rateTolerance * m_bitrate; - if (m_2pass) + if (m_2pass && (m_param->rc.rateControlMode != X265_RC_CRF || !m_param->rc.bEncFocusedFramesOnly)) { double lmin = m_lminm_sliceType; double lmax = m_lmaxm_sliceType; @@ -2057,6 +2217,19 @@ if (m_param->rc.rateControlMode == X265_RC_CRF) { + if (m_param->bEnableSBRC) + { + double rfConstant = m_param->rc.rfConstant; + if (m_currentSatd < rce->movingAvgSum) + rfConstant += 2; + double ipOffset = (curFrame->m_lowres.bScenecut ? m_ipOffset : m_ipOffset / 2.0); + rfConstant = (rce->sliceType == I_SLICE ? rfConstant - ipOffset : + (rce->sliceType == B_SLICE ? rfConstant + m_pbOffset : rfConstant)); + double mbtree_offset = m_param->rc.cuTree ? (1.0 - m_param->rc.qCompress) * 13.5 : 0; + double qComp = (m_param->rc.cuTree && !m_param->rc.hevcAq) ? 0.99 : m_param->rc.qCompress; + m_rateFactorConstant = pow(m_currentSatd, 1.0 - qComp) / + x265_qp2qScale(rfConstant + mbtree_offset); + } q = getQScale(rce, m_rateFactorConstant); x265_zone* zone = getZone(); if (zone) @@ -2082,7 +2255,7 @@ } double tunedQScale = tuneAbrQScaleFromFeedback(initialQScale); overflow = tunedQScale / initialQScale; - q = !m_partialResidualFrames? tunedQScale : initialQScale; + q = !m_partialResidualFrames ? tunedQScale : initialQScale; bool isEncodeEnd = (m_param->totalFrames && m_framesDone > 0.75 * m_param->totalFrames) ? 1 : 0; bool isEncodeBeg = m_framesDone < (int)(m_fps + 0.5); @@ -2138,6 +2311,9 @@ q = X265_MAX(minScenecutQscale, q); m_lastQScaleForP_SLICE = X265_MAX(minScenecutQscale, m_lastQScaleForP_SLICE); } + if (m_param->bEnableSBRC) + q = tuneQscaleForSBRC(curFrame, q); + rce->qpNoVbv = x265_qScale2qp(q); if (m_sliceType == P_SLICE) { @@ -2319,6 +2495,43 @@ return (p->coeff * var + p->offset) / (q * p->count); } +double RateControl::tuneQscaleForSBRC(Frame* curFrame, double q) +{ + int depth = 0; + int framesDoneInSeg = m_framesDone % m_param->keyframeMax; + if (framesDoneInSeg + m_param->lookaheadDepth <= m_param->keyframeMax) + depth = m_param->lookaheadDepth; + else + depth = m_param->keyframeMax - framesDoneInSeg; + for (int iterations = 0; iterations < 1000; iterations++) + { + double totalDuration = m_segDur; + double frameBitsTotal = m_encodedSegmentBits + predictSize(&m_predm_predType, q, (double)m_currentSatd); + for (int i = 0; i < depth; i++) + { + int type = curFrame->m_lowres.plannedTypei; + if (type == X265_TYPE_AUTO) + break; + int64_t satd = curFrame->m_lowres.plannedSatdi >> (X265_DEPTH - 8); + type = IS_X265_TYPE_I(curFrame->m_lowres.plannedTypei) ? I_SLICE : IS_X265_TYPE_B(curFrame->m_lowres.plannedTypei) ? 
B_SLICE : P_SLICE; + int predType = getPredictorType(curFrame->m_lowres.plannedTypei, type); + double curBits = predictSize(&m_predpredType, q, (double)satd); + frameBitsTotal += curBits; + totalDuration += m_frameDuration; + } + //Check for segment buffer overflow and adjust QP accordingly + double segDur = m_param->keyframeMax / m_fps; + double allowedSize = m_vbvMaxRate * segDur; + double remDur = segDur - totalDuration; + double remainingBits = frameBitsTotal / totalDuration * remDur; + if (frameBitsTotal + remainingBits > 0.9 * allowedSize) + q = q * 1.01; + else + break; + } + return q; +} + double RateControl::clipQscale(Frame* curFrame, RateControlEntry* rce, double q) { // B-frames are not directly subject to VBV, @@ -2395,7 +2608,7 @@ { finalDur = x265_clip3(0.4, 1.0, totalDuration); } - targetFill = X265_MIN(m_bufferFill + totalDuration * m_vbvMaxRate * 0.5, m_bufferSize * (1 - m_minBufferFill * finalDur)); + targetFill = X265_MIN(m_bufferFill + totalDuration * m_vbvMaxRate * 0.5, m_bufferSize * (m_minBufferFill * finalDur)); if (bufferFillCur < targetFill) { q *= 1.01; @@ -2828,7 +3041,7 @@ if (m_param->rc.aqMode || m_isVbv || m_param->bAQMotion || bEnableDistOffset) { - if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF)) + if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF && !m_param->rc.bEncFocusedFramesOnly)) { double avgQpRc = 0; /* determine avg QP decided by VBV rate control */ @@ -2862,8 +3075,9 @@ if (m_param->rc.rateControlMode == X265_RC_CRF) { double crfVal, qpRef = curEncData.m_avgQpRc; + bool is2passCrfChange = false; - if (m_2pass) + if (m_2pass && !m_param->rc.bEncFocusedFramesOnly) { if (fabs(curEncData.m_avgQpRc - rce->qpPrev) > 0.1) { @@ -2921,6 +3135,8 @@ m_wantedBitsWindow += m_frameDuration * m_bitrate; m_totalBits += bits - rce->rowTotalBits; m_encodedBits += actualBits; + m_encodedSegmentBits += actualBits; + m_segDur += m_frameDuration; int pos = m_sliderPos - m_param->frameNumThreads; if (pos >= 0) m_encodedBitsWindowpos % s_slidingWindowFrames = actualBits; @@ -3048,10 +3264,26 @@ { uint8_t sliceType = (uint8_t)rce->sliceType; primitives.fix8Pack(m_cuTreeStats.qpBuffer0, curFrame->m_lowres.qpCuTreeOffset, ncu); - if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1) - goto writeFailure; - if (fwrite(m_cuTreeStats.qpBuffer0, sizeof(uint16_t), ncu, m_cutreeStatFileOut) < (size_t)ncu) - goto writeFailure; + + if (X265_SHARE_MODE_FILE == m_param->rc.dataShareMode) + { + if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1) + goto writeFailure; + if (fwrite(m_cuTreeStats.qpBuffer0, sizeof(uint16_t), ncu, m_cutreeStatFileOut) < (size_t)ncu) + goto writeFailure; + } + else // X265_SHARE_MODE_SHAREDMEM == m_param->rc.dataShareMode + { + if (!m_cutreeShrMem) + { + goto writeFailure; + } + + CUTreeSharedDataItem shrItem; + shrItem.type = &sliceType; + shrItem.stats = m_cuTreeStats.qpBuffer0; + m_cutreeShrMem->writeData(&shrItem, WriteSharedCUTreeData); + } } return 0; @@ -3127,6 +3359,13 @@ if (m_cutreeStatFileIn) fclose(m_cutreeStatFileIn); + if (m_cutreeShrMem) + { + m_cutreeShrMem->release(); + delete m_cutreeShrMem; + m_cutreeShrMem = NULL; + } + X265_FREE(m_rce2Pass); X265_FREE(m_encOrder); for (int i = 0; i < 2; i++) @@ -3186,13 +3425,20 @@ double RateControl::forwardMasking(Frame* curFrame, double q) { double qp = x265_qScale2qp(q); - uint32_t maxWindowSize = uint32_t((m_param->fwdScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5); - uint32_t windowSize = maxWindowSize / 3; + uint32_t 
maxWindowSize = uint32_t((m_param->fwdMaxScenecutWindow / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5); + uint32_t windowSize6, prevWindow = 0; int lastScenecut = m_top->m_rateControl->m_lastScenecut; - int lastIFrame = m_top->m_rateControl->m_lastScenecutAwareIFrame; - double fwdRefQpDelta = double(m_param->fwdRefQpDelta); - double fwdNonRefQpDelta = double(m_param->fwdNonRefQpDelta); - double sliceTypeDelta = SLICE_TYPE_DELTA * fwdRefQpDelta; + + double fwdRefQpDelta6, fwdNonRefQpDelta6, sliceTypeDelta6; + for (int i = 0; i < 6; i++) + { + windowSizei = prevWindow + (uint32_t((m_param->fwdScenecutWindowi / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5)); + fwdRefQpDeltai = double(m_param->fwdRefQpDeltai); + fwdNonRefQpDeltai = double(m_param->fwdNonRefQpDeltai); + sliceTypeDeltai = SLICE_TYPE_DELTA * fwdRefQpDeltai; + prevWindow = windowSizei; + } + //Check whether the current frame is within the forward window if (curFrame->m_poc > lastScenecut && curFrame->m_poc <= (lastScenecut + int(maxWindowSize))) @@ -3205,45 +3451,51 @@ } else if (curFrame->m_lowres.sliceType == X265_TYPE_P) { - if (!(lastIFrame > lastScenecut && lastIFrame <= (lastScenecut + int(maxWindowSize)) - && curFrame->m_poc >= lastIFrame)) - { - //Add offsets corresponding to the window in which the P-frame occurs - if (curFrame->m_poc <= (lastScenecut + int(windowSize))) - qp += WINDOW1_DELTA * (fwdRefQpDelta - sliceTypeDelta); - else if (((curFrame->m_poc) > (lastScenecut + int(windowSize))) && ((curFrame->m_poc) <= (lastScenecut + 2 * int(windowSize)))) - qp += WINDOW2_DELTA * (fwdRefQpDelta - sliceTypeDelta); - else if (curFrame->m_poc > lastScenecut + 2 * int(windowSize)) - qp += WINDOW3_DELTA * (fwdRefQpDelta - sliceTypeDelta); - } + //Add offsets corresponding to the window in which the P-frame occurs + if (curFrame->m_poc <= (lastScenecut + int(windowSize0))) + qp += fwdRefQpDelta0 - sliceTypeDelta0; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize0))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize1)))) + qp += fwdRefQpDelta1 - sliceTypeDelta1; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize1))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize2)))) + qp += fwdRefQpDelta2 - sliceTypeDelta2; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize2))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize3)))) + qp += fwdRefQpDelta3 - sliceTypeDelta3; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize3))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize4)))) + qp += fwdRefQpDelta4 - sliceTypeDelta4; + else if (curFrame->m_poc > lastScenecut + int(windowSize4)) + qp += fwdRefQpDelta5 - sliceTypeDelta5; } else if (curFrame->m_lowres.sliceType == X265_TYPE_BREF) { - if (!(lastIFrame > lastScenecut && lastIFrame <= (lastScenecut + int(maxWindowSize)) - && curFrame->m_poc >= lastIFrame)) - { - //Add offsets corresponding to the window in which the B-frame occurs - if (curFrame->m_poc <= (lastScenecut + int(windowSize))) - qp += WINDOW1_DELTA * fwdRefQpDelta; - else if (((curFrame->m_poc) > (lastScenecut + int(windowSize))) && ((curFrame->m_poc) <= (lastScenecut + 2 * int(windowSize)))) - qp += WINDOW2_DELTA * fwdRefQpDelta; - else if (curFrame->m_poc > lastScenecut + 2 * int(windowSize)) - qp += WINDOW3_DELTA * fwdRefQpDelta; - } + //Add offsets corresponding to the window in which the B-frame occurs + if (curFrame->m_poc <= (lastScenecut + int(windowSize0))) + qp += fwdRefQpDelta0; + else if (((curFrame->m_poc) > (lastScenecut + 
int(windowSize0))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize1)))) + qp += fwdRefQpDelta1; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize1))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize2)))) + qp += fwdRefQpDelta2; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize2))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize3)))) + qp += fwdRefQpDelta3; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize3))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize4)))) + qp += fwdRefQpDelta4; + else if (curFrame->m_poc > lastScenecut + int(windowSize4)) + qp += fwdRefQpDelta5; } else if (curFrame->m_lowres.sliceType == X265_TYPE_B) { - if (!(lastIFrame > lastScenecut && lastIFrame <= (lastScenecut + int(maxWindowSize)) - && curFrame->m_poc >= lastIFrame)) - { - //Add offsets corresponding to the window in which the b-frame occurs - if (curFrame->m_poc <= (lastScenecut + int(windowSize))) - qp += WINDOW1_DELTA * fwdNonRefQpDelta; - else if (((curFrame->m_poc) > (lastScenecut + int(windowSize))) && ((curFrame->m_poc) <= (lastScenecut + 2 * int(windowSize)))) - qp += WINDOW2_DELTA * fwdNonRefQpDelta; - else if (curFrame->m_poc > lastScenecut + 2 * int(windowSize)) - qp += WINDOW3_DELTA * fwdNonRefQpDelta; - } + //Add offsets corresponding to the window in which the b-frame occurs + if (curFrame->m_poc <= (lastScenecut + int(windowSize0))) + qp += fwdNonRefQpDelta0; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize0))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize1)))) + qp += fwdNonRefQpDelta1; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize1))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize2)))) + qp += fwdNonRefQpDelta2; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize2))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize3)))) + qp += fwdNonRefQpDelta3; + else if (((curFrame->m_poc) > (lastScenecut + int(windowSize3))) && ((curFrame->m_poc) <= (lastScenecut + int(windowSize4)))) + qp += fwdNonRefQpDelta4; + else if (curFrame->m_poc > lastScenecut + int(windowSize4)) + qp += fwdNonRefQpDelta5; } } @@ -3252,24 +3504,75 @@ double RateControl::backwardMasking(Frame* curFrame, double q) { double qp = x265_qScale2qp(q); - double fwdRefQpDelta = double(m_param->fwdRefQpDelta); - double bwdRefQpDelta = double(m_param->bwdRefQpDelta); - double bwdNonRefQpDelta = double(m_param->bwdNonRefQpDelta); + uint32_t windowSize6, prevWindow = 0; + int lastScenecut = m_top->m_rateControl->m_lastScenecut; - if (curFrame->m_isInsideWindow == BACKWARD_WINDOW) + double bwdRefQpDelta6, bwdNonRefQpDelta6, sliceTypeDelta6; + for (int i = 0; i < 6; i++) { - if (bwdRefQpDelta < 0) - bwdRefQpDelta = WINDOW3_DELTA * fwdRefQpDelta; - double sliceTypeDelta = SLICE_TYPE_DELTA * bwdRefQpDelta; - if (bwdNonRefQpDelta < 0) - bwdNonRefQpDelta = bwdRefQpDelta + sliceTypeDelta; + windowSizei = prevWindow + (uint32_t((m_param->bwdScenecutWindowi / 1000.0) * (m_param->fpsNum / m_param->fpsDenom) + 0.5)); + prevWindow = windowSizei; + bwdRefQpDeltai = double(m_param->bwdRefQpDeltai); + bwdNonRefQpDeltai = double(m_param->bwdNonRefQpDeltai); + + if (bwdRefQpDeltai < 0) + bwdRefQpDeltai = BWD_WINDOW_DELTA * m_param->fwdRefQpDeltai; + sliceTypeDeltai = SLICE_TYPE_DELTA * bwdRefQpDeltai; + + if (bwdNonRefQpDeltai < 0) + bwdNonRefQpDeltai = bwdRefQpDeltai + sliceTypeDeltai; + } + if (curFrame->m_isInsideWindow == BACKWARD_WINDOW) + { if (curFrame->m_lowres.sliceType == X265_TYPE_P) - qp += bwdRefQpDelta - 
sliceTypeDelta; + { + //Add offsets corresponding to the window in which the P-frame occurs + if (curFrame->m_poc >= (lastScenecut - int(windowSize0))) + qp += bwdRefQpDelta0 - sliceTypeDelta0; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize0))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize1)))) + qp += bwdRefQpDelta1 - sliceTypeDelta1; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize1))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize2)))) + qp += bwdRefQpDelta2 - sliceTypeDelta2; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize2))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize3)))) + qp += bwdRefQpDelta3 - sliceTypeDelta3; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize3))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize4)))) + qp += bwdRefQpDelta4 - sliceTypeDelta4; + else if (curFrame->m_poc < lastScenecut - int(windowSize4)) + qp += bwdRefQpDelta5 - sliceTypeDelta5; + } else if (curFrame->m_lowres.sliceType == X265_TYPE_BREF) - qp += bwdRefQpDelta; + { + //Add offsets corresponding to the window in which the B-frame occurs + if (curFrame->m_poc >= (lastScenecut - int(windowSize0))) + qp += bwdRefQpDelta0; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize0))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize1)))) + qp += bwdRefQpDelta1; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize1))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize2)))) + qp += bwdRefQpDelta2; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize2))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize3)))) + qp += bwdRefQpDelta3; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize3))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize4)))) + qp += bwdRefQpDelta4; + else if (curFrame->m_poc < lastScenecut - int(windowSize4)) + qp += bwdRefQpDelta5; + } else if (curFrame->m_lowres.sliceType == X265_TYPE_B) - qp += bwdNonRefQpDelta; + { + //Add offsets corresponding to the window in which the b-frame occurs + if (curFrame->m_poc >= (lastScenecut - int(windowSize0))) + qp += bwdNonRefQpDelta0; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize0))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize1)))) + qp += bwdNonRefQpDelta1; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize1))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize2)))) + qp += bwdNonRefQpDelta2; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize2))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize3)))) + qp += bwdNonRefQpDelta3; + else if (((curFrame->m_poc) < (lastScenecut - int(windowSize3))) && ((curFrame->m_poc) >= (lastScenecut - int(windowSize4)))) + qp += bwdNonRefQpDelta4; + else if (curFrame->m_poc < lastScenecut - int(windowSize4)) + qp += bwdNonRefQpDelta5; + } } return x265_qp2qScale(qp);
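
The new RateControl::tuneQscaleForSBRC() in the ratecontrol.cpp hunks above projects how many bits the current segment will consume (bits already spent plus predicted sizes for the look-ahead frames, extrapolated over the remaining segment duration) and raises qScale in 1% steps while that projection exceeds 90% of the budget allowed by the VBV max rate over one keyint-long segment. The following standalone sketch illustrates only that control loop; the FramePlan struct and the inverse-qScale bit predictor are placeholder stand-ins for x265's internal predictors, not its API.

    #include <vector>

    // Hypothetical per-frame plan: a complexity proxy so that predicted bits ~ complexity / qScale.
    struct FramePlan { double complexity; };

    // Placeholder predictor standing in for x265's rate predictors.
    static double predictBits(const FramePlan& f, double qScale) { return f.complexity / qScale; }

    // Raise qScale until the projected segment size fits ~90% of the allowed budget.
    double tuneQscaleForSegment(double qScale, double encodedSegmentBits, double segDurSoFar,
                                const std::vector<FramePlan>& lookahead, double frameDuration,
                                double maxRate, double segmentDuration)
    {
        const double allowedBits = maxRate * segmentDuration;
        for (int iterations = 0; iterations < 1000; iterations++)
        {
            double totalDuration = segDurSoFar;
            double bitsTotal = encodedSegmentBits;
            for (const FramePlan& f : lookahead)          // frames the lookahead has already planned
            {
                bitsTotal += predictBits(f, qScale);
                totalDuration += frameDuration;
            }
            double remDur = segmentDuration - totalDuration;  // unplanned tail of the segment
            double remainingBits = totalDuration > 0 ? bitsTotal / totalDuration * remDur : 0.0;
            if (bitsTotal + remainingBits > 0.9 * allowedBits)
                qScale *= 1.01;   // projected overflow: quantize a little harder and re-check
            else
                break;            // projection fits the segment budget
        }
        return qScale;
    }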
View file
x265_3.5.tar.gz/source/encoder/ratecontrol.h -> x265_3.6.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -28,6 +28,7 @@ #include "common.h" #include "sei.h" +#include "ringmem.h" namespace X265_NS { // encoder namespace @@ -46,11 +47,6 @@ #define MIN_AMORTIZE_FRACTION 0.2 #define CLIP_DURATION(f) x265_clip3(MIN_FRAME_DURATION, MAX_FRAME_DURATION, f) -/*Scenecut Aware QP*/ -#define WINDOW1_DELTA 1.0 /* The offset for the frames coming in the window-1*/ -#define WINDOW2_DELTA 0.7 /* The offset for the frames coming in the window-2*/ -#define WINDOW3_DELTA 0.4 /* The offset for the frames coming in the window-3*/ - struct Predictor { double coeffMin; @@ -73,6 +69,7 @@ Predictor rowPreds32; Predictor* rowPred2; + int64_t currentSatd; int64_t lastSatd; /* Contains the picture cost of the previous frame, required for resetAbr and VBV */ int64_t leadingNoBSatd; int64_t rowTotalBits; /* update cplxrsum and totalbits at the end of 2 rows */ @@ -87,6 +84,8 @@ double rowCplxrSum; double qpNoVbv; double bufferFill; + double bufferFillFinal; + double bufferFillActual; double targetFill; bool vbvEndAdj; double frameDuration; @@ -192,6 +191,8 @@ double m_qCompress; int64_t m_totalBits; /* total bits used for already encoded frames (after ammortization) */ int64_t m_encodedBits; /* bits used for encoded frames (without ammortization) */ + int64_t m_encodedSegmentBits; /* bits used for encoded frames in a segment*/ + double m_segDur; double m_fps; int64_t m_satdCostWindow50; int64_t m_encodedBitsWindow50; @@ -237,6 +238,8 @@ FILE* m_statFileOut; FILE* m_cutreeStatFileOut; FILE* m_cutreeStatFileIn; + ///< store the cutree data in memory instead of file + RingMem *m_cutreeShrMem; double m_lastAccumPNorm; double m_expectedBitsSum; /* sum of qscale2bits after rceq, ratefactor, and overflow, only includes finished frames */ int64_t m_predictedBits; @@ -254,6 +257,7 @@ RateControl(x265_param& p, Encoder *enc); bool init(const SPS& sps); void initHRD(SPS& sps); + void initVBV(const SPS& sps); void reconfigureRC(); void setFinalFrameCount(int count); @@ -271,6 +275,9 @@ int writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce); bool initPass2(); + bool initCUTreeSharedMem(); + void skipCUTreeSharedMemRead(int32_t cnt); + double forwardMasking(Frame* curFrame, double q); double backwardMasking(Frame* curFrame, double q); @@ -291,6 +298,7 @@ double rateEstimateQscale(Frame* pic, RateControlEntry *rce); // main logic for calculating QP based on ABR double tuneAbrQScaleFromFeedback(double qScale); double tuneQScaleForZone(RateControlEntry *rce, double qScale); // Tune qScale to adhere to zone budget + double tuneQscaleForSBRC(Frame* curFrame, double q); // Tune qScale to adhere to segment budget void accumPQpUpdate(); int getPredictorType(int lowresSliceType, int sliceType); @@ -311,6 +319,7 @@ double tuneQScaleForGrain(double rcOverflow); void splitdeltaPOC(char deltapoc, RateControlEntry *rce); void splitbUsed(char deltapoc, RateControlEntry *rce); + void checkAndResetCRF(RateControlEntry* rce); }; } #endif // ifndef X265_RATECONTROL_H
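
Besides the new SBRC counters (m_encodedSegmentBits, m_segDur), the header now carries a RingMem* m_cutreeShrMem (note the added ringmem.h include) so two-pass cu-tree data can be exchanged through memory instead of the .cutree stat file. The sketch below illustrates only the producer/consumer ring pattern such a share mode implies, using an in-process queue; it is not the RingMem API, which is defined elsewhere in the tree and works across processes.

    #include <cstdint>
    #include <deque>
    #include <vector>
    #include <mutex>
    #include <condition_variable>

    // One record per frame: the slice type plus the packed per-CU qp offsets,
    // mirroring the (type, qpBuffer) pair exchanged between the two passes.
    struct CuTreeRecord
    {
        uint8_t sliceType;
        std::vector<uint16_t> qpOffsets;
    };

    // Minimal in-memory ring, for illustration only.
    class CuTreeRing
    {
    public:
        explicit CuTreeRing(size_t slots) : m_slots(slots) {}

        void write(const CuTreeRecord& rec)                 // producer: first pass
        {
            std::unique_lock<std::mutex> lk(m_lock);
            m_notFull.wait(lk, [&] { return m_queue.size() < m_slots; });
            m_queue.push_back(rec);
            m_notEmpty.notify_one();
        }

        CuTreeRecord readNext()                             // consumer: second pass
        {
            std::unique_lock<std::mutex> lk(m_lock);
            m_notEmpty.wait(lk, [&] { return !m_queue.empty(); });
            CuTreeRecord rec = m_queue.front();
            m_queue.pop_front();
            m_notFull.notify_one();
            return rec;
        }

        void skipRead(int32_t cnt)                          // analogous to skipCUTreeSharedMemRead()
        {
            std::unique_lock<std::mutex> lk(m_lock);
            while (cnt-- > 0 && !m_queue.empty())
                m_queue.pop_front();
            m_notFull.notify_all();
        }

    private:
        size_t m_slots;
        std::deque<CuTreeRecord> m_queue;
        std::mutex m_lock;
        std::condition_variable m_notEmpty, m_notFull;
    };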
View file
x265_3.5.tar.gz/source/encoder/sei.cpp -> x265_3.6.tar.gz/source/encoder/sei.cpp
Changed
@@ -68,7 +68,7 @@
 {
     if (nalUnitType != NAL_UNIT_UNSPECIFIED)
         bs.writeByteAlignment();
-    list.serialize(nalUnitType, bs);
+    list.serialize(nalUnitType, bs, (1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N)));
 }
 }
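
The only change to sei.cpp is the extra argument passed to serialize(): the value 1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N) reads like a temporal-id hint for the NAL header of the emitted SEI, since TSA_N slices sit in a higher temporal sub-layer. That interpretation is an assumption; the serialize() overload itself is defined elsewhere in the tree. For reference, the two-byte HEVC NAL unit header that carries nuh_temporal_id_plus1 packs as shown in this illustrative helper (not x265 code):

    #include <cstdint>

    // HEVC NAL unit header (ITU-T H.265, 7.3.1.2), 16 bits:
    // forbidden_zero_bit(1) | nal_unit_type(6) | nuh_layer_id(6) | nuh_temporal_id_plus1(3)
    static inline uint16_t packNalHeader(uint8_t nalUnitType, uint8_t layerId, uint8_t temporalIdPlus1)
    {
        return (uint16_t)(((nalUnitType & 0x3F) << 9) | ((layerId & 0x3F) << 3) | (temporalIdPlus1 & 0x7));
    }

    // Example: a prefix SEI NAL placed in temporal sub-layer 1 (temporal_id_plus1 = 2),
    // matching the "1 + ..." expression above. PREFIX_SEI_NUT is 39 in HEVC.
    uint16_t seiHeaderExample()
    {
        const uint8_t NAL_UNIT_PREFIX_SEI = 39;
        return packNalHeader(NAL_UNIT_PREFIX_SEI, /*layerId=*/0, /*temporalIdPlus1=*/2);
    }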
View file
x265_3.5.tar.gz/source/encoder/sei.h -> x265_3.6.tar.gz/source/encoder/sei.h
Changed
@@ -73,6 +73,101 @@
     }
 };
 
+/* Film grain characteristics */
+class FilmGrainCharacteristics : public SEI
+{
+public:
+
+    FilmGrainCharacteristics()
+    {
+        m_payloadType = FILM_GRAIN_CHARACTERISTICS;
+        m_payloadSize = 0;
+    }
+
+    struct CompModelIntensityValues
+    {
+        uint8_t intensityIntervalLowerBound;
+        uint8_t intensityIntervalUpperBound;
+        int*    compModelValue;
+    };
+
+    struct CompModel
+    {
+        bool    bPresentFlag;
+        uint8_t numModelValues;
+        uint8_t m_filmGrainNumIntensityIntervalMinus1;
+        CompModelIntensityValues* intensityValues;
+    };
+
+    CompModel m_compModel[MAX_NUM_COMPONENT];
+    bool      m_filmGrainCharacteristicsPersistenceFlag;
+    bool      m_filmGrainCharacteristicsCancelFlag;
+    bool      m_separateColourDescriptionPresentFlag;
+    bool      m_filmGrainFullRangeFlag;
+    uint8_t   m_filmGrainModelId;
+    uint8_t   m_blendingModeId;
+    uint8_t   m_log2ScaleFactor;
+    uint8_t   m_filmGrainBitDepthLumaMinus8;
+    uint8_t   m_filmGrainBitDepthChromaMinus8;
+    uint8_t   m_filmGrainColourPrimaries;
+    uint8_t   m_filmGrainTransferCharacteristics;
+    uint8_t   m_filmGrainMatrixCoeffs;
+
+    void writeSEI(const SPS&)
+    {
+        WRITE_FLAG(m_filmGrainCharacteristicsCancelFlag, "film_grain_characteristics_cancel_flag");
+
+        if (!m_filmGrainCharacteristicsCancelFlag)
+        {
+            WRITE_CODE(m_filmGrainModelId, 2, "film_grain_model_id");
+            WRITE_FLAG(m_separateColourDescriptionPresentFlag, "separate_colour_description_present_flag");
+            if (m_separateColourDescriptionPresentFlag)
+            {
+                WRITE_CODE(m_filmGrainBitDepthLumaMinus8, 3, "film_grain_bit_depth_luma_minus8");
+                WRITE_CODE(m_filmGrainBitDepthChromaMinus8, 3, "film_grain_bit_depth_chroma_minus8");
+                WRITE_FLAG(m_filmGrainFullRangeFlag, "film_grain_full_range_flag");
+                WRITE_CODE(m_filmGrainColourPrimaries, X265_BYTE, "film_grain_colour_primaries");
+                WRITE_CODE(m_filmGrainTransferCharacteristics, X265_BYTE, "film_grain_transfer_characteristics");
+                WRITE_CODE(m_filmGrainMatrixCoeffs, X265_BYTE, "film_grain_matrix_coeffs");
+            }
+            WRITE_CODE(m_blendingModeId, 2, "blending_mode_id");
+            WRITE_CODE(m_log2ScaleFactor, 4, "log2_scale_factor");
+            for (uint8_t c = 0; c < 3; c++)
+            {
+                WRITE_FLAG(m_compModel[c].bPresentFlag && m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1 > 0 && m_compModel[c].numModelValues > 0, "comp_model_present_flag[c]");
+            }
+            for (uint8_t c = 0; c < 3; c++)
+            {
+                if (m_compModel[c].bPresentFlag && m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1 > 0 && m_compModel[c].numModelValues > 0)
+                {
+                    assert(m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1 <= 256);
+                    assert(m_compModel[c].numModelValues <= X265_BYTE);
+                    WRITE_CODE(m_compModel[c].m_filmGrainNumIntensityIntervalMinus1, X265_BYTE, "num_intensity_intervals_minus1[c]");
+                    WRITE_CODE(m_compModel[c].numModelValues - 1, 3, "num_model_values_minus1[c]");
+                    for (uint8_t interval = 0; interval < m_compModel[c].m_filmGrainNumIntensityIntervalMinus1 + 1; interval++)
+                    {
+                        WRITE_CODE(m_compModel[c].intensityValues[interval].intensityIntervalLowerBound, X265_BYTE, "intensity_interval_lower_bound[c][i]");
+                        WRITE_CODE(m_compModel[c].intensityValues[interval].intensityIntervalUpperBound, X265_BYTE, "intensity_interval_upper_bound[c][i]");
+                        for (uint8_t j = 0; j < m_compModel[c].numModelValues; j++)
+                        {
+                            WRITE_SVLC(m_compModel[c].intensityValues[interval].compModelValue[j], "comp_model_value[c][i]");
+                        }
+                    }
+                }
+            }
+            WRITE_FLAG(m_filmGrainCharacteristicsPersistenceFlag, "film_grain_characteristics_persistence_flag");
+        }
+        if (m_bitIf->getNumberOfWrittenBits() % X265_BYTE != 0)
+        {
+            WRITE_FLAG(1, "payload_bit_equal_to_one");
+            while (m_bitIf->getNumberOfWrittenBits() % X265_BYTE != 0)
+            {
+                WRITE_FLAG(0, "payload_bit_equal_to_zero");
+            }
+        }
+    }
+};
+
 static const uint32_t ISO_IEC_11578_LEN = 16;
 
 class SEIuserDataUnregistered : public SEI
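
The FilmGrainCharacteristics message above follows the film grain characteristics SEI syntax: a cancel/persistence pair, an optional separate colour description, and, per colour component, a present flag, a set of intensity intervals and a list of model values per interval. A hedged usage sketch follows, assuming the class above is in scope; every numeric value is invented for illustration, and the pointed-to interval storage must outlive the SEI write because the class stores raw pointers.

    #include <cstdint>

    void fillExampleFilmGrain(FilmGrainCharacteristics& fgc)
    {
        fgc.m_filmGrainCharacteristicsCancelFlag = false;
        fgc.m_filmGrainCharacteristicsPersistenceFlag = true;
        fgc.m_separateColourDescriptionPresentFlag = false;
        fgc.m_filmGrainModelId = 0;   // 0 = frequency-filtering model in the SEI syntax
        fgc.m_blendingModeId = 0;     // 0 = additive blending
        fgc.m_log2ScaleFactor = 2;

        // Luma component (index 0): one intensity interval covering mid-tones.
        static FilmGrainCharacteristics::CompModelIntensityValues interval;
        static int modelValues[2] = { 16, 8 };        // illustrative model values only
        interval.intensityIntervalLowerBound = 40;
        interval.intensityIntervalUpperBound = 200;
        interval.compModelValue = modelValues;

        fgc.m_compModel[0].bPresentFlag = true;
        fgc.m_compModel[0].m_filmGrainNumIntensityIntervalMinus1 = 0;   // exactly one interval
        fgc.m_compModel[0].numModelValues = 2;
        fgc.m_compModel[0].intensityValues = &interval;

        // Chroma components left absent in this sketch.
        fgc.m_compModel[1].bPresentFlag = false;
        fgc.m_compModel[2].bPresentFlag = false;
    }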
View file
x265_3.5.tar.gz/source/encoder/slicetype.cpp -> x265_3.6.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -87,6 +87,14 @@ namespace X265_NS { +uint32_t acEnergyVarHist(uint64_t sum_ssd, int shift) +{ + uint32_t sum = (uint32_t)sum_ssd; + uint32_t ssd = (uint32_t)(sum_ssd >> 32); + + return ssd - ((uint64_t)sum * sum >> shift); +} + bool computeEdge(pixel* edgePic, pixel* refPic, pixel* edgeTheta, intptr_t stride, int height, int width, bool bcalcTheta, pixel whitePixel) { intptr_t rowOne = 0, rowTwo = 0, rowThree = 0, colOne = 0, colTwo = 0, colThree = 0; @@ -184,7 +192,7 @@ { for (int colNum = 0; colNum < width; colNum++) { - if ((rowNum >= 2) && (colNum >= 2) && (rowNum != height - 2) && (colNum != width - 2)) //Ignoring the border pixels of the picture + if ((rowNum >= 2) && (colNum >= 2) && (rowNum < height - 2) && (colNum < width - 2)) //Ignoring the border pixels of the picture { /* 5x5 Gaussian filter 2 4 5 4 2 @@ -519,7 +527,7 @@ if (param->rc.aqMode == X265_AQ_EDGE) edgeFilter(curFrame, param); - if (param->rc.aqMode == X265_AQ_EDGE && !param->bHistBasedSceneCut && param->recursionSkipMode == EDGE_BASED_RSKIP) + if (param->rc.aqMode == X265_AQ_EDGE && param->recursionSkipMode == EDGE_BASED_RSKIP) { pixel* src = curFrame->m_edgePic + curFrame->m_fencPic->m_lumaMarginY * curFrame->m_fencPic->m_stride + curFrame->m_fencPic->m_lumaMarginX; primitives.planecopy_pp_shr(src, curFrame->m_fencPic->m_stride, curFrame->m_edgeBitPic, @@ -1050,7 +1058,48 @@ m_countPreLookahead = 0; #endif - memset(m_histogram, 0, sizeof(m_histogram)); + m_accHistDiffRunningAvgCb = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t*)); + m_accHistDiffRunningAvgCb0 = X265_MALLOC(uint32_t, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + memset(m_accHistDiffRunningAvgCb0, 0, sizeof(uint32_t) * NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + for (uint32_t w = 1; w < NUMBER_OF_SEGMENTS_IN_WIDTH; w++) { + m_accHistDiffRunningAvgCbw = m_accHistDiffRunningAvgCb0 + w * NUMBER_OF_SEGMENTS_IN_HEIGHT; + } + + m_accHistDiffRunningAvgCr = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t*)); + m_accHistDiffRunningAvgCr0 = X265_MALLOC(uint32_t, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + memset(m_accHistDiffRunningAvgCr0, 0, sizeof(uint32_t) * NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + for (uint32_t w = 1; w < NUMBER_OF_SEGMENTS_IN_WIDTH; w++) { + m_accHistDiffRunningAvgCrw = m_accHistDiffRunningAvgCr0 + w * NUMBER_OF_SEGMENTS_IN_HEIGHT; + } + + m_accHistDiffRunningAvg = X265_MALLOC(uint32_t*, NUMBER_OF_SEGMENTS_IN_WIDTH * sizeof(uint32_t*)); + m_accHistDiffRunningAvg0 = X265_MALLOC(uint32_t, NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + memset(m_accHistDiffRunningAvg0, 0, sizeof(uint32_t) * NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT); + for (uint32_t w = 1; w < NUMBER_OF_SEGMENTS_IN_WIDTH; w++) { + m_accHistDiffRunningAvgw = m_accHistDiffRunningAvg0 + w * NUMBER_OF_SEGMENTS_IN_HEIGHT; + } + + m_resetRunningAvg = true; + + m_segmentCountThreshold = (uint32_t)(((float)((NUMBER_OF_SEGMENTS_IN_WIDTH * NUMBER_OF_SEGMENTS_IN_HEIGHT) * 50) / 100) + 0.5); + + if (m_param->bEnableTemporalSubLayers > 2) + { + switch (m_param->bEnableTemporalSubLayers) + { + case 3: + m_gopId = 0; + break; + case 4: + m_gopId = 1; + break; + case 5: + m_gopId = 2; + break; + default: + break; + } + } } #if DETAILED_CU_STATS @@ -1098,6 +1147,7 @@ m_pooli.stopWorkers(); } } + void Lookahead::destroy() { // these two queues will be empty unless the encode was aborted @@ -1309,32 +1359,32 @@ default: return; } - if 
(!m_param->analysisLoad || !m_param->bDisableLookahead) + if (!curFrame->m_param->analysisLoad || !curFrame->m_param->bDisableLookahead) { X265_CHECK(curFrame->m_lowres.costEstb - p0p1 - b > 0, "Slice cost not estimated\n") - if (m_param->rc.cuTree && !m_param->rc.bStatRead) + if (curFrame->m_param->rc.cuTree && !curFrame->m_param->rc.bStatRead) /* update row satds based on cutree offsets */ curFrame->m_lowres.satdCost = frameCostRecalculate(frames, p0, p1, b); - else if (!m_param->analysisLoad || m_param->scaleFactor || m_param->bAnalysisType == HEVC_INFO) + else if (!curFrame->m_param->analysisLoad || curFrame->m_param->scaleFactor || curFrame->m_param->bAnalysisType == HEVC_INFO) { - if (m_param->rc.aqMode) + if (curFrame->m_param->rc.aqMode) curFrame->m_lowres.satdCost = curFrame->m_lowres.costEstAqb - p0p1 - b; else curFrame->m_lowres.satdCost = curFrame->m_lowres.costEstb - p0p1 - b; } - if (m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate) + if (curFrame->m_param->rc.vbvBufferSize && curFrame->m_param->rc.vbvMaxBitrate) { /* aggregate lowres row satds to CTU resolution */ curFrame->m_lowres.lowresCostForRc = curFrame->m_lowres.lowresCostsb - p0p1 - b; uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0, intraSum = 0; - uint32_t scale = m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE); - uint32_t numCuInHeight = (m_param->sourceHeight + m_param->maxCUSize - 1) / m_param->maxCUSize; + uint32_t scale = curFrame->m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE); + uint32_t numCuInHeight = (curFrame->m_param->sourceHeight + curFrame->m_param->maxCUSize - 1) / curFrame->m_param->maxCUSize; uint32_t widthInLowresCu = (uint32_t)m_8x8Width, heightInLowresCu = (uint32_t)m_8x8Height; double *qp_offset = 0; /* Factor in qpoffsets based on Aq/Cutree in CU costs */ - if (m_param->rc.aqMode || m_param->bAQMotion) - qp_offset = (framesb->sliceType == X265_TYPE_B || !m_param->rc.cuTree) ? framesb->qpAqOffset : framesb->qpCuTreeOffset; + if (curFrame->m_param->rc.aqMode || curFrame->m_param->bAQMotion) + qp_offset = (framesb->sliceType == X265_TYPE_B || !curFrame->m_param->rc.cuTree) ? 
framesb->qpAqOffset : framesb->qpCuTreeOffset; for (uint32_t row = 0; row < numCuInHeight; row++) { @@ -1350,7 +1400,7 @@ if (qp_offset) { double qpOffset; - if (m_param->rc.qgSize == 8) + if (curFrame->m_param->rc.qgSize == 8) qpOffset = (qp_offsetlowresCol * 2 + lowresRow * widthInLowresCu * 4 + qp_offsetlowresCol * 2 + lowresRow * widthInLowresCu * 4 + 1 + qp_offsetlowresCol * 2 + lowresRow * widthInLowresCu * 4 + curFrame->m_lowres.maxBlocksInRowFullRes + @@ -1361,7 +1411,7 @@ int32_t intraCuCost = curFrame->m_lowres.intraCostlowresCuIdx; curFrame->m_lowres.intraCostlowresCuIdx = (intraCuCost * x265_exp2fix8(qpOffset) + 128) >> 8; } - if (m_param->bIntraRefresh && slice->m_sliceType == X265_TYPE_P) + if (curFrame->m_param->bIntraRefresh && slice->m_sliceType == X265_TYPE_P) for (uint32_t x = curFrame->m_encData->m_pir.pirStartCol; x <= curFrame->m_encData->m_pir.pirEndCol; x++) diff += curFrame->m_lowres.intraCostlowresCuIdx - lowresCuCost; curFrame->m_lowres.lowresCostForRclowresCuIdx = lowresCuCost; @@ -1377,6 +1427,291 @@ } } +uint32_t LookaheadTLD::calcVariance(pixel* inpSrc, intptr_t stride, intptr_t blockOffset, uint32_t plane) +{ + pixel* src = inpSrc + blockOffset; + + uint32_t var; + if (!plane) + var = acEnergyVarHist(primitives.cuBLOCK_8x8.var(src, stride), 6); + else + var = acEnergyVarHist(primitives.cuBLOCK_4x4.var(src, stride), 4); + + x265_emms(); + return var; +} + +/* +** Compute Block and Picture Variance, Block Mean for all blocks in the picture +*/ +void LookaheadTLD::computePictureStatistics(Frame *curFrame) +{ + int maxCol = curFrame->m_fencPic->m_picWidth; + int maxRow = curFrame->m_fencPic->m_picHeight; + intptr_t inpStride = curFrame->m_fencPic->m_stride; + + // Variance + uint64_t picTotVariance = 0; + uint32_t variance; + + uint64_t blockXY = 0; + pixel* src = curFrame->m_fencPic->m_picOrg0; + + for (int blockY = 0; blockY < maxRow; blockY += 8) + { + uint64_t rowVariance = 0; + for (int blockX = 0; blockX < maxCol; blockX += 8) + { + intptr_t blockOffsetLuma = blockX + (blockY * inpStride); + + variance = calcVariance( + src, + inpStride, + blockOffsetLuma, 0); + + rowVariance += variance; + blockXY++; + } + picTotVariance += (uint16_t)(rowVariance / maxCol); + } + + curFrame->m_lowres.picAvgVariance = (uint16_t)(picTotVariance / maxRow); + + // Collect chroma variance + int hShift = curFrame->m_fencPic->m_hChromaShift; + int vShift = curFrame->m_fencPic->m_vChromaShift; + + int maxColChroma = curFrame->m_fencPic->m_picWidth >> hShift; + int maxRowChroma = curFrame->m_fencPic->m_picHeight >> vShift; + intptr_t cStride = curFrame->m_fencPic->m_strideC; + + pixel* srcCb = curFrame->m_fencPic->m_picOrg1; + + picTotVariance = 0; + for (int blockY = 0; blockY < maxRowChroma; blockY += 4) + { + uint64_t rowVariance = 0; + for (int blockX = 0; blockX < maxColChroma; blockX += 4) + { + intptr_t blockOffsetChroma = blockX + blockY * cStride; + + variance = calcVariance( + srcCb, + cStride, + blockOffsetChroma, 1); + + rowVariance += variance; + blockXY++; + } + picTotVariance += (uint16_t)(rowVariance / maxColChroma); + } + + curFrame->m_lowres.picAvgVarianceCb = (uint16_t)(picTotVariance / maxRowChroma); + + + pixel* srcCr = curFrame->m_fencPic->m_picOrg2; + + picTotVariance = 0; + for (int blockY = 0; blockY < maxRowChroma; blockY += 4) + { + uint64_t rowVariance = 0; + for (int blockX = 0; blockX < maxColChroma; blockX += 4) + { + intptr_t blockOffsetChroma = blockX + blockY * cStride; + + variance = calcVariance( + srcCr, + cStride, + blockOffsetChroma, 2); + + 
rowVariance += variance; + blockXY++; + } + picTotVariance += (uint16_t)(rowVariance / maxColChroma); + } + + curFrame->m_lowres.picAvgVarianceCr = (uint16_t)(picTotVariance / maxRowChroma); +} + +/* +* Compute histogram of n-bins for the input +*/ +void LookaheadTLD::calculateHistogram( + pixel *inputSrc, + uint32_t inputWidth, + uint32_t inputHeight, + intptr_t stride, + uint8_t dsFactor, + uint32_t *histogram, + uint64_t *sum) + +{ + *sum = 0; + + for (uint32_t verticalIdx = 0; verticalIdx < inputHeight; verticalIdx += dsFactor) + { + for (uint32_t horizontalIdx = 0; horizontalIdx < inputWidth; horizontalIdx += dsFactor) + { + ++(histograminputSrchorizontalIdx); + *sum += inputSrchorizontalIdx; + } + inputSrc += (stride << (dsFactor >> 1)); + } + + return; +} + +/* +* Compute histogram bins and chroma pixel intensity * +*/ +void LookaheadTLD::computeIntensityHistogramBinsChroma( + Frame *curFrame, + uint64_t *sumAverageIntensityCb, + uint64_t *sumAverageIntensityCr) +{ + uint64_t sum; + uint8_t dsFactor = 4; + + uint32_t segmentWidth = curFrame->m_lowres.widthFullRes / NUMBER_OF_SEGMENTS_IN_WIDTH; + uint32_t segmentHeight = curFrame->m_lowres.heightFullRes / NUMBER_OF_SEGMENTS_IN_HEIGHT; + + for (uint32_t segmentInFrameWidthIndex = 0; segmentInFrameWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIndex++) + { + for (uint32_t segmentInFrameHeightIndex = 0; segmentInFrameHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIndex++) + { + // Initialize bins to 1 + for (uint32_t cuIndex = 0; cuIndex < 256; cuIndex++) { + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1cuIndex = 1; + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2cuIndex = 1; + } + + uint32_t segmentWidthOffset = (segmentInFrameWidthIndex == NUMBER_OF_SEGMENTS_IN_WIDTH - 1) ? + curFrame->m_lowres.widthFullRes - (NUMBER_OF_SEGMENTS_IN_WIDTH * segmentWidth) : 0; + + uint32_t segmentHeightOffset = (segmentInFrameHeightIndex == NUMBER_OF_SEGMENTS_IN_HEIGHT - 1) ? 
+ curFrame->m_lowres.heightFullRes - (NUMBER_OF_SEGMENTS_IN_HEIGHT * segmentHeight) : 0; + + + // U Histogram + calculateHistogram( + curFrame->m_fencPic->m_picOrg1 + ((segmentInFrameWidthIndex * segmentWidth) >> 1) + (((segmentInFrameHeightIndex * segmentHeight) >> 1) * curFrame->m_fencPic->m_strideC), + (segmentWidth + segmentWidthOffset) >> 1, + (segmentHeight + segmentHeightOffset) >> 1, + curFrame->m_fencPic->m_strideC, + dsFactor, + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1, + &sum); + + sum = (sum << dsFactor); + *sumAverageIntensityCb += sum; + curFrame->m_lowres.averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex1 = + (uint8_t)((sum + (((segmentWidth + segmentWidthOffset) * (segmentHeight + segmentHeightOffset)) >> 3)) / (((segmentWidth + segmentWidthOffset) * (segmentHeight + segmentHeightOffset)) >> 2)); + + for (uint16_t histogramBin = 0; histogramBin < HISTOGRAM_NUMBER_OF_BINS; histogramBin++) { + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1histogramBin = + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1histogramBin << dsFactor; + } + + // V Histogram + calculateHistogram( + curFrame->m_fencPic->m_picOrg2 + ((segmentInFrameWidthIndex * segmentWidth) >> 1) + (((segmentInFrameHeightIndex * segmentHeight) >> 1) * curFrame->m_fencPic->m_strideC), + (segmentWidth + segmentWidthOffset) >> 1, + (segmentHeight + segmentHeightOffset) >> 1, + curFrame->m_fencPic->m_strideC, + dsFactor, + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2, + &sum); + + sum = (sum << dsFactor); + *sumAverageIntensityCr += sum; + curFrame->m_lowres.averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex2 = + (uint8_t)((sum + (((segmentWidth + segmentWidthOffset) * (segmentHeight + segmentHeightOffset)) >> 3)) / (((segmentWidth + segmentHeightOffset) * (segmentHeight + segmentHeightOffset)) >> 2)); + + for (uint16_t histogramBin = 0; histogramBin < HISTOGRAM_NUMBER_OF_BINS; histogramBin++) { + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2histogramBin = + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2histogramBin << dsFactor; + } + } + } + return; + +} + +/* +* Compute histogram bins and luma pixel intensity * +*/ +void LookaheadTLD::computeIntensityHistogramBinsLuma( + Frame *curFrame, + uint64_t *sumAvgIntensityTotalSegmentsLuma) +{ + uint64_t sum; + + uint32_t segmentWidth = curFrame->m_lowres.quarterSampleLowResWidth / NUMBER_OF_SEGMENTS_IN_WIDTH; + uint32_t segmentHeight = curFrame->m_lowres.quarterSampleLowResHeight / NUMBER_OF_SEGMENTS_IN_HEIGHT; + + for (uint32_t segmentInFrameWidthIndex = 0; segmentInFrameWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIndex++) + { + for (uint32_t segmentInFrameHeightIndex = 0; segmentInFrameHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIndex++) + { + // Initialize bins to 1 + for (uint32_t cuIndex = 0; cuIndex < 256; cuIndex++) { + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0cuIndex = 1; + } + + uint32_t segmentWidthOffset = (segmentInFrameWidthIndex == NUMBER_OF_SEGMENTS_IN_WIDTH - 1) ? + curFrame->m_lowres.quarterSampleLowResWidth - (NUMBER_OF_SEGMENTS_IN_WIDTH * segmentWidth) : 0; + + uint32_t segmentHeightOffset = (segmentInFrameHeightIndex == NUMBER_OF_SEGMENTS_IN_HEIGHT - 1) ? 
+ curFrame->m_lowres.quarterSampleLowResHeight - (NUMBER_OF_SEGMENTS_IN_HEIGHT * segmentHeight) : 0; + + // Y Histogram + calculateHistogram( + curFrame->m_lowres.quarterSampleLowResBuffer + (curFrame->m_lowres.quarterSampleLowResOriginX + segmentInFrameWidthIndex * segmentWidth) + ((curFrame->m_lowres.quarterSampleLowResOriginY + segmentInFrameHeightIndex * segmentHeight) * curFrame->m_lowres.quarterSampleLowResStrideY), + segmentWidth + segmentWidthOffset, + segmentHeight + segmentHeightOffset, + curFrame->m_lowres.quarterSampleLowResStrideY, + 1, + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0, + &sum); + + curFrame->m_lowres.averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 = (uint8_t)((sum + (((segmentWidth + segmentWidthOffset)*(segmentWidth + segmentHeightOffset)) >> 1)) / ((segmentWidth + segmentWidthOffset)*(segmentHeight + segmentHeightOffset))); + (*sumAvgIntensityTotalSegmentsLuma) += (sum << 4); + for (uint32_t histogramBin = 0; histogramBin < HISTOGRAM_NUMBER_OF_BINS; histogramBin++) + { + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0histogramBin = + curFrame->m_lowres.picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0histogramBin << 4; + } + } + } +} + +void LookaheadTLD::collectPictureStatistics(Frame *curFrame) +{ + + uint64_t sumAverageIntensityCb = 0; + uint64_t sumAverageIntensityCr = 0; + uint64_t sumAverageIntensity = 0; + + // Histogram bins for Luma + computeIntensityHistogramBinsLuma( + curFrame, + &sumAverageIntensity); + + // Histogram bins for Chroma + computeIntensityHistogramBinsChroma( + curFrame, + &sumAverageIntensityCb, + &sumAverageIntensityCr); + + curFrame->m_lowres.averageIntensity0 = (uint8_t)((sumAverageIntensity + ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 1)) / (curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes)); + curFrame->m_lowres.averageIntensity1 = (uint8_t)((sumAverageIntensityCb + ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 3)) / ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 2)); + curFrame->m_lowres.averageIntensity2 = (uint8_t)((sumAverageIntensityCr + ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 3)) / ((curFrame->m_lowres.widthFullRes * curFrame->m_lowres.heightFullRes) >> 2)); + + computePictureStatistics(curFrame); + + curFrame->m_lowres.bHistScenecutAnalyzed = false; +} + void PreLookaheadGroup::processTasks(int workerThreadID) { if (workerThreadID < 0) @@ -1393,6 +1728,10 @@ preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc); if (m_lookahead.m_bAdaptiveQuant) tld.calcAdaptiveQuantFrame(preFrame, m_lookahead.m_param); + + if (m_lookahead.m_param->bHistBasedSceneCut) + tld.collectPictureStatistics(preFrame); + tld.lowresIntraEstimate(preFrame->m_lowres, m_lookahead.m_param->rc.qgSize); preFrame->m_lowresInit = true; @@ -1401,6 +1740,53 @@ m_lock.release(); } + +void Lookahead::placeBref(Frame** frames, int start, int end, int num, int *brefs) +{ + int avg = (start + end) / 2; + if (m_param->bEnableTemporalSubLayers < 2) + { + (*framesavg).m_lowres.sliceType = X265_TYPE_BREF; + (*brefs)++; + return; + } + else + { + if (num <= 2) + return; + else + { + (*framesavg).m_lowres.sliceType = X265_TYPE_BREF; + (*brefs)++; + placeBref(frames, start, avg, avg - start, brefs); + placeBref(frames, avg + 1, end, end - avg, brefs); + return; + } + } +} + + +void Lookahead::compCostBref(Lowres 
**frames, int start, int end, int num) +{ + CostEstimateGroup estGroup(*this, frames); + int avg = (start + end) / 2; + if (num <= 2) + { + for (int i = start; i < end; i++) + { + estGroup.singleCost(start, end + 1, i + 1); + } + return; + } + else + { + estGroup.singleCost(start, end + 1, avg + 1); + compCostBref(frames, start, avg, avg - start); + compCostBref(frames, avg + 1, end, end - avg); + return; + } +} + /* called by API thread or worker thread with inputQueueLock acquired */ void Lookahead::slicetypeDecide() { @@ -1416,6 +1802,18 @@ ScopedLock lock(m_inputLock); Frame *curFrame = m_inputQueue.first(); + if (m_param->bResetZoneConfig) + { + for (int i = 0; i < m_param->rc.zonefileCount; i++) + { + if (m_param->rc.zonesi.startFrame == curFrame->m_poc) + m_param = m_param->rc.zonesi.zoneParam; + int nextZoneStart = m_param->rc.zonesi.startFrame; + nextZoneStart += nextZoneStart ? m_param->rc.zonesi.zoneParam->radl : 0; + if (nextZoneStart < curFrame->m_poc + maxSearch && curFrame->m_poc < nextZoneStart) + maxSearch = nextZoneStart - curFrame->m_poc; + } + } int j; for (j = 0; j < m_param->bframes + 2; j++) { @@ -1502,7 +1900,7 @@ m_param->rc.cuTree || m_param->scenecutThreshold || m_param->bHistBasedSceneCut || (m_param->lookaheadDepth && m_param->rc.vbvBufferSize))) { - if(!m_param->rc.bStatRead) + if (!m_param->rc.bStatRead) slicetypeAnalyse(frames, false); bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass) @@ -1526,6 +1924,8 @@ { Lowres& frm = listbframes->m_lowres; + if (frm.sliceTypeReq != X265_TYPE_AUTO && frm.sliceTypeReq != frm.sliceType) + frm.sliceType = frm.sliceTypeReq; if (frm.sliceType == X265_TYPE_BREF && !m_param->bBPyramid && brefs == m_param->bBPyramid) { frm.sliceType = X265_TYPE_B; @@ -1583,12 +1983,9 @@ } if (frm.sliceType == X265_TYPE_IDR && frm.bScenecut && isClosedGopRadl) { - if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && frm.m_bIsHardScenecut)) - { - for (int i = bframes; i < bframes + m_param->radl; i++) - listi->m_lowres.sliceType = X265_TYPE_B; - list(bframes + m_param->radl)->m_lowres.sliceType = X265_TYPE_IDR; - } + for (int i = bframes; i < bframes + m_param->radl; i++) + listi->m_lowres.sliceType = X265_TYPE_B; + list(bframes + m_param->radl)->m_lowres.sliceType = X265_TYPE_IDR; } if (frm.sliceType == X265_TYPE_IDR) { @@ -1649,138 +2046,454 @@ break; } } - if (bframes) - listbframes - 1->m_lowres.bLastMiniGopBFrame = true; - listbframes->m_lowres.leadingBframes = bframes; - m_lastNonB = &listbframes->m_lowres; - m_histogrambframes++; - - /* insert a bref into the sequence */ - if (m_param->bBPyramid && bframes > 1 && !brefs) - { - listbframes / 2->m_lowres.sliceType = X265_TYPE_BREF; - brefs++; - } - /* calculate the frame costs ahead of time for estimateFrameCost while we still have lowres */ - if (m_param->rc.rateControlMode != X265_RC_CQP) - { - int p0, p1, b; - /* For zero latency tuning, calculate frame cost to be used later in RC */ - if (!maxSearch) + + if (m_param->bEnableTemporalSubLayers > 2) + { + //Split the partial mini GOP into sub mini GOPs when temporal sub layers are enabled + if (bframes < m_param->bframes) { - for (int i = 0; i <= bframes; i++) - framesi + 1 = &listi->m_lowres; - } + int leftOver = bframes + 1; + int8_t gopId = m_gopId - 1; + int gopLen = x265_gop_ra_lengthgopId; + int listReset = 0; - /* estimate new non-B cost */ - p1 = b = bframes + 1; - p0 = (IS_X265_TYPE_I(framesbframes + 
1->sliceType)) ? b : 0; + m_outputLock.acquire(); - CostEstimateGroup estGroup(*this, frames); + while ((gopId >= 0) && (leftOver > 3)) + { + if (leftOver < gopLen) + { + gopId = gopId - 1; + gopLen = x265_gop_ra_lengthgopId; + continue; + } + else + { + int newbFrames = listReset + gopLen - 1; + //Re-assign GOP + listnewbFrames->m_lowres.sliceType = IS_X265_TYPE_I(listnewbFrames->m_lowres.sliceType) ? listnewbFrames->m_lowres.sliceType : X265_TYPE_P; + if (newbFrames) + listnewbFrames - 1->m_lowres.bLastMiniGopBFrame = true; + listnewbFrames->m_lowres.leadingBframes = newbFrames; + m_lastNonB = &listnewbFrames->m_lowres; + + /* insert a bref into the sequence */ + if (m_param->bBPyramid && newbFrames) + { + placeBref(list, listReset, newbFrames, newbFrames + 1, &brefs); + } + if (m_param->rc.rateControlMode != X265_RC_CQP) + { + int p0, p1, b; + /* For zero latency tuning, calculate frame cost to be used later in RC */ + if (!maxSearch) + { + for (int i = listReset; i <= newbFrames; i++) + framesi + 1 = &listlistReset + i->m_lowres; + } - estGroup.singleCost(p0, p1, b); + /* estimate new non-B cost */ + p1 = b = newbFrames + 1; + p0 = (IS_X265_TYPE_I(framesnewbFrames + 1->sliceType)) ? b : listReset; - if (bframes) + CostEstimateGroup estGroup(*this, frames); + + estGroup.singleCost(p0, p1, b); + + if (newbFrames) + compCostBref(frames, listReset, newbFrames, newbFrames + 1); + } + + m_inputLock.acquire(); + /* dequeue all frames from inputQueue that are about to be enqueued + * in the output queue. The order is important because Frame can + * only be in one list at a time */ + int64_t ptsX265_BFRAME_MAX + 1; + for (int i = 0; i < gopLen; i++) + { + Frame *curFrame; + curFrame = m_inputQueue.popFront(); + ptsi = curFrame->m_pts; + maxSearch--; + } + m_inputLock.release(); + + int idx = 0; + /* add non-B to output queue */ + listnewbFrames->m_reorderedPts = ptsidx++; + listnewbFrames->m_gopOffset = 0; + listnewbFrames->m_gopId = gopId; + listnewbFrames->m_tempLayer = x265_gop_ragopId0.layer; + m_outputQueue.pushBack(*listnewbFrames); + + /* add B frames to output queue */ + int i = 1, j = 1; + while (i < gopLen) + { + int offset = listReset + (x265_gop_ragopIdj.poc_offset - 1); + if (!listoffset || offset == newbFrames) + continue; + + // Assign gop offset and temporal layer of frames + listoffset->m_gopOffset = j; + listbframes->m_gopId = gopId; + listoffset->m_tempLayer = x265_gop_ragopIdj++.layer; + + listoffset->m_reorderedPts = ptsidx++; + m_outputQueue.pushBack(*listoffset); + i++; + } + + listReset += gopLen; + leftOver = leftOver - gopLen; + gopId -= 1; + gopLen = (gopId >= 0) ? x265_gop_ra_lengthgopId : 0; + } + } + + if (leftOver > 0 && leftOver < 4) + { + int64_t ptsX265_BFRAME_MAX + 1; + int idx = 0; + + int newbFrames = listReset + leftOver - 1; + listnewbFrames->m_lowres.sliceType = IS_X265_TYPE_I(listnewbFrames->m_lowres.sliceType) ? 
listnewbFrames->m_lowres.sliceType : X265_TYPE_P; + if (newbFrames) + listnewbFrames - 1->m_lowres.bLastMiniGopBFrame = true; + listnewbFrames->m_lowres.leadingBframes = newbFrames; + m_lastNonB = &listnewbFrames->m_lowres; + + /* insert a bref into the sequence */ + if (m_param->bBPyramid && (newbFrames- listReset) > 1) + placeBref(list, listReset, newbFrames, newbFrames + 1, &brefs); + + if (m_param->rc.rateControlMode != X265_RC_CQP) + { + int p0, p1, b; + /* For zero latency tuning, calculate frame cost to be used later in RC */ + if (!maxSearch) + { + for (int i = listReset; i <= newbFrames; i++) + framesi + 1 = &listlistReset + i->m_lowres; + } + + /* estimate new non-B cost */ + p1 = b = newbFrames + 1; + p0 = (IS_X265_TYPE_I(framesnewbFrames + 1->sliceType)) ? b : listReset; + + CostEstimateGroup estGroup(*this, frames); + + estGroup.singleCost(p0, p1, b); + + if (newbFrames) + compCostBref(frames, listReset, newbFrames, newbFrames + 1); + } + + m_inputLock.acquire(); + /* dequeue all frames from inputQueue that are about to be enqueued + * in the output queue. The order is important because Frame can + * only be in one list at a time */ + for (int i = 0; i < leftOver; i++) + { + Frame *curFrame; + curFrame = m_inputQueue.popFront(); + ptsi = curFrame->m_pts; + maxSearch--; + } + m_inputLock.release(); + + m_lastNonB = &listnewbFrames->m_lowres; + listnewbFrames->m_reorderedPts = ptsidx++; + listnewbFrames->m_gopOffset = 0; + listnewbFrames->m_gopId = -1; + listnewbFrames->m_tempLayer = 0; + m_outputQueue.pushBack(*listnewbFrames); + if (brefs) + { + for (int i = listReset; i < newbFrames; i++) + { + if (listi->m_lowres.sliceType == X265_TYPE_BREF) + { + listi->m_reorderedPts = ptsidx++; + listi->m_gopOffset = 0; + listi->m_gopId = -1; + listi->m_tempLayer = 0; + m_outputQueue.pushBack(*listi); + } + } + } + + /* add B frames to output queue */ + for (int i = listReset; i < newbFrames; i++) + { + /* push all the B frames into output queue except B-ref, which already pushed into output queue */ + if (listi->m_lowres.sliceType != X265_TYPE_BREF) + { + listi->m_reorderedPts = ptsidx++; + listi->m_gopOffset = 0; + listi->m_gopId = -1; + listi->m_tempLayer = 1; + m_outputQueue.pushBack(*listi); + } + } + } + } + else + // Fill the complete mini GOP when temporal sub layers are enabled { - p0 = 0; // last nonb - bool isp0available = framesbframes + 1->sliceType == X265_TYPE_IDR ? false : true; - for (b = 1; b <= bframes; b++) + listbframes - 1->m_lowres.bLastMiniGopBFrame = true; + listbframes->m_lowres.leadingBframes = bframes; + m_lastNonB = &listbframes->m_lowres; + + /* insert a bref into the sequence */ + if (m_param->bBPyramid && !brefs) { - if (!isp0available) - p0 = b; + placeBref(list, 0, bframes, bframes + 1, &brefs); + } - if (framesb->sliceType == X265_TYPE_B) - for (p1 = b; framesp1->sliceType == X265_TYPE_B; p1++) - ; // find new nonb or bref - else - p1 = bframes + 1; + /* calculate the frame costs ahead of time for estimateFrameCost while we still have lowres */ + if (m_param->rc.rateControlMode != X265_RC_CQP) + { + int p0, p1, b; + /* For zero latency tuning, calculate frame cost to be used later in RC */ + if (!maxSearch) + { + for (int i = 0; i <= bframes; i++) + framesi + 1 = &listi->m_lowres; + } + /* estimate new non-B cost */ + p1 = b = bframes + 1; + p0 = (IS_X265_TYPE_I(framesbframes + 1->sliceType)) ? 
b : 0; + + CostEstimateGroup estGroup(*this, frames); estGroup.singleCost(p0, p1, b); - if (framesb->sliceType == X265_TYPE_BREF) + compCostBref(frames, 0, bframes, bframes + 1); + } + + m_inputLock.acquire(); + /* dequeue all frames from inputQueue that are about to be enqueued + * in the output queue. The order is important because Frame can + * only be in one list at a time */ + int64_t ptsX265_BFRAME_MAX + 1; + for (int i = 0; i <= bframes; i++) + { + Frame *curFrame; + curFrame = m_inputQueue.popFront(); + ptsi = curFrame->m_pts; + maxSearch--; + } + m_inputLock.release(); + + m_outputLock.acquire(); + + int idx = 0; + /* add non-B to output queue */ + listbframes->m_reorderedPts = ptsidx++; + listbframes->m_gopOffset = 0; + listbframes->m_gopId = m_gopId; + listbframes->m_tempLayer = x265_gop_ram_gopId0.layer; + m_outputQueue.pushBack(*listbframes); + + int i = 1, j = 1; + while (i <= bframes) + { + int offset = x265_gop_ram_gopIdj.poc_offset - 1; + if (!listoffset || offset == bframes) + continue; + + // Assign gop offset and temporal layer of frames + listoffset->m_gopOffset = j; + listoffset->m_gopId = m_gopId; + listoffset->m_tempLayer = x265_gop_ram_gopIdj++.layer; + + /* add B frames to output queue */ + listoffset->m_reorderedPts = ptsidx++; + m_outputQueue.pushBack(*listoffset); + i++; + } + } + + bool isKeyFrameAnalyse = (m_param->rc.cuTree || (m_param->rc.vbvBufferSize && m_param->lookaheadDepth)); + if (isKeyFrameAnalyse && IS_X265_TYPE_I(m_lastNonB->sliceType)) + { + m_inputLock.acquire(); + Frame *curFrame = m_inputQueue.first(); + frames0 = m_lastNonB; + int j; + for (j = 0; j < maxSearch; j++) + { + framesj + 1 = &curFrame->m_lowres; + curFrame = curFrame->m_next; + } + m_inputLock.release(); + + framesj + 1 = NULL; + if (!m_param->rc.bStatRead) + slicetypeAnalyse(frames, true); + bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; + if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass) + { + int numFrames; + for (numFrames = 0; numFrames < maxSearch; numFrames++) { - p0 = b; - isp0available = true; + Lowres *fenc = framesnumFrames + 1; + if (!fenc) + break; } + vbvLookahead(frames, numFrames, true); } } - } - m_inputLock.acquire(); - /* dequeue all frames from inputQueue that are about to be enqueued - * in the output queue. 
The order is important because Frame can - * only be in one list at a time */ - int64_t ptsX265_BFRAME_MAX + 1; - for (int i = 0; i <= bframes; i++) - { - Frame *curFrame; - curFrame = m_inputQueue.popFront(); - ptsi = curFrame->m_pts; - maxSearch--; - } - m_inputLock.release(); - m_outputLock.acquire(); - /* add non-B to output queue */ - int idx = 0; - listbframes->m_reorderedPts = ptsidx++; - m_outputQueue.pushBack(*listbframes); - /* Add B-ref frame next to P frame in output queue, the B-ref encode before non B-ref frame */ - if (brefs) + m_outputLock.release(); + } + else { - for (int i = 0; i < bframes; i++) + + if (bframes) + listbframes - 1->m_lowres.bLastMiniGopBFrame = true; + listbframes->m_lowres.leadingBframes = bframes; + m_lastNonB = &listbframes->m_lowres; + + /* insert a bref into the sequence */ + if (m_param->bBPyramid && bframes > 1 && !brefs) { - if (listi->m_lowres.sliceType == X265_TYPE_BREF) + placeBref(list, 0, bframes, bframes + 1, &brefs); + } + /* calculate the frame costs ahead of time for estimateFrameCost while we still have lowres */ + if (m_param->rc.rateControlMode != X265_RC_CQP) + { + int p0, p1, b; + /* For zero latency tuning, calculate frame cost to be used later in RC */ + if (!maxSearch) { - listi->m_reorderedPts = ptsidx++; - m_outputQueue.pushBack(*listi); + for (int i = 0; i <= bframes; i++) + framesi + 1 = &listi->m_lowres; + } + + /* estimate new non-B cost */ + p1 = b = bframes + 1; + p0 = (IS_X265_TYPE_I(framesbframes + 1->sliceType)) ? b : 0; + + CostEstimateGroup estGroup(*this, frames); + estGroup.singleCost(p0, p1, b); + + if (m_param->bEnableTemporalSubLayers > 1 && bframes) + { + compCostBref(frames, 0, bframes, bframes + 1); + } + else + { + if (bframes) + { + p0 = 0; // last nonb + bool isp0available = framesbframes + 1->sliceType == X265_TYPE_IDR ? false : true; + + for (b = 1; b <= bframes; b++) + { + if (!isp0available) + p0 = b; + + if (framesb->sliceType == X265_TYPE_B) + for (p1 = b; framesp1->sliceType == X265_TYPE_B; p1++) + ; // find new nonb or bref + else + p1 = bframes + 1; + + estGroup.singleCost(p0, p1, b); + + if (framesb->sliceType == X265_TYPE_BREF) + { + p0 = b; + isp0available = true; + } + } + } } } - } - /* add B frames to output queue */ - for (int i = 0; i < bframes; i++) - { - /* push all the B frames into output queue except B-ref, which already pushed into output queue */ - if (listi->m_lowres.sliceType != X265_TYPE_BREF) + m_inputLock.acquire(); + /* dequeue all frames from inputQueue that are about to be enqueued + * in the output queue. 
The order is important because Frame can + * only be in one list at a time */ + int64_t ptsX265_BFRAME_MAX + 1; + for (int i = 0; i <= bframes; i++) + { + Frame *curFrame; + curFrame = m_inputQueue.popFront(); + ptsi = curFrame->m_pts; + maxSearch--; + } + m_inputLock.release(); + + m_outputLock.acquire(); + + /* add non-B to output queue */ + int idx = 0; + listbframes->m_reorderedPts = ptsidx++; + m_outputQueue.pushBack(*listbframes); + + /* Add B-ref frame next to P frame in output queue, the B-ref encode before non B-ref frame */ + if (brefs) { - listi->m_reorderedPts = ptsidx++; - m_outputQueue.pushBack(*listi); + for (int i = 0; i < bframes; i++) + { + if (listi->m_lowres.sliceType == X265_TYPE_BREF) + { + listi->m_reorderedPts = ptsidx++; + m_outputQueue.pushBack(*listi); + } + } } - } - bool isKeyFrameAnalyse = (m_param->rc.cuTree || (m_param->rc.vbvBufferSize && m_param->lookaheadDepth)); - if (isKeyFrameAnalyse && IS_X265_TYPE_I(m_lastNonB->sliceType)) - { - m_inputLock.acquire(); - Frame *curFrame = m_inputQueue.first(); - frames0 = m_lastNonB; - int j; - for (j = 0; j < maxSearch; j++) + /* add B frames to output queue */ + for (int i = 0; i < bframes; i++) { - framesj + 1 = &curFrame->m_lowres; - curFrame = curFrame->m_next; + /* push all the B frames into output queue except B-ref, which already pushed into output queue */ + if (listi->m_lowres.sliceType != X265_TYPE_BREF) + { + listi->m_reorderedPts = ptsidx++; + m_outputQueue.pushBack(*listi); + } } - m_inputLock.release(); - framesj + 1 = NULL; - if (!m_param->rc.bStatRead) - slicetypeAnalyse(frames, true); - bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; - if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass) + + bool isKeyFrameAnalyse = (m_param->rc.cuTree || (m_param->rc.vbvBufferSize && m_param->lookaheadDepth)); + if (isKeyFrameAnalyse && IS_X265_TYPE_I(m_lastNonB->sliceType)) { - int numFrames; - for (numFrames = 0; numFrames < maxSearch; numFrames++) + m_inputLock.acquire(); + Frame *curFrame = m_inputQueue.first(); + frames0 = m_lastNonB; + int j; + for (j = 0; j < maxSearch; j++) + { + framesj + 1 = &curFrame->m_lowres; + curFrame = curFrame->m_next; + } + m_inputLock.release(); + + framesj + 1 = NULL; + if (!m_param->rc.bStatRead) + slicetypeAnalyse(frames, true); + bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; + if ((m_param->analysisLoad && m_param->scaleFactor && bIsVbv) || m_param->bliveVBV2pass) { - Lowres *fenc = framesnumFrames + 1; - if (!fenc) - break; + int numFrames; + for (numFrames = 0; numFrames < maxSearch; numFrames++) + { + Lowres *fenc = framesnumFrames + 1; + if (!fenc) + break; + } + vbvLookahead(frames, numFrames, true); } - vbvLookahead(frames, numFrames, true); } + + m_outputLock.release(); } - m_outputLock.release(); } void Lookahead::vbvLookahead(Lowres **frames, int numFrames, int keyframe) @@ -1909,6 +2622,8 @@ nextZoneStart += (i + 1 < m_param->rc.zonefileCount) ? 
m_param->rc.zonesi + 1.startFrame + m_param->rc.zonesi + 1.zoneParam->radl : m_param->totalFrames; if (curZoneStart <= frames0->frameNum && nextZoneStart > frames0->frameNum) m_param->keyframeMax = nextZoneStart - curZoneStart; + if (m_param->rc.zonesm_param->rc.zonefileCount - 1.startFrame <= frames0->frameNum && nextZoneStart == 0) + m_param->keyframeMax = m_param->rc.zones0.keyframeMax; } } int keylimit = m_param->keyframeMax; @@ -2013,44 +2728,13 @@ int numAnalyzed = numFrames; bool isScenecut = false; - /* Temporal computations for scenecut detection */ if (m_param->bHistBasedSceneCut) - { - for (int i = numFrames - 1; i > 0; i--) - { - if (framesi->interPCostPercDiff > 0.0) - continue; - int64_t interCost = framesi->costEst10; - int64_t intraCost = framesi->costEst00; - if (interCost < 0 || intraCost < 0) - continue; - int times = 0; - double averagePcost = 0.0, averageIcost = 0.0; - for (int j = i - 1; j >= 0 && times < 5; j--, times++) - { - if (framesj->costEst00 > 0 && framesj->costEst10 > 0) - { - averageIcost += framesj->costEst00; - averagePcost += framesj->costEst10; - } - else - times--; - } - if (times) - { - averageIcost = averageIcost / times; - averagePcost = averagePcost / times; - framesi->interPCostPercDiff = abs(interCost - averagePcost) / X265_MIN(interCost, averagePcost) * 100; - framesi->intraCostPercDiff = abs(intraCost - averageIcost) / X265_MIN(intraCost, averageIcost) * 100; - } - } - } - - /* When scenecut threshold is set, use scenecut detection for I frame placements */ - if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && frames1->bScenecut)) + isScenecut = histBasedScenecut(frames, 0, 1, origNumFrames); + else isScenecut = scenecut(frames, 0, 1, true, origNumFrames); - if (isScenecut && (m_param->bHistBasedSceneCut || m_param->scenecutThreshold)) + /* When scenecut threshold is set, use scenecut detection for I frame placements */ + if (m_param->scenecutThreshold && isScenecut) { frames1->sliceType = X265_TYPE_I; return; @@ -2061,8 +2745,7 @@ m_extendGopBoundary = false; for (int i = m_param->bframes + 1; i < origNumFrames; i += m_param->bframes + 1) { - if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && framesi + 1->bScenecut)) - scenecut(frames, i, i + 1, true, origNumFrames); + scenecut(frames, i, i + 1, true, origNumFrames); for (int j = i + 1; j <= X265_MIN(i + m_param->bframes + 1, origNumFrames); j++) { @@ -2175,10 +2858,8 @@ { for (int j = 1; j < numBFrames + 1; j++) { - bool isNextScenecut = false; - if (!m_param->bHistBasedSceneCut || (m_param->bHistBasedSceneCut && framesj + 1->bScenecut)) - isNextScenecut = scenecut(frames, j, j + 1, false, origNumFrames); - if (isNextScenecut || (bForceRADL && framesj->frameNum == preRADL)) + if (scenecut(frames, j, j + 1, false, origNumFrames) || + (bForceRADL && (framesj->frameNum == preRADL))) { framesj->sliceType = X265_TYPE_P; numAnalyzed = j; @@ -2244,9 +2925,10 @@ /* Where A and B are scenes: AAAAAABBBAAAAAA * If BBB is shorter than (maxp1-p0), it is detected as a flash * and not considered a scenecut. */ + for (int cp1 = p1; cp1 <= maxp1; cp1++) { - if (!scenecutInternal(frames, p0, cp1, false) && !m_param->bHistBasedSceneCut) + if (!scenecutInternal(frames, p0, cp1, false)) { /* Any frame in between p0 and cur_p1 cannot be a real scenecut. 
*/ for (int i = cp1; i > p0; i--) @@ -2255,7 +2937,7 @@ noScenecuts = false; } } - else if ((m_param->bHistBasedSceneCut && framescp1->m_bIsMaxThres) || scenecutInternal(frames, cp1 - 1, cp1, false)) + else if (scenecutInternal(frames, cp1 - 1, cp1, false)) { /* If current frame is a Scenecut from p0 frame as well as Scenecut from * preceeding frame, mark it as a Scenecut */ @@ -2316,9 +2998,6 @@ if (!framesp1->bScenecut) return false; - /* Check only scene transitions if max threshold */ - if (m_param->bHistBasedSceneCut && framesp1->m_bIsMaxThres) - return framesp1->bScenecut; return scenecutInternal(frames, p0, p1, bRealScenecut); } @@ -2336,19 +3015,8 @@ /* magic numbers pulled out of thin air */ float threshMin = (float)(threshMax * 0.25); double bias = m_param->scenecutBias; - if (m_param->bHistBasedSceneCut) - { - double minT = TEMPORAL_SCENECUT_THRESHOLD * (1 + m_param->edgeTransitionThreshold); - if (frame->interPCostPercDiff > minT || frame->intraCostPercDiff > minT) - { - if (bRealScenecut && frame->bScenecut) - x265_log(m_param, X265_LOG_DEBUG, "scene cut at %d \n", frame->frameNum); - return frame->bScenecut; - } - else - return false; - } - else if (bRealScenecut) + + if (bRealScenecut) { if (m_param->keyframeMin == m_param->keyframeMax) threshMin = threshMax; @@ -2375,6 +3043,167 @@ return res; } +bool Lookahead::detectHistBasedSceneChange(Lowres **frames, int p0, int p1, int p2) +{ + bool isAbruptChange; + bool isSceneChange; + + Lowres *previousFrame = framesp0; + Lowres *currentFrame = framesp1; + Lowres *futureFrame = framesp2; + + currentFrame->bHistScenecutAnalyzed = true; + + uint32_t **accHistDiffRunningAvgCb = m_accHistDiffRunningAvgCb; + uint32_t **accHistDiffRunningAvgCr = m_accHistDiffRunningAvgCr; + uint32_t **accHistDiffRunningAvg = m_accHistDiffRunningAvg; + + uint8_t absIntDiffFuturePast = 0; + uint8_t absIntDiffFuturePresent = 0; + uint8_t absIntDiffPresentPast = 0; + + uint32_t abruptChangeCount = 0; + uint32_t sceneChangeCount = 0; + + uint32_t segmentWidth = frames1->widthFullRes / NUMBER_OF_SEGMENTS_IN_WIDTH; + uint32_t segmentHeight = frames1->heightFullRes / NUMBER_OF_SEGMENTS_IN_HEIGHT; + + for (uint32_t segmentInFrameWidthIndex = 0; segmentInFrameWidthIndex < NUMBER_OF_SEGMENTS_IN_WIDTH; segmentInFrameWidthIndex++) + { + for (uint32_t segmentInFrameHeightIndex = 0; segmentInFrameHeightIndex < NUMBER_OF_SEGMENTS_IN_HEIGHT; segmentInFrameHeightIndex++) + { + isAbruptChange = false; + isSceneChange = false; + + // accumulative absolute histogram differences between the past and current frame + uint32_t accHistDiff = 0; + uint32_t accHistDiffCb = 0; + uint32_t accHistDiffCr = 0; + + uint32_t segmentWidthOffset = (segmentInFrameWidthIndex == NUMBER_OF_SEGMENTS_IN_WIDTH - 1) ? + frames1->widthFullRes - (NUMBER_OF_SEGMENTS_IN_WIDTH * segmentWidth) : 0; + + uint32_t segmentHeightOffset = (segmentInFrameHeightIndex == NUMBER_OF_SEGMENTS_IN_HEIGHT - 1) ? + frames1->heightFullRes - (NUMBER_OF_SEGMENTS_IN_HEIGHT * segmentHeight) : 0; + + segmentWidth += segmentWidthOffset; + segmentHeight += segmentHeightOffset; + + uint32_t segmentThreshHold = ( + ((X265_ABS((int64_t)currentFrame->picAvgVariance - (int64_t)previousFrame->picAvgVariance)) > PICTURE_DIFF_VARIANCE_TH) && + (currentFrame->picAvgVariance > PICTURE_VARIANCE_TH || previousFrame->picAvgVariance > PICTURE_VARIANCE_TH)) ? 
+ HIGH_VAR_SCENE_CHANGE_TH * NUM64x64INPIC(segmentWidth, segmentHeight) : LOW_VAR_SCENE_CHANGE_TH * NUM64x64INPIC(segmentWidth, segmentHeight); + + uint32_t segmentThreshHoldCb = ( + ((X265_ABS((int64_t)currentFrame->picAvgVarianceCb - (int64_t)previousFrame->picAvgVarianceCb)) > PICTURE_DIFF_VARIANCE_CHROMA_TH) && + (currentFrame->picAvgVarianceCb > PICTURE_VARIANCE_CHROMA_TH || previousFrame->picAvgVarianceCb > PICTURE_VARIANCE_CHROMA_TH)) ? + HIGH_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight) : LOW_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight); + + uint32_t segmentThreshHoldCr = ( + ((X265_ABS((int64_t)currentFrame->picAvgVarianceCr - (int64_t)previousFrame->picAvgVarianceCr)) > PICTURE_DIFF_VARIANCE_CHROMA_TH) && + (currentFrame->picAvgVarianceCr > PICTURE_VARIANCE_CHROMA_TH || previousFrame->picAvgVarianceCr > PICTURE_VARIANCE_CHROMA_TH)) ? + HIGH_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight) : LOW_VAR_SCENE_CHANGE_CHROMA_TH * NUM64x64INPIC(segmentWidth, segmentHeight); + + for (uint32_t bin = 0; bin < HISTOGRAM_NUMBER_OF_BINS; ++bin) { + accHistDiff += X265_ABS((int32_t)currentFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0bin - (int32_t)previousFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex0bin); + accHistDiffCb += X265_ABS((int32_t)currentFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1bin - (int32_t)previousFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex1bin); + accHistDiffCr += X265_ABS((int32_t)currentFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2bin - (int32_t)previousFrame->picHistogramsegmentInFrameWidthIndexsegmentInFrameHeightIndex2bin); + } + + if (m_resetRunningAvg) { + accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex = accHistDiff; + accHistDiffRunningAvgCbsegmentInFrameWidthIndexsegmentInFrameHeightIndex = accHistDiffCb; + accHistDiffRunningAvgCrsegmentInFrameWidthIndexsegmentInFrameHeightIndex = accHistDiffCr; + } + + // difference between accumulative absolute histogram differences and the running average at the current frame. 
+ uint32_t accHistDiffError = X265_ABS((int32_t)accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex - (int32_t)accHistDiff); + uint32_t accHistDiffErrorCb = X265_ABS((int32_t)accHistDiffRunningAvgCbsegmentInFrameWidthIndexsegmentInFrameHeightIndex - (int32_t)accHistDiffCb); + uint32_t accHistDiffErrorCr = X265_ABS((int32_t)accHistDiffRunningAvgCrsegmentInFrameWidthIndexsegmentInFrameHeightIndex - (int32_t)accHistDiffCr); + + if ((accHistDiffError > segmentThreshHold && accHistDiff >= accHistDiffError) || + (accHistDiffErrorCb > segmentThreshHoldCb && accHistDiffCb >= accHistDiffErrorCb) || + (accHistDiffErrorCr > segmentThreshHoldCr && accHistDiffCr >= accHistDiffErrorCr)) { + + isAbruptChange = true; + } + + if (isAbruptChange) + { + absIntDiffFuturePast = (uint8_t)X265_ABS((int16_t)futureFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 - (int16_t)previousFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0); + absIntDiffFuturePresent = (uint8_t)X265_ABS((int16_t)futureFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 - (int16_t)currentFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0); + absIntDiffPresentPast = (uint8_t)X265_ABS((int16_t)currentFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0 - (int16_t)previousFrame->averageIntensityPerSegmentsegmentInFrameWidthIndexsegmentInFrameHeightIndex0); + + if (absIntDiffFuturePresent >= FLASH_TH * absIntDiffFuturePast && absIntDiffPresentPast >= FLASH_TH * absIntDiffFuturePast) { + x265_log(m_param, X265_LOG_DEBUG, "Flash in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast); + } + else if (absIntDiffFuturePresent < FADE_TH && absIntDiffPresentPast < FADE_TH) { + x265_log(m_param, X265_LOG_DEBUG, "Fade in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast); + } + else if (X265_ABS(absIntDiffFuturePresent - absIntDiffPresentPast) < INTENSITY_CHANGE_TH && absIntDiffFuturePresent + absIntDiffPresentPast >= absIntDiffFuturePast) { + x265_log(m_param, X265_LOG_DEBUG, "Intensity Change in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast); + } + else { + isSceneChange = true; + x265_log(m_param, X265_LOG_DEBUG, "Scene change in frame# %i , %i, %i, %i\n", currentFrame->frameNum, absIntDiffFuturePast, absIntDiffFuturePresent, absIntDiffPresentPast); + } + + } + else { + accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex = (3 * accHistDiffRunningAvgsegmentInFrameWidthIndexsegmentInFrameHeightIndex + accHistDiff) / 4; + } + + abruptChangeCount += isAbruptChange; + sceneChangeCount += isSceneChange; + } + } + + if (abruptChangeCount >= m_segmentCountThreshold) { + m_resetRunningAvg = true; + } + else { + m_resetRunningAvg = false; + } + + if ((sceneChangeCount >= m_segmentCountThreshold)) { + x265_log(m_param, X265_LOG_DEBUG, "Scene Change in Pic Number# %i\n", currentFrame->frameNum); + + return true; + } + else { + return false; + } + +} + +bool Lookahead::histBasedScenecut(Lowres **frames, int p0, int p1, int numFrames) +{ + /* Only do analysis during a normal scenecut check. */ + if (m_param->bframes) + { + int origmaxp1 = p0 + 1; + /* Look ahead to avoid coding short flashes as scenecuts. 
*/ + origmaxp1 += m_param->bframes; + int maxp1 = X265_MIN(origmaxp1, numFrames); + + for (int cp1 = p0; cp1 < maxp1; cp1++) + { + if (framescp1 + 1->bHistScenecutAnalyzed == true) + continue; + + if (framescp1 + 2 != NULL && detectHistBasedSceneChange(frames, cp1, cp1 + 1, cp1 + 2)) + { + /* If current frame is a Scenecut from p0 frame as well as Scenecut from + * preceeding frame, mark it as a Scenecut */ + framescp1+1->bScenecut = true; + } + } + + } + + return framesp1->bScenecut; +} + void Lookahead::slicetypePath(Lowres **frames, int length, char(*best_paths)X265_LOOKAHEAD_MAX + 1) { char paths2X265_LOOKAHEAD_MAX + 1; @@ -2404,6 +3233,27 @@ memcpy(best_pathslength % (X265_BFRAME_MAX + 1), pathsidx ^ 1, length); } +// Find slicetype of the frame with poc # in lookahead buffer +int Lookahead::findSliceType(int poc) +{ + int out_slicetype = X265_TYPE_AUTO; + if (m_filled) + { + m_outputLock.acquire(); + Frame* out = m_outputQueue.first(); + while (out != NULL) { + if (poc == out->m_poc) + { + out_slicetype = out->m_lowres.sliceType; + break; + } + out = out->m_next; + } + m_outputLock.release(); + } + return out_slicetype; +} + int64_t Lookahead::slicetypePathCost(Lowres **frames, char *path, int64_t threshold) { int64_t cost = 0;
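At the core of the new histogram-based scene-cut path added above is a per-segment test: the accumulated absolute difference between the current and previous frame's intensity histograms is compared against a running average, using a threshold that depends on picture variance. Below is a minimal single-channel sketch of that decision, assuming the names visible in the diff (256 bins, the (3*avg + diff)/4 running-average update); the function name and the flat 'threshold' parameter are illustrative only — the real detectHistBasedSceneChange() evaluates luma and both chroma planes per segment with variance-dependent thresholds and only declares a scene cut once m_segmentCountThreshold segments trip.

#include <cstdint>
#include <cstdlib>

static const int HISTOGRAM_NUMBER_OF_BINS = 256;

// Sketch of the per-segment abrupt-change test used by the lookahead.
bool segmentIsAbruptChange(const uint32_t curHist[HISTOGRAM_NUMBER_OF_BINS],
                           const uint32_t prevHist[HISTOGRAM_NUMBER_OF_BINS],
                           uint32_t& runningAvg, uint32_t threshold)
{
    // accumulated absolute histogram difference between previous and current frame
    uint32_t accHistDiff = 0;
    for (int bin = 0; bin < HISTOGRAM_NUMBER_OF_BINS; bin++)
        accHistDiff += (uint32_t)std::abs((int32_t)curHist[bin] - (int32_t)prevHist[bin]);

    // deviation of the current difference from the running average
    uint32_t error = (uint32_t)std::abs((int32_t)runningAvg - (int32_t)accHistDiff);
    bool abrupt = (error > threshold) && (accHistDiff >= error);

    if (!abrupt) // smooth the running average only when no abrupt change is seen
        runningAvg = (3 * runningAvg + accHistDiff) / 4;

    return abrupt;
}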
View file
x265_3.5.tar.gz/source/encoder/slicetype.h -> x265_3.6.tar.gz/source/encoder/slicetype.h
Changed
@@ -44,6 +44,24 @@ #define EDGE_INCLINATION 45 #define TEMPORAL_SCENECUT_THRESHOLD 50 +#define X265_ABS(a) (((a) < 0) ? (-(a)) : (a)) + +#define PICTURE_DIFF_VARIANCE_TH 390 +#define PICTURE_VARIANCE_TH 1500 +#define LOW_VAR_SCENE_CHANGE_TH 2250 +#define HIGH_VAR_SCENE_CHANGE_TH 3500 + +#define PICTURE_DIFF_VARIANCE_CHROMA_TH 10 +#define PICTURE_VARIANCE_CHROMA_TH 20 +#define LOW_VAR_SCENE_CHANGE_CHROMA_TH 2250/4 +#define HIGH_VAR_SCENE_CHANGE_CHROMA_TH 3500/4 + +#define FLASH_TH 1.5 +#define FADE_TH 4 +#define INTENSITY_CHANGE_TH 4 + +#define NUM64x64INPIC(w,h) ((w*h)>> (MAX_LOG2_CU_SIZE<<1)) + #if HIGH_BIT_DEPTH #define EDGE_THRESHOLD 1023.0 #else @@ -93,7 +111,29 @@ ~LookaheadTLD() { X265_FREE(wbuffer0); } + void collectPictureStatistics(Frame *curFrame); + void computeIntensityHistogramBinsLuma(Frame *curFrame, uint64_t *sumAvgIntensityTotalSegmentsLuma); + + void computeIntensityHistogramBinsChroma( + Frame *curFrame, + uint64_t *sumAverageIntensityCb, + uint64_t *sumAverageIntensityCr); + + void calculateHistogram( + pixel *inputSrc, + uint32_t inputWidth, + uint32_t inputHeight, + intptr_t stride, + uint8_t dsFactor, + uint32_t *histogram, + uint64_t *sum); + + void computePictureStatistics(Frame *curFrame); + + uint32_t calcVariance(pixel* src, intptr_t stride, intptr_t blockOffset, uint32_t plane); + void calcAdaptiveQuantFrame(Frame *curFrame, x265_param* param); + void calcFrameSegment(Frame *curFrame); void lowresIntraEstimate(Lowres& fenc, uint32_t qgSize); void weightsAnalyse(Lowres& fenc, Lowres& ref); @@ -124,7 +164,6 @@ /* pre-lookahead */ int m_fullQueueSize; - int m_histogramX265_BFRAME_MAX + 1; int m_lastKeyframe; int m_8x8Width; int m_8x8Height; @@ -153,6 +192,16 @@ bool m_isFadeIn; uint64_t m_fadeCount; int m_fadeStart; + + uint32_t **m_accHistDiffRunningAvgCb; + uint32_t **m_accHistDiffRunningAvgCr; + uint32_t **m_accHistDiffRunningAvg; + + bool m_resetRunningAvg; + uint32_t m_segmentCountThreshold; + + int8_t m_gopId; + Lookahead(x265_param *param, ThreadPool *pool); #if DETAILED_CU_STATS int64_t m_slicetypeDecideElapsedTime; @@ -174,6 +223,7 @@ void getEstimatedPictureCost(Frame *pic); void setLookaheadQueue(); + int findSliceType(int poc); protected: @@ -184,6 +234,10 @@ /* called by slicetypeAnalyse() to make slice decisions */ bool scenecut(Lowres **frames, int p0, int p1, bool bRealScenecut, int numFrames); bool scenecutInternal(Lowres **frames, int p0, int p1, bool bRealScenecut); + + bool histBasedScenecut(Lowres **frames, int p0, int p1, int numFrames); + bool detectHistBasedSceneChange(Lowres **frames, int p0, int p1, int p2); + void slicetypePath(Lowres **frames, int length, char(*best_paths)X265_LOOKAHEAD_MAX + 1); int64_t slicetypePathCost(Lowres **frames, char *path, int64_t threshold); int64_t vbvFrameCost(Lowres **frames, int p0, int p1, int b); @@ -199,6 +253,9 @@ /* called by getEstimatedPictureCost() to finalize cuTree costs */ int64_t frameCostRecalculate(Lowres **frames, int p0, int p1, int b); + /*Compute index for positioning B-Ref frames*/ + void placeBref(Frame** frames, int start, int end, int num, int *brefs); + void compCostBref(Lowres **frame, int start, int end, int num); }; class PreLookaheadGroup : public BondedTaskGroup
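The FLASH_TH, FADE_TH and INTENSITY_CHANGE_TH constants added here drive the classification step in slicetype.cpp above: once a segment shows an abrupt histogram change, the average intensities of the past, current and future frames decide whether it is a flash, a fade, a uniform intensity change, or a genuine scene change. A standalone illustration of that classification follows, with the constant values copied from the defines above; the enum, local constant names and function name are not part of the encoder.

#include <cstdint>
#include <cstdlib>

enum ChangeKind { FLASH, FADE, INTENSITY_CHANGE, SCENE_CHANGE };

// past/present/future are per-segment average luma intensities (0..255),
// as collected into averageIntensityPerSegment by the lookahead.
ChangeKind classifyAbruptChange(uint8_t past, uint8_t present, uint8_t future)
{
    const double flashTh = 1.5;      // FLASH_TH
    const int fadeTh = 4;            // FADE_TH
    const int intensityChangeTh = 4; // INTENSITY_CHANGE_TH

    int diffFuturePast    = std::abs(future - past);
    int diffFuturePresent = std::abs(future - present);
    int diffPresentPast   = std::abs(present - past);

    if (diffFuturePresent >= flashTh * diffFuturePast &&
        diffPresentPast   >= flashTh * diffFuturePast)
        return FLASH;            // past and future agree, only the current frame differs
    if (diffFuturePresent < fadeTh && diffPresentPast < fadeTh)
        return FADE;             // gradual transition: small per-frame intensity steps
    if (std::abs(diffFuturePresent - diffPresentPast) < intensityChangeTh &&
        diffFuturePresent + diffPresentPast >= diffFuturePast)
        return INTENSITY_CHANGE; // global brightness shift rather than new content
    return SCENE_CHANGE;
}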
View file
x265_3.5.tar.gz/source/output/output.cpp -> x265_3.6.tar.gz/source/output/output.cpp
Changed
@@ -30,14 +30,14 @@
 
 using namespace X265_NS;
 
-ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
+ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp, int sourceBitDepth)
 {
     const char * s = strrchr(fname, '.');
 
     if (s && !strcmp(s, ".y4m"))
-        return new Y4MOutput(fname, width, height, fpsNum, fpsDenom, csp);
+        return new Y4MOutput(fname, width, height, bitdepth, fpsNum, fpsDenom, csp, sourceBitDepth);
     else
-        return new YUVOutput(fname, width, height, bitdepth, csp);
+        return new YUVOutput(fname, width, height, bitdepth, csp, sourceBitDepth);
 }
 
 OutputFile* OutputFile::open(const char *fname, InputFileInfo& inputInfo)
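A hypothetical caller-side sketch of the widened ReconFile::open() signature: the extra sourceBitDepth argument lets the Y4M/YUV writers compare the reconstruction depth against the input depth and decide whether samples need down-shifting on write (see y4m.cpp and yuv.cpp further down). All concrete values here are illustrative and this is not the actual x265 CLI wiring.

#include "x265.h"
#include "output/output.h" // declares ReconFile (paths as laid out in the x265 source tree)

using namespace X265_NS;

ReconFile* openReconSketch()
{
    ReconFile* recon = ReconFile::open("recon.y4m",
                                       1920, 1080,    // width, height
                                       10,            // reconstruction bit depth
                                       25, 1,         // fps numerator / denominator
                                       X265_CSP_I420, // chroma subsampling
                                       8);            // source (input) bit depth
    if (!recon || recon->isFail())
        return NULL;                                  // handle open failure
    return recon;
}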
View file
x265_3.5.tar.gz/source/output/output.h -> x265_3.6.tar.gz/source/output/output.h
Changed
@@ -42,7 +42,7 @@
     ReconFile() {}
 
     static ReconFile* open(const char *fname, int width, int height, uint32_t bitdepth,
-                           uint32_t fpsNum, uint32_t fpsDenom, int csp);
+                           uint32_t fpsNum, uint32_t fpsDenom, int csp, int sourceBitDepth);
 
     virtual bool isFail() const = 0;
View file
x265_3.5.tar.gz/source/output/y4m.cpp -> x265_3.6.tar.gz/source/output/y4m.cpp
Changed
@@ -28,11 +28,13 @@ using namespace X265_NS; using namespace std; -Y4MOutput::Y4MOutput(const char *filename, int w, int h, uint32_t fpsNum, uint32_t fpsDenom, int csp) +Y4MOutput::Y4MOutput(const char* filename, int w, int h, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp, int inputdepth) : width(w) , height(h) + , bitDepth(bitdepth) , colorSpace(csp) , frameSize(0) + , inputDepth(inputdepth) { ofs.open(filename, ios::binary | ios::out); buf = new charwidth; @@ -41,7 +43,13 @@ if (ofs) { - ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "\n"; + if (bitDepth == 10) + ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "p10" << " XYSCSS = " << cf << "P10" << "\n"; + else if (bitDepth == 12) + ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "p12" << " XYSCSS = " << cf << "P12" << "\n"; + else + ofs << "YUV4MPEG2 W" << width << " H" << height << " F" << fpsNum << ":" << fpsDenom << " Ip" << " C" << cf << "\n"; + header = ofs.tellp(); } @@ -58,52 +66,81 @@ bool Y4MOutput::writePicture(const x265_picture& pic) { std::ofstream::pos_type outPicPos = header; - outPicPos += (uint64_t)pic.poc * (6 + frameSize); + if (pic.bitDepth > 8) + outPicPos += (uint64_t)(pic.poc * (6 + frameSize * 2)); + else + outPicPos += (uint64_t)pic.poc * (6 + frameSize); ofs.seekp(outPicPos); ofs << "FRAME\n"; -#if HIGH_BIT_DEPTH - if (pic.bitDepth > 8 && pic.poc == 0) - x265_log(NULL, X265_LOG_WARNING, "y4m: down-shifting reconstructed pixels to 8 bits\n"); -#else - if (pic.bitDepth > 8 && pic.poc == 0) - x265_log(NULL, X265_LOG_WARNING, "y4m: forcing reconstructed pixels to 8 bits\n"); -#endif + if (inputDepth > 8) + { + if (pic.bitDepth == 8 && pic.poc == 0) + x265_log(NULL, X265_LOG_WARNING, "y4m: down-shifting reconstructed pixels to 8 bits\n"); + } X265_CHECK(pic.colorSpace == colorSpace, "invalid chroma subsampling\n"); -#if HIGH_BIT_DEPTH - - // encoder gave us short pixels, downshift, then write - X265_CHECK(pic.bitDepth > 8, "invalid bit depth\n"); - int shift = pic.bitDepth - 8; - for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++) + if (inputDepth > 8)//if HIGH_BIT_DEPTH { - uint16_t *src = (uint16_t*)pic.planesi; - for (int h = 0; h < height >> x265_cli_cspscolorSpace.heighti; h++) + if (pic.bitDepth == 8) { - for (int w = 0; w < width >> x265_cli_cspscolorSpace.widthi; w++) - bufw = (char)(srcw >> shift); - - ofs.write(buf, width >> x265_cli_cspscolorSpace.widthi); - src += pic.stridei / sizeof(*src); + // encoder gave us short pixels, downshift, then write + X265_CHECK(pic.bitDepth == 8, "invalid bit depth\n"); + int shift = pic.bitDepth - 8; + for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++) + { + char *src = (char*)pic.planesi; + for (int h = 0; h < height >> x265_cli_cspscolorSpace.heighti; h++) + { + for (int w = 0; w < width >> x265_cli_cspscolorSpace.widthi; w++) + bufw = (char)(srcw >> shift); + + ofs.write(buf, width >> x265_cli_cspscolorSpace.widthi); + src += pic.stridei / sizeof(*src); + } + } + } + else + { + X265_CHECK(pic.bitDepth > 8, "invalid bit depth\n"); + for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++) + { + uint16_t *src = (uint16_t*)pic.planesi; + for (int h = 0; h < (height * 1) >> x265_cli_cspscolorSpace.heighti; h++) + { + ofs.write((const char*)src, (width * 2) >> x265_cli_cspscolorSpace.widthi); + src += pic.stridei / sizeof(*src); + } + } } } - -#else // if 
HIGH_BIT_DEPTH - - X265_CHECK(pic.bitDepth == 8, "invalid bit depth\n"); - for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++) + else if (inputDepth == 8 && pic.bitDepth > 8) { - char *src = (char*)pic.planesi; - for (int h = 0; h < height >> x265_cli_cspscolorSpace.heighti; h++) + X265_CHECK(pic.bitDepth > 8, "invalid bit depth\n"); + for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++) { - ofs.write(src, width >> x265_cli_cspscolorSpace.widthi); - src += pic.stridei / sizeof(*src); + uint16_t* src = (uint16_t*)pic.planesi; + for (int h = 0; h < (height * 1) >> x265_cli_cspscolorSpace.heighti; h++) + { + ofs.write((const char*)src, (width * 2) >> x265_cli_cspscolorSpace.widthi); + src += pic.stridei / sizeof(*src); + } + } + } + else + { + X265_CHECK(pic.bitDepth == 8, "invalid bit depth\n"); + for (int i = 0; i < x265_cli_cspscolorSpace.planes; i++) + { + char *src = (char*)pic.planesi; + for (int h = 0; h < height >> x265_cli_cspscolorSpace.heighti; h++) + { + ofs.write(src, width >> x265_cli_cspscolorSpace.widthi); + src += pic.stridei / sizeof(*src); + } } } - -#endif // if HIGH_BIT_DEPTH return true; }
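With the depth-aware header above, a high-bit-depth reconstruction is no longer tagged as plain 8-bit video. For example (dimensions and frame rate purely illustrative), a 1920x1080, 25 fps, 4:2:0, 10-bit recon file now begins with the stream header

YUV4MPEG2 W1920 H1080 F25:1 Ip C420p10 XYSCSS = 420P10

exactly as the stream insertions above spell it, and each picture follows a FRAME marker at two bytes per sample, which is why the per-picture seek offset becomes 6 + frameSize * 2. An 8-bit reconstruction keeps the old C420-style header.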
View file
x265_3.5.tar.gz/source/output/y4m.h -> x265_3.6.tar.gz/source/output/y4m.h
Changed
@@ -38,10 +38,14 @@
 
     int height;
 
+    uint32_t bitDepth;
+
     int colorSpace;
 
     uint32_t frameSize;
 
+    int inputDepth;
+
     std::ofstream ofs;
 
     std::ofstream::pos_type header;
@@ -52,7 +56,7 @@
 
 public:
 
-    Y4MOutput(const char *filename, int width, int height, uint32_t fpsNum, uint32_t fpsDenom, int csp);
+    Y4MOutput(const char *filename, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp, int inputDepth);
 
     virtual ~Y4MOutput();
View file
x265_3.5.tar.gz/source/output/yuv.cpp -> x265_3.6.tar.gz/source/output/yuv.cpp
Changed
@@ -28,12 +28,13 @@
 using namespace X265_NS;
 using namespace std;
 
-YUVOutput::YUVOutput(const char *filename, int w, int h, uint32_t d, int csp)
+YUVOutput::YUVOutput(const char *filename, int w, int h, uint32_t d, int csp, int inputdepth)
     : width(w)
     , height(h)
     , depth(d)
     , colorSpace(csp)
     , frameSize(0)
+    , inputDepth(inputdepth)
 {
     ofs.open(filename, ios::binary | ios::out);
     buf = new char[width];
@@ -56,50 +57,52 @@
     X265_CHECK(pic.colorSpace == colorSpace, "invalid chroma subsampling\n");
     X265_CHECK(pic.bitDepth == (int)depth, "invalid bit depth\n");
 
-#if HIGH_BIT_DEPTH
-    if (depth == 8)
+    if (inputDepth > 8)
     {
-        int shift = pic.bitDepth - 8;
-        ofs.seekp((std::streamoff)fileOffset);
-        for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
-        {
-            uint16_t *src = (uint16_t*)pic.planes[i];
-            for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
-            {
-                for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++)
-                    buf[w] = (char)(src[w] >> shift);
+        if (depth == 8)
+        {
+            int shift = pic.bitDepth - 8;
+            ofs.seekp((std::streamoff)fileOffset);
+            for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+            {
+                uint16_t *src = (uint16_t*)pic.planes[i];
+                for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+                {
+                    for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++)
+                        buf[w] = (char)(src[w] >> shift);
 
-                ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]);
-                src += pic.stride[i] / sizeof(*src);
-            }
-        }
+                    ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]);
+                    src += pic.stride[i] / sizeof(*src);
+                }
+            }
+        }
+        else
+        {
+            ofs.seekp((std::streamoff)(fileOffset * 2));
+            for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+            {
+                uint16_t *src = (uint16_t*)pic.planes[i];
+                for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+                {
+                    ofs.write((const char*)src, (width * 2) >> x265_cli_csps[colorSpace].width[i]);
+                    src += pic.stride[i] / sizeof(*src);
+                }
+            }
+        }
     }
     else
     {
-        ofs.seekp((std::streamoff)(fileOffset * 2));
-        for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
-        {
-            uint16_t *src = (uint16_t*)pic.planes[i];
-            for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
-            {
-                ofs.write((const char*)src, (width * 2) >> x265_cli_csps[colorSpace].width[i]);
-                src += pic.stride[i] / sizeof(*src);
-            }
-        }
+        ofs.seekp((std::streamoff)fileOffset);
+        for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+        {
+            char *src = (char*)pic.planes[i];
+            for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+            {
+                ofs.write(src, width >> x265_cli_csps[colorSpace].width[i]);
+                src += pic.stride[i] / sizeof(*src);
+            }
+        }
     }
-#else // if HIGH_BIT_DEPTH
-    ofs.seekp((std::streamoff)fileOffset);
-    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
-    {
-        char *src = (char*)pic.planes[i];
-        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
-        {
-            ofs.write(src, width >> x265_cli_csps[colorSpace].width[i]);
-            src += pic.stride[i] / sizeof(*src);
-        }
-    }
-
-#endif // if HIGH_BIT_DEPTH
     return true;
 }
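A quick worked example of the branches above: when the input is deeper than 8 bits but the file is written at depth == 8, each 16-bit sample is right-shifted by pic.bitDepth - 8 before being stored, so in a 10-bit pipeline the shift is 10 - 8 = 2 and a sample value of 812 is written as 812 >> 2 = 203 (full scale 1023 becomes 255). In the 16-bit branch the samples go out verbatim at two bytes each, which is why the seek offset doubles to fileOffset * 2.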
View file
x265_3.5.tar.gz/source/output/yuv.h -> x265_3.6.tar.gz/source/output/yuv.h
Changed
@@ -46,13 +46,15 @@
 
     uint32_t frameSize;
 
+    int inputDepth;
+
     char *buf;
 
     std::ofstream ofs;
 
 public:
 
-    YUVOutput(const char *filename, int width, int height, uint32_t bitdepth, int csp);
+    YUVOutput(const char *filename, int width, int height, uint32_t bitdepth, int csp, int inputDepth);
 
     virtual ~YUVOutput();
View file
x265_3.5.tar.gz/source/test/CMakeLists.txt -> x265_3.6.tar.gz/source/test/CMakeLists.txt
Changed
@@ -23,15 +23,13 @@
 
 # add ARM assembly files
 if(ARM OR CROSS_COMPILE_ARM)
-    if(NOT ARM64)
-        enable_language(ASM)
-        set(NASM_SRC checkasm-arm.S)
-        add_custom_command(
-            OUTPUT checkasm-arm.obj
-            COMMAND ${CMAKE_CXX_COMPILER}
-            ARGS ${NASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj
-            DEPENDS checkasm-arm.S)
-    endif()
+    enable_language(ASM)
+    set(NASM_SRC checkasm-arm.S)
+    add_custom_command(
+        OUTPUT checkasm-arm.obj
+        COMMAND ${CMAKE_CXX_COMPILER}
+        ARGS ${NASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj
+        DEPENDS checkasm-arm.S)
 endif(ARM OR CROSS_COMPILE_ARM)
 
 # add PowerPC assembly files
View file
x265_3.5.tar.gz/source/test/pixelharness.cpp -> x265_3.6.tar.gz/source/test/pixelharness.cpp
Changed
@@ -406,6 +406,32 @@
     return true;
 }
 
+bool PixelHarness::check_downscaleluma_t(downscaleluma_t ref, downscaleluma_t opt)
+{
+    ALIGN_VAR_16(pixel, ref_destf[32 * 32]);
+    ALIGN_VAR_16(pixel, opt_destf[32 * 32]);
+
+    intptr_t src_stride = 64;
+    intptr_t dst_stride = 32;
+    int bx = 32;
+    int by = 32;
+    int j = 0;
+    for (int i = 0; i < ITERS; i++)
+    {
+        int index = i % TEST_CASES;
+        ref(pixel_test_buff[index] + j, ref_destf, src_stride, dst_stride, bx, by);
+        checked(opt, pixel_test_buff[index] + j, opt_destf, src_stride, dst_stride, bx, by);
+
+        if (memcmp(ref_destf, opt_destf, 32 * 32 * sizeof(pixel)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
 bool PixelHarness::check_cpy2Dto1D_shl_t(cpy2Dto1D_shl_t ref, cpy2Dto1D_shl_t opt)
 {
     ALIGN_VAR_16(int16_t, ref_dest[64 * 64]);
@@ -2793,6 +2819,15 @@
         }
     }
 
+    if (opt.frameSubSampleLuma)
+    {
+        if (!check_downscaleluma_t(ref.frameSubSampleLuma, opt.frameSubSampleLuma))
+        {
+            printf("SubSample Luma failed!\n");
+            return false;
+        }
+    }
+
     if (opt.scale1D_128to64NONALIGNED)
     {
         if (!check_scale1D_pp(ref.scale1D_128to64NONALIGNED, opt.scale1D_128to64NONALIGNED))
@@ -3492,6 +3527,12 @@
         REPORT_SPEEDUP(opt.frameInitLowres, ref.frameInitLowres, pbuf2, pbuf1, pbuf2, pbuf3, pbuf4, 64, 64, 64, 64);
     }
 
+    if (opt.frameSubSampleLuma)
+    {
+        HEADER0("downscaleluma");
+        REPORT_SPEEDUP(opt.frameSubSampleLuma, ref.frameSubSampleLuma, pbuf2, pbuf1, 64, 64, 64, 64);
+    }
+
     if (opt.scale1D_128to64NONALIGNED)
     {
         HEADER0("scale1D_128to64");
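For context, a plausible C reference for the frameSubSampleLuma (downscaleluma_t) primitive that check_downscaleluma_t() exercises: a 2x2 box average with rounding that halves the luma plane in both dimensions. The signature follows the test's call order (src, dst, srcStride, dstStride, width, height); this is only a sketch, not the upstream C or assembly implementation, which may differ in detail (for example in whether width/height describe the source or the destination plane).

#include <cstdint>

typedef uint8_t pixel; // 8-bit build; the real typedef depends on HIGH_BIT_DEPTH

// Sketch: average each 2x2 source block into one destination sample.
static void frameSubSampleLumaRef(const pixel* src, pixel* dst,
                                  intptr_t srcStride, intptr_t dstStride,
                                  int width, int height)
{
    for (int y = 0; y < height; y++, src += 2 * srcStride, dst += dstStride)
    {
        const pixel* above = src;
        const pixel* below = src + srcStride;
        for (int x = 0; x < width; x++, above += 2, below += 2)
            dst[x] = (pixel)((above[0] + above[1] + below[0] + below[1] + 2) >> 2);
    }
}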
View file
x265_3.5.tar.gz/source/test/pixelharness.h -> x265_3.6.tar.gz/source/test/pixelharness.h
Changed
@@ -138,6 +138,7 @@
     bool check_integral_inith(integralh_t ref, integralh_t opt);
     bool check_ssimDist(ssimDistortion_t ref, ssimDistortion_t opt);
     bool check_normFact(normFactor_t ref, normFactor_t opt, int block);
+    bool check_downscaleluma_t(downscaleluma_t ref, downscaleluma_t opt);
 
 public:
View file
x265_3.5.tar.gz/source/test/rate-control-tests.txt -> x265_3.6.tar.gz/source/test/rate-control-tests.txt
Changed
@@ -15,7 +15,7 @@
 112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
 Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
 Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
-News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers 3
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000 --tune grain
 big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
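The edited line appears alongside the temporal-layer rework in 3.6: --temporal-layers now carries an explicit layer count rather than acting as a bare flag, so the test invocations are updated to forms such as --temporal-layers 3 here and --temporal-layers 2 in the regression tests below.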
View file
x265_3.5.tar.gz/source/test/regression-tests.txt -> x265_3.6.tar.gz/source/test/regression-tests.txt
Changed
@@ -18,12 +18,12 @@ BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190 --slices 3 BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless --tu-inter-depth 3 --limit-tu 1 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao -BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --bitrate 7000 --limit-modes::--preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --bitrate 7000 --limit-modes +BasketballDrive_1920x1080_50.y4m,--preset medium --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --bitrate 7000 --limit-modes::--preset medium --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --bitrate 7000 --limit-modes BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4 -BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --bitrate 7000 --limit-tu 0 +BasketballDrive_1920x1080_50.y4m,--preset slower --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --bitrate 7000 --limit-tu 0 BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3 -BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2 +BasketballDrive_1920x1080_50.y4m,--preset veryslow --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2::--preset veryslow --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --crf 18 --tskip-fast --limit-tu 2 BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit" Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit" Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop @@ -33,7 +33,7 @@ Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao -CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain +CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers 2 --tune grain CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --max-tu-size 4 --min-cu-size 32 CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing --limit-refs 1 @@ -41,7 +41,7 @@ CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode --limit-refs 2 CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-open-gop CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither 
--no-psy-rd -CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers --limit-refs 2 +CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers 2 --repeat-headers --limit-refs 2 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1 --limit-modes CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut --limit-tu 1 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --aq-mode 3 --aq-strength 1.5 --aq-motion --bitrate 5000 @@ -49,11 +49,11 @@ CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --hevc-aq --no-cutree --qg-size 16 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16 --limit-modes -DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless +DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers 2 --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3 --tu-inter-depth 4 --limit-tu 3 -DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1::--preset fast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1 +DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1::--preset fast --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1 FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2 FourPeople_1280x720_60.y4m,--preset veryfast --aq-mode 2 --aq-strength 1.5 --qg-size 8 FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd @@ -158,13 +158,10 @@ ducks_take_off_420_1_720p50.y4m,--preset medium --selective-sao 4 --sao --crf 20 Traffic_4096x2048_30p.y4m, --preset medium --frame-dup --dup-threshold 60 --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000 Kimono1_1920x1080_24_400.yuv,--preset superfast --qp 28 --zones 0,139,q=32 -sintel_trailer_2k_1920x1080_24.yuv, --preset medium --hist-scenecut --hist-threshold 0.02 --frame-dup --dup-threshold 60 --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000 -sintel_trailer_2k_1920x1080_24.yuv, --preset medium --hist-scenecut --hist-threshold 0.02 -sintel_trailer_2k_1920x1080_24.yuv, --preset ultrafast --hist-scenecut --hist-threshold 0.02 crowd_run_1920x1080_50.yuv, --preset faster --ctu 32 --rskip 2 --rskip-edge-threshold 5 crowd_run_1920x1080_50.yuv, --preset fast --ctu 64 --rskip 2 --rskip-edge-threshold 5 --aq-mode 4 -crowd_run_1920x1080_50.yuv, --preset slow --ctu 32 --rskip 2 --rskip-edge-threshold 5 --hist-scenecut --hist-threshold 0.1 -crowd_run_1920x1080_50.yuv, --preset slower --ctu 16 --rskip 2 --rskip-edge-threshold 5 --hist-scenecut --hist-threshold 0.1 --aq-mode 4 +crowd_run_1920x1080_50.yuv, --preset ultrafast --video-signal-type-preset BT2100_PQ_YCC:BT2100x108n0005 +crowd_run_1920x1080_50.yuv, --preset 
ultrafast --eob --eos # Main12 intraCost overflow bug test 720p50_parkrun_ter.y4m,--preset medium @@ -182,14 +179,22 @@ #scaled save/load test crowd_run_1080p50.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_2160p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000 -crowd_run_1080p50.y4m,--preset superfast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 -crowd_run_1080p50.y4m,--preset fast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --scale-factor 2 --qp 18 +crowd_run_1080p50.y4m,--preset superfast --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 +crowd_run_1080p50.y4m,--preset fast --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --scale-factor 2 --qp 18 crowd_run_1080p50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3 -RaceHorses_416x240_30.y4m,--preset slow --no-cutree --ctu 16 --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --crf 22 --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --no-cutree --ctu 32 --analysis-load x265_analysis.dat --analysis-save x265_analysis_2.dat --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,--preset slow --no-cutree --ctu 64 --analysis-load x265_analysis_2.dat --analysis-load-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2 +RaceHorses_416x240_30.y4m,--preset slow --ctu 16 --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --crf 22 --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --ctu 32 --analysis-load x265_analysis.dat --analysis-save x265_analysis_2.dat --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 
--scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,--preset slow --ctu 64 --analysis-load x265_analysis_2.dat --analysis-load-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2 ElFunete_960x540_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-save-reuse-level 10 --analysis-save elfuente_960x540.dat --scale-factor 2::ElFunete_1920x1080_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --analysis-save elfuente_1920x1080.dat --limit-tu 0 --scale-factor 2 --analysis-load elfuente_960x540.dat --refine-intra 4 --refine-inter 2::ElFuente_3840x2160_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune=psnr --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 --analysis-load-reuse-level 10 --limit-tu 0 --scale-factor 2 --analysis-load elfuente_1920x1080.dat --refine-intra 4 --refine-inter 2 #save/load with ctu distortion refinement CrowdRun_1920x1080_50_10bit_422.yuv,--no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --refine-ctu-distortion 1 --bitrate 7000::--no-cutree --analysis-load x265_analysis.dat --refine-ctu-distortion 1 --bitrate 7000 --analysis-load-reuse-level 5 #segment encoding BasketballDrive_1920x1080_50.y4m, --preset ultrafast --no-open-gop --chunk-start 100 --chunk-end 200 +#Test FG SEI message addition +#OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune grain --film-grain "OldTownCross_1920x1080_50_10bit_422.bin" +#RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --signhide --colormatrix bt709 --film-grain "RaceHorses_416x240_30_10bit.bin" + +#Temporal layers tests +ducks_take_off_420_720p50.y4m,--preset slow --temporal-layers 3 --b-adapt 0 +parkrun_ter_720p50.y4m,--preset medium --temporal-layers 4 --b-adapt 0 +BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --temporal-layers 5 --b-adapt 0 # vim: tw=200
View file
x265_3.5.tar.gz/source/test/save-load-tests.txt -> x265_3.6.tar.gz/source/test/save-load-tests.txt
Changed
@@ -12,10 +12,10 @@ # not auto-detected. crowd_run_1080p50.y4m, --preset ultrafast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_2160p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000 crowd_run_540p50.y4m, --preset ultrafast --no-cutree --analysis-save x265_analysis.dat --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_1080p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000 -crowd_run_1080p50.y4m, --preset superfast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 -crowd_run_1080p50.y4m, --preset fast --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --scale-factor 2 --qp 18 -crowd_run_1080p50.y4m, --preset medium --no-cutree --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3 +crowd_run_1080p50.y4m, --preset superfast --analysis-save x265_analysis.dat --analysis-save-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --analysis-load x265_analysis.dat --analysis-load-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 +crowd_run_1080p50.y4m, --preset fast --analysis-save x265_analysis.dat --analysis-save-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --analysis-load x265_analysis.dat --analysis-load-reuse-level 5 --scale-factor 2 --qp 18 +crowd_run_1080p50.y4m, --preset medium --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m, --preset medium --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m, --preset medium --analysis-load x265_analysis.dat --analysis-load-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3 RaceHorses_416x240_30.y4m, --preset slow --no-cutree --ctu 16 --analysis-save x265_analysis.dat --analysis-save-reuse-level 10 --scale-factor 2 --crf 22 --vbv-maxrate 
1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --no-cutree --ctu 32 --analysis-load x265_analysis.dat --analysis-save x265_analysis_2.dat --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m, --preset slow --no-cutree --ctu 64 --analysis-load x265_analysis_2.dat --analysis-load-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 --vbv-bufsize 7000 --refine-intra 2 --refine-inter 2 -crowd_run_540p50.y4m, --preset veryslow --no-cutree --analysis-save x265_analysis_540.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-bufsize 15000 --vbv-maxrate 9000::crowd_run_1080p50.y4m, --preset veryslow --no-cutree --analysis-save x265_analysis_1080.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_1080p50.y4m, --preset veryslow --no-cutree --analysis-save x265_analysis_1080.dat --analysis-load x265_analysis_540.dat --refine-intra 4 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_2160p50.y4m, --preset veryslow --no-cutree --analysis-save x265_analysis_2160.dat --analysis-load x265_analysis_1080.dat --refine-intra 3 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000::crowd_run_2160p50.y4m, --preset veryslow --no-cutree --analysis-load x265_analysis_2160.dat --refine-intra 2 --dynamic-refine --analysis-load-reuse-level 10 --scale-factor 1 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 +crowd_run_540p50.y4m, --preset veryslow --analysis-save x265_analysis_540.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-bufsize 15000 --vbv-maxrate 9000::crowd_run_1080p50.y4m, --preset veryslow --analysis-save x265_analysis_1080.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_1080p50.y4m, --preset veryslow --analysis-save x265_analysis_1080.dat --analysis-load x265_analysis_540.dat --refine-intra 4 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_2160p50.y4m, --preset veryslow --analysis-save x265_analysis_2160.dat --analysis-load x265_analysis_1080.dat --refine-intra 3 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000::crowd_run_2160p50.y4m, --preset veryslow --analysis-load x265_analysis_2160.dat --refine-intra 2 --dynamic-refine --analysis-load-reuse-level 10 --scale-factor 1 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 crowd_run_540p50.y4m, --preset medium --no-cutree --analysis-save x265_analysis_540.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-bufsize 15000 --vbv-maxrate 9000::crowd_run_1080p50.y4m, --preset medium --no-cutree --analysis-save x265_analysis_1080.dat --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_1080p50.y4m, --preset medium --no-cutree --analysis-save x265_analysis_1080.dat --analysis-load x265_analysis_540.dat --refine-intra 4 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 
2 --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-save x265_analysis_2160.dat --analysis-load x265_analysis_1080.dat --refine-intra 3 --dynamic-refine --analysis-load-reuse-level 10 --analysis-save-reuse-level 10 --scale-factor 2 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis_2160.dat --refine-intra 2 --dynamic-refine --analysis-load-reuse-level 10 --scale-factor 1 --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 News-4k.y4m, --preset medium --analysis-save x265_analysis_fdup.dat --frame-dup --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000::News-4k.y4m, --analysis-load x265_analysis_fdup.dat --frame-dup --hrd --bitrate 10000 --vbv-bufsize 15000 --vbv-maxrate 12000
View file
x265_3.5.tar.gz/source/test/smoke-tests.txt -> x265_3.6.tar.gz/source/test/smoke-tests.txt
Changed
@@ -23,3 +23,7 @@
 # Main12 intraCost overflow bug test
 720p50_parkrun_ter.y4m,--preset medium
 720p50_parkrun_ter.y4m,--preset=fast --hevc-aq --no-cutree
+# Test FG SEI message addition
+# CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --weightp --keyint -1 --film-grain "CrowdRun_1920x1080_50_10bit_444.bin"
+# DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16 --film-grain "DucksAndLegs_1920x1080_60_10bit_422.bin"
+# NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset=superfast --bitrate 10000 --sao --limit-sao --cll --max-cll "1000,400" --film-grain "NebutaFestival_2560x1600_60_10bit_crop.bin"
View file
x265_3.5.tar.gz/source/test/testbench.cpp -> x265_3.6.tar.gz/source/test/testbench.cpp
Changed
@@ -174,6 +174,8 @@
     { "AVX512", X265_CPU_AVX512 },
     { "ARMv6", X265_CPU_ARMV6 },
     { "NEON", X265_CPU_NEON },
+    { "SVE2", X265_CPU_SVE2 },
+    { "SVE", X265_CPU_SVE },
     { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
     { "", 0 },
 };
@@ -208,15 +210,8 @@
         EncoderPrimitives asmprim;
         memset(&asmprim, 0, sizeof(asmprim));
-        setupAssemblyPrimitives(asmprim, test_arch[i].flag);
-
-#if X265_ARCH_ARM64
-        /* Temporary workaround because luma_vsp assembly primitive has not been completed
-         * but interp_8tap_hv_pp_cpu uses mixed C primitive and assembly primitive.
-         * Otherwise, segment fault occurs. */
-        setupAliasCPrimitives(cprim, asmprim, test_arch[i].flag);
-#endif
+        setupAssemblyPrimitives(asmprim, test_arch[i].flag);

         setupAliasPrimitives(asmprim);
         memcpy(&primitives, &asmprim, sizeof(EncoderPrimitives));
         for (size_t h = 0; h < sizeof(harness) / sizeof(TestHarness*); h++)
@@ -239,14 +234,8 @@
 #if X265_ARCH_X86
     setupInstrinsicPrimitives(optprim, cpuid);
 #endif
-    setupAssemblyPrimitives(optprim, cpuid);
-
-#if X265_ARCH_ARM64
-    /* Temporary workaround because luma_vsp assembly primitive has not been completed
-     * but interp_8tap_hv_pp_cpu uses mixed C primitive and assembly primitive.
-     * Otherwise, segment fault occurs. */
-    setupAliasCPrimitives(cprim, optprim, cpuid);
-#endif
+    setupAssemblyPrimitives(optprim, cpuid);

     /* Note that we do not setup aliases for performance tests, that would be
      * redundant. The testbench only verifies they are correctly aliased */
View file
x265_3.5.tar.gz/source/test/testharness.h -> x265_3.6.tar.gz/source/test/testharness.h
Changed
@@ -73,7 +73,7 @@
 #include <x86intrin.h>
 #elif ( !defined(__APPLE__) && defined (__GNUC__) && defined(__ARM_NEON__))
 #include <arm_neon.h>
-#elif defined(__GNUC__) && (!defined(__clang__) || __clang_major__ < 4)
+#else
 /* fallback for older GCC/MinGW */
 static inline uint32_t __rdtsc(void)
 {
@@ -82,15 +82,13 @@
 #if X265_ARCH_X86
     asm volatile("rdtsc" : "=a" (a) ::"edx");
 #elif X265_ARCH_ARM
-#if X265_ARCH_ARM64
-    asm volatile("mrs %0, cntvct_el0" : "=r"(a));
-#else
     // TOD-DO: verify following inline asm to get cpu Timestamp Counter for ARM arch
     // asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(a));

     // TO-DO: replace clock() function with appropriate ARM cpu instructions
     a = clock();
-#endif
+#elif X265_ARCH_ARM64
+    asm volatile("mrs %0, cntvct_el0" : "=r"(a));
 #endif
     return a;
 }
@@ -128,8 +126,8 @@
         x265_emms(); \
         float optperf = (10.0f * cycles / runs) / 4; \
         float refperf = (10.0f * refcycles / refruns) / 4; \
-        printf("\t%3.2fx ", refperf / optperf); \
-        printf("\t %-8.2lf \t %-8.2lf\n", optperf, refperf); \
+        printf(" | \t%3.2fx | ", refperf / optperf); \
+        printf("\t %-8.2lf | \t %-8.2lf\n", optperf, refperf); \
     }

 extern "C" {
@@ -140,7 +138,7 @@
  * needs an explicit asm check because it only sometimes crashes in normal use. */
 intptr_t PFX(checkasm_call)(intptr_t (*func)(), int *ok, ...);
 float PFX(checkasm_call_float)(float (*func)(), int *ok, ...);
-#elif X265_ARCH_ARM == 0
+#elif (X265_ARCH_ARM == 0 && X265_ARCH_ARM64 == 0)
 #define PFX(stack_pagealign)(func, align) func()
 #endif
View file
x265_3.5.tar.gz/source/x265.cpp -> x265_3.6.tar.gz/source/x265.cpp
Changed
@@ -296,6 +296,16 @@
     int ret = 0;

+    if (cliopt[0].scenecutAwareQpConfig)
+    {
+        if (!cliopt[0].parseScenecutAwareQpConfig())
+        {
+            x265_log(NULL, X265_LOG_ERROR, "Unable to parse scenecut aware qp config file \n");
+            fclose(cliopt[0].scenecutAwareQpConfig);
+            cliopt[0].scenecutAwareQpConfig = NULL;
+        }
+    }
+
     AbrEncoder* abrEnc = new AbrEncoder(cliopt, numEncodes, ret);
     int threadsActive = abrEnc->m_numActiveEncodes.get();
     while (threadsActive)
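The hunk above wires the new --scenecut-qp-config file into the CLI; the option names found in that file are ultimately handed to the x265_scenecut_aware_qp_param_parse() entry point that this release adds to the public API (see the x265.h diff below). The following is only a rough sketch of driving that entry point directly from an application; the option names and the comma-separated masking-strength value are assumptions taken from the CLI help in this revision, not a documented calling convention.

#include "x265.h"

/* Hypothetical sketch: apply scenecut-aware QP settings through the new public
 * entry point instead of a --scenecut-qp-config file. Option names and the
 * masking-strength value format below are assumptions, shown for illustration. */
static int configure_scenecut_qp(x265_param* param)
{
    /* 1 = forward masking only (per the CLI help above) */
    if (x265_scenecut_aware_qp_param_parse(param, "scenecut-aware-qp", "1"))
        return -1;
    /* illustrative window duration (ms) and QP offsets */
    if (x265_scenecut_aware_qp_param_parse(param, "masking-strength", "500,5,4"))
        return -1;
    return 0;
}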
View file
x265_3.5.tar.gz/source/x265.h -> x265_3.6.tar.gz/source/x265.h
Changed
@@ -26,6 +26,7 @@ #define X265_H #include <stdint.h> #include <stdio.h> +#include <sys/stat.h> #include "x265_config.h" #ifdef __cplusplus extern "C" { @@ -59,7 +60,7 @@ NAL_UNIT_CODED_SLICE_TRAIL_N = 0, NAL_UNIT_CODED_SLICE_TRAIL_R, NAL_UNIT_CODED_SLICE_TSA_N, - NAL_UNIT_CODED_SLICE_TLA_R, + NAL_UNIT_CODED_SLICE_TSA_R, NAL_UNIT_CODED_SLICE_STSA_N, NAL_UNIT_CODED_SLICE_STSA_R, NAL_UNIT_CODED_SLICE_RADL_N, @@ -311,6 +312,7 @@ double vmafFrameScore; double bufferFillFinal; double unclippedBufferFillFinal; + uint8_t tLayer; } x265_frame_stats; typedef struct x265_ctu_info_t @@ -536,6 +538,8 @@ /* ARM */ #define X265_CPU_ARMV6 0x0000001 #define X265_CPU_NEON 0x0000002 /* ARM NEON */ +#define X265_CPU_SVE2 0x0000008 /* ARM SVE2 */ +#define X265_CPU_SVE 0x0000010 /* ARM SVE2 */ #define X265_CPU_FAST_NEON_MRC 0x0000004 /* Transfer from NEON to ARM register is fast (Cortex-A9) */ /* IBM Power8 */ @@ -613,6 +617,13 @@ #define SLICE_TYPE_DELTA 0.3 /* The offset decremented or incremented for P-frames or b-frames respectively*/ #define BACKWARD_WINDOW 1 /* Scenecut window before a scenecut */ #define FORWARD_WINDOW 2 /* Scenecut window after a scenecut */ +#define BWD_WINDOW_DELTA 0.4 + +#define X265_MAX_GOP_CONFIG 3 +#define X265_MAX_GOP_LENGTH 16 +#define MAX_T_LAYERS 7 + +#define X265_IPRATIO_STRENGTH 1.43 typedef struct x265_cli_csp { @@ -696,6 +707,7 @@ typedef struct x265_zone { int startFrame, endFrame; /* range of frame numbers */ + int keyframeMax; /* it store the default/user defined keyframeMax value*/ int bForceQp; /* whether to use qp vs bitrate factor */ int qp; float bitrateFactor; @@ -747,6 +759,271 @@ static const x265_vmaf_commondata vcd = { { NULL, (char *)"/usr/local/share/model/vmaf_v0.6.1.pkl", NULL, NULL, 0, 0, 0, 0, 0, 0, 0, NULL, 0, 1, 0 } }; +typedef struct x265_temporal_layer { + int poc_offset; /* POC offset */ + int8_t layer; /* Current layer */ + int8_t qp_offset; /* QP offset */ +} x265_temporal_layer; + +static const int8_t x265_temporal_layer_bframesMAX_T_LAYERS = {-1, -1, 3, 7, 15, -1, -1}; + +static const int8_t x265_gop_ra_lengthX265_MAX_GOP_CONFIG = { 4, 8, 16}; +static const x265_temporal_layer x265_gop_raX265_MAX_GOP_CONFIGX265_MAX_GOP_LENGTH = { + { + { + 4, + 0, + 1, + }, + { + 2, + 1, + 5, + }, + { + 1, + 2, + 3, + }, + { + 3, + 2, + 5, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + } + }, + + { + { + 8, + 0, + 1, + }, + { + 4, + 1, + 5, + }, + { + 2, + 2, + 4, + }, + { + 1, + 3, + 5, + }, + { + 3, + 3, + 2, + }, + { + 6, + 2, + 5, + }, + { + 5, + 3, + 4, + }, + { + 7, + 3, + 5, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + { + -1, + -1, + -1, + }, + }, + { + { + 16, + 0, + 1, + }, + { + 8, + 1, + 6, + }, + { + 4, + 2, + 5, + }, + { + 2, + 3, + 6, + }, + { + 1, + 4, + 4, + }, + { + 3, + 4, + 6, + }, + { + 6, + 3, + 5, + }, + { + 5, + 4, + 6, + }, + { + 7, + 4, + 1, + }, + { + 12, + 2, + 6, + }, + { + 10, + 3, + 5, + }, + { + 9, + 4, + 6, + }, + { + 11, + 4, + 4, + }, + { + 14, + 3, + 6, + }, + { + 13, + 4, + 5, + }, + { + 15, + 4, + 6, + } + } +}; + +typedef enum +{ + X265_SHARE_MODE_FILE = 0, + X265_SHARE_MODE_SHAREDMEM 
+}X265_DATA_SHARE_MODES; + /* x265 input parameters * * For version safety you may use x265_param_alloc/free() to manage the @@ -983,6 +1260,9 @@ * performance impact, but the use case may preclude it. Default true */ int bOpenGOP; + /*Force nal type to CRA to all frames expect first frame. Default disabled*/ + int craNal; + /* Scene cuts closer together than this are coded as I, not IDR. */ int keyframeMin; @@ -1433,10 +1713,10 @@ double rfConstantMin; /* Multi-pass encoding */ - /* Enable writing the stats in a multi-pass encode to the stat output file */ + /* Enable writing the stats in a multi-pass encode to the stat output file/memory */ int bStatWrite; - /* Enable loading data from the stat input file in a multi pass encode */ + /* Enable loading data from the stat input file/memory in a multi pass encode */ int bStatRead; /* Filename of the 2pass output/input stats file, if unspecified the @@ -1489,6 +1769,21 @@ /* internally enable if tune grain is set */ int bEnableConstVbv; + /* if only the focused frames would be re-encode or not */ + int bEncFocusedFramesOnly; + + /* Share the data with stats file or shared memory. + It must be one of the X265_DATA_SHARE_MODES enum values + Available if the bStatWrite or bStatRead is true. + Use stats file by default. + The stats file mode would be used among the encoders running in sequence. + The shared memory mode could only be used among the encoders running in parallel. + Now only the cutree data could be shared among shared memory. More data would be support in the future.*/ + int dataShareMode; + + /* Unique shared memory name. Required if the shared memory mode enabled. NULL by default */ + const char* sharedMemName; + } rc; /*== Video Usability Information ==*/ @@ -1850,6 +2145,10 @@ Default 1 (Enabled). API only. */ int bResetZoneConfig; + /*Flag to indicate rate-control history has not to be reset during zone reconfiguration. + Default 0 (Disabled) */ + int bNoResetZoneConfig; + /* It reduces the bits spent on the inter-frames within the scenecutWindow before and / or after a scenecut * by increasing their QP in ratecontrol pass2 algorithm without any deterioration in visual quality. * 0 - Disabled (default). @@ -1860,20 +2159,15 @@ /* The duration(in milliseconds) for which there is a reduction in the bits spent on the inter-frames after a scenecut * by increasing their QP, when bEnableSceneCutAwareQp is 1 or 3. Default is 500ms.*/ - int fwdScenecutWindow; + int fwdMaxScenecutWindow; + int fwdScenecutWindow6; /* The offset by which QP is incremented for inter-frames after a scenecut when bEnableSceneCutAwareQp is 1 or 3. * Default is +5. */ - double fwdRefQpDelta; + double fwdRefQpDelta6; /* The offset by which QP is incremented for non-referenced inter-frames after a scenecut when bEnableSceneCutAwareQp is 1 or 3. */ - double fwdNonRefQpDelta; - - /* A genuine threshold used for histogram based scene cut detection. - * This threshold determines whether a frame is a scenecut or not - * when compared against the edge and chroma histogram sad values. - * Default 0.03. Range: Real number in the interval (0,1). */ - double edgeTransitionThreshold; + double fwdNonRefQpDelta6; /* Enables histogram based scenecut detection algorithm to detect scenecuts. Default disabled */ int bHistBasedSceneCut; @@ -1941,13 +2235,39 @@ /* The duration(in milliseconds) for which there is a reduction in the bits spent on the inter-frames before a scenecut * by increasing their QP, when bEnableSceneCutAwareQp is 2 or 3. 
Default is 100ms.*/ - int bwdScenecutWindow; + int bwdMaxScenecutWindow; + int bwdScenecutWindow6; /* The offset by which QP is incremented for inter-frames before a scenecut when bEnableSceneCutAwareQp is 2 or 3. */ - double bwdRefQpDelta; + double bwdRefQpDelta6; /* The offset by which QP is incremented for non-referenced inter-frames before a scenecut when bEnableSceneCutAwareQp is 2 or 3. */ - double bwdNonRefQpDelta; + double bwdNonRefQpDelta6; + + /* Specify combinations of color primaries, transfer characteristics, color matrix, + * range of luma and chroma signals, and chroma sample location. This has higher + * precedence than individual VUI parameters. If any individual VUI option is specified + * together with this, which changes the values set corresponding to the system-id + * or color-volume, it will be discarded. */ + const char* videoSignalTypePreset; + + /* Flag indicating whether the encoder should emit an End of Bitstream + * NAL at the end of bitstream. Default false */ + int bEnableEndOfBitstream; + + /* Flag indicating whether the encoder should emit an End of Sequence + * NAL at the end of every Coded Video Sequence. Default false */ + int bEnableEndOfSequence; + + /* Film Grain Characteristic file */ + char* filmGrain; + + /*Motion compensated temporal filter*/ + int bEnableTemporalFilter; + double temporalFilterStrength; + + /*SBRC*/ + int bEnableSBRC; } x265_param; /* x265_param_alloc: @@ -1982,6 +2302,8 @@ int x265_zone_param_parse(x265_param* p, const char* name, const char* value); +int x265_scenecut_aware_qp_param_parse(x265_param* p, const char* name, const char* value); + static const char * const x265_profile_names = { /* HEVC v1 */ "main", "main10", "mainstillpicture", /* alias */ "msp", @@ -2251,6 +2573,7 @@ void (*param_free)(x265_param*); void (*param_default)(x265_param*); int (*param_parse)(x265_param*, const char*, const char*); + int (*scenecut_aware_qp_param_parse)(x265_param*, const char*, const char*); int (*param_apply_profile)(x265_param*, const char*); int (*param_default_preset)(x265_param*, const char*, const char *); x265_picture* (*picture_alloc)(void);
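Most of the new encoder features declared in this header are reachable from an application as well as from the CLI. Below is a minimal sketch of enabling a few of them; it assumes x265_param_parse() accepts the same long-option names as the new CLI switches (x265's usual convention), and "grain.bin" is a placeholder file name.

#include "x265.h"

int main(void)
{
    x265_param* param = x265_param_alloc();
    x265_param_default_preset(param, "medium", NULL);

    /* hierarchical B-frame (temporal layer) structure; now takes an integer */
    x265_param_parse(param, "temporal-layers", "3");
    /* motion-compensated spatio-temporal filter (MCSTF) */
    x265_param_parse(param, "mcstf", "1");
    /* film grain characteristics SEI; "grain.bin" is a placeholder path */
    x265_param_parse(param, "film-grain", "grain.bin");

    /* After encoding, the temporal layer of each output picture is reported
     * in the new x265_frame_stats::tLayer field (pic_out.frameData.tLayer). */

    x265_param_free(param);
    return 0;
}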
View file
x265_3.5.tar.gz/source/x265cli.cpp -> x265_3.6.tar.gz/source/x265cli.cpp
Changed
@@ -28,8 +28,8 @@ #include "x265cli.h" #include "svt.h" -#define START_CODE 0x00000001 -#define START_CODE_BYTES 4 +#define START_CODE 0x00000001 +#define START_CODE_BYTES 4 #ifdef __cplusplus namespace X265_NS { @@ -166,6 +166,7 @@ H0(" --rdpenalty <0..2> penalty for 32x32 intra TU in non-I slices. 0:disabled 1:RD-penalty 2:maximum. Default %d\n", param->rdPenalty); H0("\nSlice decision options:\n"); H0(" --no-open-gop Enable open-GOP, allows I slices to be non-IDR. Default %s\n", OPT(param->bOpenGOP)); + H0(" --cra-nal Force nal type to CRA to all frames expect first frame, works only with keyint 1. Default %s\n", OPT(param->craNal)); H0("-I/--keyint <integer> Max IDR period in frames. -1 for infinite-gop. Default %d\n", param->keyframeMax); H0("-i/--min-keyint <integer> Scenecuts closer together than this are coded as I, not IDR. Default: auto\n"); H0(" --gop-lookahead <integer> Extends gop boundary if a scenecut is found within this from keyint boundary. Default 0\n"); @@ -174,7 +175,6 @@ H1(" --scenecut-bias <0..100.0> Bias for scenecut detection. Default %.2f\n", param->scenecutBias); H0(" --hist-scenecut Enables histogram based scene-cut detection using histogram based algorithm.\n"); H0(" --no-hist-scenecut Disables histogram based scene-cut detection using histogram based algorithm.\n"); - H1(" --hist-threshold <0.0..1.0> Luma Edge histogram's Normalized SAD threshold for histogram based scenecut detection Default %.2f\n", param->edgeTransitionThreshold); H0(" --no-fades Enable detection and handling of fade-in regions. Default %s\n", OPT(param->bEnableFades)); H1(" --scenecut-aware-qp <0..3> Enable increasing QP for frames inside the scenecut window around scenecut. Default %s\n", OPT(param->bEnableSceneCutAwareQp)); H1(" 0 - Disabled\n"); @@ -182,6 +182,7 @@ H1(" 2 - Backward masking\n"); H1(" 3 - Bidirectional masking\n"); H1(" --masking-strength <string> Comma separated values which specify the duration and offset for the QP increment for inter-frames when scenecut-aware-qp is enabled.\n"); + H1(" --scenecut-qp-config <file> File containing scenecut-aware-qp mode, window duration and offsets settings required for the masking. Works only with --pass 2\n"); H0(" --radl <integer> Number of RADL pictures allowed in front of IDR. Default %d\n", param->radl); H0(" --intra-refresh Use Periodic Intra Refresh instead of IDR frames\n"); H0(" --rc-lookahead <integer> Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth); @@ -262,6 +263,7 @@ H0(" --aq-strength <float> Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength); H0(" --qp-adaptation-range <float> Delta QP range by QP adaptation based on a psycho-visual model (1.0 to 6.0). Default %.2f\n", param->rc.qpAdaptationRange); H0(" --no-aq-motion Block level QP adaptation based on the relative motion between the block and the frame. Default %s\n", OPT(param->bAQMotion)); + H1(" --no-sbrc Enables the segment based rate control. Default %s\n", OPT(param->bEnableSBRC)); H0(" --qg-size <int> Specifies the size of the quantization group (64, 32, 16, 8). Default %d\n", param->rc.qgSize); H0(" --no-cutree Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree)); H0(" --no-rc-grain Enable ratecontrol mode to handle grains specifically. turned on with tune grain. 
Default %s\n", OPT(param->rc.bEnableGrain)); @@ -282,6 +284,7 @@ H1(" q=<integer> (force QP)\n"); H1(" or b=<float> (bitrate multiplier)\n"); H0(" --zonefile <filename> Zone file containing the zone boundaries and the parameters to be reconfigured.\n"); + H0(" --no-zonefile-rc-init This allow to use rate-control history across zones in zonefile.\n"); H1(" --lambda-file <string> Specify a file containing replacement values for the lambda tables\n"); H1(" MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n"); H1(" Blank lines and lines starting with hash(#) are ignored\n"); @@ -314,6 +317,30 @@ H0(" --master-display <string> SMPTE ST 2086 master display color volume info SEI (HDR)\n"); H0(" format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n"); H0(" --max-cll <string> Specify content light level info SEI as \"cll,fall\" (HDR).\n"); + H0(" --video-signal-type-preset <string> Specify combinations of color primaries, transfer characteristics, color matrix, range of luma and chroma signals, and chroma sample location\n"); + H0(" format: <system-id>:<color-volume>\n"); + H0(" This has higher precedence than individual VUI parameters. If any individual VUI option is specified together with this,\n"); + H0(" which changes the values set corresponding to the system-id or color-volume, it will be discarded.\n"); + H0(" The color-volume can be used only with the system-id options BT2100_PQ_YCC, BT2100_PQ_ICTCP, and BT2100_PQ_RGB.\n"); + H0(" system-id options and their corresponding values:\n"); + H0(" BT601_525: --colorprim smpte170m --transfer smpte170m --colormatrix smpte170m --range limited --chromaloc 0\n"); + H0(" BT601_626: --colorprim bt470bg --transfer smpte170m --colormatrix bt470bg --range limited --chromaloc 0\n"); + H0(" BT709_YCC: --colorprim bt709 --transfer bt709 --colormatrix bt709 --range limited --chromaloc 0\n"); + H0(" BT709_RGB: --colorprim bt709 --transfer bt709 --colormatrix gbr --range limited\n"); + H0(" BT2020_YCC_NCL: --colorprim bt2020 --transfer bt2020-10 --colormatrix bt709 --range limited --chromaloc 2\n"); + H0(" BT2020_RGB: --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc --range limited\n"); + H0(" BT2100_PQ_YCC: --colorprim bt2020 --transfer smpte2084 --colormatrix bt2020nc --range limited --chromaloc 2\n"); + H0(" BT2100_PQ_ICTCP: --colorprim bt2020 --transfer smpte2084 --colormatrix ictcp --range limited --chromaloc 2\n"); + H0(" BT2100_PQ_RGB: --colorprim bt2020 --transfer smpte2084 --colormatrix gbr --range limited\n"); + H0(" BT2100_HLG_YCC: --colorprim bt2020 --transfer arib-std-b67 --colormatrix bt2020nc --range limited --chromaloc 2\n"); + H0(" BT2100_HLG_RGB: --colorprim bt2020 --transfer arib-std-b67 --colormatrix gbr --range limited\n"); + H0(" FR709_RGB: --colorprim bt709 --transfer bt709 --colormatrix gbr --range full\n"); + H0(" FR2020_RGB: --colorprim bt2020 --transfer bt2020-10 --colormatrix gbr --range full\n"); + H0(" FRP3D65_YCC: --colorprim smpte432 --transfer bt709 --colormatrix smpte170m --range full --chromaloc 1\n"); + H0(" color-volume options and their corresponding values:\n"); + H0(" P3D65x1000n0005: --master-display G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,5)\n"); + H0(" P3D65x4000n005: --master-display G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(40000000,50)\n"); + H0(" BT2100x108n0005: --master-display G(8500,39850)B(6550,2300)R(34000,146000)WP(15635,16450)L(10000000,1)\n"); H0(" --no-cll Emit content light level info SEI. 
Default %s\n", OPT(param->bEmitCLL)); H0(" --no-hdr10 Control dumping of HDR10 SEI packet. If max-cll or master-display has non-zero values, this is enabled. Default %s\n", OPT(param->bEmitHDR10SEI)); H0(" --no-hdr-opt Add luma and chroma offsets for HDR/WCG content. Default %s. Now deprecated.\n", OPT(param->bHDROpt)); @@ -324,9 +351,11 @@ H0(" --no-repeat-headers Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders)); H0(" --no-info Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI)); H0(" --no-hrd Enable HRD parameters signaling. Default %s\n", OPT(param->bEmitHRDSEI)); - H0(" --no-idr-recovery-sei Emit recovery point infor SEI at each IDR frame \n"); - H0(" --no-temporal-layers Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers)); + H0(" --no-idr-recovery-sei Emit recovery point infor SEI at each IDR frame \n"); + H0(" --temporal-layers Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers)); H0(" --no-aud Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters)); + H0(" --no-eob Emit end of bitstream nal unit at the end of the bitstream. Default %s\n", OPT(param->bEnableEndOfBitstream)); + H0(" --no-eos Emit end of sequence nal unit at the end of every coded video sequence. Default %s\n", OPT(param->bEnableEndOfSequence)); H1(" --hash <integer> Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI); H0(" --atc-sei <integer> Emit the alternative transfer characteristics SEI message where the integer is the preferred transfer characteristics. Default disabled\n"); H0(" --pic-struct <integer> Set the picture structure and emits it in the picture timing SEI message. Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation.\n"); @@ -344,6 +373,7 @@ H0(" --lowpass-dct Use low-pass subband dct approximation. Default %s\n", OPT(param->bLowPassDct)); H0(" --no-frame-dup Enable Frame duplication. Default %s\n", OPT(param->bEnableFrameDuplication)); H0(" --dup-threshold <integer> PSNR threshold for Frame duplication. Default %d\n", param->dupThreshold); + H0(" --no-mcstf Enable GOP based temporal filter. 
Default %d\n", param->bEnableTemporalFilter); #ifdef SVT_HEVC H0(" --nosvt Enable SVT HEVC encoder %s\n", OPT(param->bEnableSvtHevc)); H0(" --no-svt-hme Enable Hierarchial motion estimation(HME) in SVT HEVC encoder \n"); @@ -365,6 +395,9 @@ H1(" 2 - unable to open encoder\n"); H1(" 3 - unable to generate stream headers\n"); H1(" 4 - encoder abort\n"); + H0("\nSEI Message Options\n"); + H0(" --film-grain <filename> File containing Film Grain Characteristics to be written as a SEI Message\n"); + #undef OPT #undef H0 #undef H1 @@ -484,6 +517,9 @@ memcpy(globalParam->rc.zoneszonefileCount.zoneParam, globalParam, sizeof(x265_param)); + if (zonefileCount == 0) + globalParam->rc.zoneszonefileCount.keyframeMax = globalParam->keyframeMax; + for (optind = 0;;) { int long_options_index = -1; @@ -708,12 +744,19 @@ return true; } } + OPT("scenecut-qp-config") + { + this->scenecutAwareQpConfig = x265_fopen(optarg, "rb"); + if (!this->scenecutAwareQpConfig) + x265_log_file(param, X265_LOG_ERROR, "%s scenecut aware qp config file not found or error in opening config file\n", optarg); + } OPT("zonefile") { this->zoneFile = x265_fopen(optarg, "rb"); if (!this->zoneFile) x265_log_file(param, X265_LOG_ERROR, "%s zone file not found or error in opening zone file\n", optarg); } + OPT("no-zonefile-rc-init") this->param->bNoResetZoneConfig = true; OPT("fullhelp") { param->logLevel = X265_LOG_FULL; @@ -875,7 +918,7 @@ if (reconFileBitDepth == 0) reconFileBitDepth = param->internalBitDepth; this->recon = ReconFile::open(reconfn, param->sourceWidth, param->sourceHeight, reconFileBitDepth, - param->fpsNum, param->fpsDenom, param->internalCsp); + param->fpsNum, param->fpsDenom, param->internalCsp, param->sourceBitDepth); if (this->recon->isFail()) { x265_log(param, X265_LOG_WARNING, "unable to write reconstructed outputs file\n"); @@ -973,6 +1016,7 @@ param->rc.zones = X265_MALLOC(x265_zone, param->rc.zonefileCount); for (int i = 0; i < param->rc.zonefileCount; i++) { + param->rc.zonesi.startFrame = -1; while (fgets(line, sizeof(line), zoneFile)) { if (*line == '#' || (strcmp(line, "\r\n") == 0)) @@ -1010,57 +1054,179 @@ return 1; } - /* Parse the RPU file and extract the RPU corresponding to the current picture - * and fill the rpu field of the input picture */ - int CLIOptions::rpuParser(x265_picture * pic) - { - uint8_t byteVal; - uint32_t code = 0; - int bytesRead = 0; - pic->rpu.payloadSize = 0; - - if (!pic->pts) - { - while (bytesRead++ < 4 && fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu)) - code = (code << 8) | byteVal; - - if (code != START_CODE) - { - x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU startcode in POC %d\n", pic->pts); - return 1; - } - } - - bytesRead = 0; - while (fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu)) - { - code = (code << 8) | byteVal; - if (bytesRead++ < 3) - continue; - if (bytesRead >= 1024) - { - x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU size in POC %d\n", pic->pts); - return 1; - } - - if (code != START_CODE) - pic->rpu.payloadpic->rpu.payloadSize++ = (code >> (3 * 8)) & 0xFF; - else - return 0; - } - - int ShiftBytes = START_CODE_BYTES - (bytesRead - pic->rpu.payloadSize); - int bytesLeft = bytesRead - pic->rpu.payloadSize; - code = (code << ShiftBytes * 8); - for (int i = 0; i < bytesLeft; i++) - { - pic->rpu.payloadpic->rpu.payloadSize++ = (code >> (3 * 8)) & 0xFF; - code = (code << 8); - } - if (!pic->rpu.payloadSize) - x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU not found for POC %d\n", pic->pts); - return 0; - } + /* Parse the 
RPU file and extract the RPU corresponding to the current picture + * and fill the rpu field of the input picture */ + int CLIOptions::rpuParser(x265_picture * pic) + { + uint8_t byteVal; + uint32_t code = 0; + int bytesRead = 0; + pic->rpu.payloadSize = 0; + + if (!pic->pts) + { + while (bytesRead++ < 4 && fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu)) + code = (code << 8) | byteVal; + + if (code != START_CODE) + { + x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU startcode in POC %d\n", pic->pts); + return 1; + } + } + + bytesRead = 0; + while (fread(&byteVal, sizeof(uint8_t), 1, dolbyVisionRpu)) + { + code = (code << 8) | byteVal; + if (bytesRead++ < 3) + continue; + if (bytesRead >= 1024) + { + x265_log(NULL, X265_LOG_ERROR, "Invalid Dolby Vision RPU size in POC %d\n", pic->pts); + return 1; + } + + if (code != START_CODE) + pic->rpu.payloadpic->rpu.payloadSize++ = (code >> (3 * 8)) & 0xFF; + else + return 0; + } + + int ShiftBytes = START_CODE_BYTES - (bytesRead - pic->rpu.payloadSize); + int bytesLeft = bytesRead - pic->rpu.payloadSize; + code = (code << ShiftBytes * 8); + for (int i = 0; i < bytesLeft; i++) + { + pic->rpu.payloadpic->rpu.payloadSize++ = (code >> (3 * 8)) & 0xFF; + code = (code << 8); + } + if (!pic->rpu.payloadSize) + x265_log(NULL, X265_LOG_WARNING, "Dolby Vision RPU not found for POC %d\n", pic->pts); + return 0; + } + + bool CLIOptions::parseScenecutAwareQpConfig() + { + char line256; + char* argLine; + rewind(scenecutAwareQpConfig); + while (fgets(line, sizeof(line), scenecutAwareQpConfig)) + { + if (*line == '#' || (strcmp(line, "\r\n") == 0)) + continue; + int index = (int)strcspn(line, "\r\n"); + lineindex = '\0'; + argLine = line; + while (isspace((unsigned char)*argLine)) argLine++; + char* start = strchr(argLine, '-'); + int argCount = 0; + char **args = (char**)malloc(256 * sizeof(char *)); + //Adding a dummy string to avoid file parsing error + argsargCount++ = (char *)"x265"; + char* token = strtok(start, " "); + while (token) + { + argsargCount++ = token; + token = strtok(NULL, " "); + } + argsargCount = NULL; + CLIOptions cliopt; + if (cliopt.parseScenecutAwareQpParam(argCount, args, param)) + { + cliopt.destroy(); + if (cliopt.api) + cliopt.api->param_free(cliopt.param); + exit(1); + } + break; + } + return 1; + } + bool CLIOptions::parseScenecutAwareQpParam(int argc, char **argv, x265_param* globalParam) + { + bool bError = false; + int bShowHelp = false; + int outputBitDepth = 0; + const char *profile = NULL; + /* Presets are applied before all other options. 
*/ + for (optind = 0;;) + { + int c = getopt_long(argc, argv, short_options, long_options, NULL); + if (c == -1) + break; + else if (c == 'D') + outputBitDepth = atoi(optarg); + else if (c == 'P') + profile = optarg; + else if (c == '?') + bShowHelp = true; + } + if (!outputBitDepth && profile) + { + /*try to derive the output bit depth from the requested profile*/ + if (strstr(profile, "10")) + outputBitDepth = 10; + else if (strstr(profile, "12")) + outputBitDepth = 12; + else + outputBitDepth = 8; + } + api = x265_api_get(outputBitDepth); + if (!api) + { + x265_log(NULL, X265_LOG_WARNING, "falling back to default bit-depth\n"); + api = x265_api_get(0); + } + if (bShowHelp) + { + printVersion(globalParam, api); + showHelp(globalParam); + } + for (optind = 0;;) + { + int long_options_index = -1; + int c = getopt_long(argc, argv, short_options, long_options, &long_options_index); + if (c == -1) + break; + if (long_options_index < 0 && c > 0) + { + for (size_t i = 0; i < sizeof(long_options) / sizeof(long_options0); i++) + { + if (long_optionsi.val == c) + { + long_options_index = (int)i; + break; + } + } + if (long_options_index < 0) + { + /* getopt_long might have already printed an error message */ + if (c != 63) + x265_log(NULL, X265_LOG_WARNING, "internal error: short option '%c' has no long option\n", c); + return true; + } + } + if (long_options_index < 0) + { + x265_log(NULL, X265_LOG_WARNING, "short option '%c' unrecognized\n", c); + return true; + } + bError |= !!api->scenecut_aware_qp_param_parse(globalParam, long_optionslong_options_index.name, optarg); + if (bError) + { + const char *name = long_options_index > 0 ? long_optionslong_options_index.name : argvoptind - 2; + x265_log(NULL, X265_LOG_ERROR, "invalid argument: %s = %s\n", name, optarg); + return true; + } + } + if (optind < argc) + { + x265_log(param, X265_LOG_WARNING, "extra unused command arguments given <%s>\n", argvoptind); + return true; + } + return false; + } #ifdef __cplusplus }
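The new --video-signal-type-preset option documented in the help text above bundles the VUI color settings into a single string. A small sketch of the equivalent library-side call follows; it assumes x265_param_parse() accepts the same option name as the CLI, and the preset string is the one exercised in regression-tests.txt in this revision.

#include "x265.h"

/* Sketch only: request BT.2100 PQ YCbCr signalling plus the BT2100x108n0005
 * mastering-display values with one option, instead of setting colorprim,
 * transfer, colormatrix, range, chromaloc and master-display individually. */
static void set_bt2100_pq(x265_param* param)
{
    x265_param_parse(param, "video-signal-type-preset",
                     "BT2100_PQ_YCC:BT2100x108n0005");
}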
View file
x265_3.5.tar.gz/source/x265cli.h -> x265_3.6.tar.gz/source/x265cli.h
Changed
@@ -135,6 +135,7 @@
     { "no-fast-intra", no_argument, NULL, 0 },
     { "no-open-gop", no_argument, NULL, 0 },
     { "open-gop", no_argument, NULL, 0 },
+    { "cra-nal", no_argument, NULL, 0 },
     { "keyint", required_argument, NULL, 'I' },
     { "min-keyint", required_argument, NULL, 'i' },
     { "gop-lookahead", required_argument, NULL, 0 },
@@ -143,7 +144,6 @@
     { "scenecut-bias", required_argument, NULL, 0 },
     { "hist-scenecut", no_argument, NULL, 0},
     { "no-hist-scenecut", no_argument, NULL, 0},
-    { "hist-threshold", required_argument, NULL, 0},
     { "fades", no_argument, NULL, 0 },
     { "no-fades", no_argument, NULL, 0 },
     { "scenecut-aware-qp", required_argument, NULL, 0 },
@@ -182,6 +182,8 @@
     { "qp", required_argument, NULL, 'q' },
     { "aq-mode", required_argument, NULL, 0 },
     { "aq-strength", required_argument, NULL, 0 },
+    { "sbrc", no_argument, NULL, 0 },
+    { "no-sbrc", no_argument, NULL, 0 },
     { "rc-grain", no_argument, NULL, 0 },
     { "no-rc-grain", no_argument, NULL, 0 },
     { "ipratio", required_argument, NULL, 0 },
@@ -244,6 +246,7 @@
     { "crop-rect", required_argument, NULL, 0 }, /* DEPRECATED */
     { "master-display", required_argument, NULL, 0 },
     { "max-cll", required_argument, NULL, 0 },
+    {"video-signal-type-preset", required_argument, NULL, 0 },
     { "min-luma", required_argument, NULL, 0 },
     { "max-luma", required_argument, NULL, 0 },
     { "log2-max-poc-lsb", required_argument, NULL, 8 },
@@ -263,11 +266,16 @@
     { "repeat-headers", no_argument, NULL, 0 },
     { "aud", no_argument, NULL, 0 },
     { "no-aud", no_argument, NULL, 0 },
+    { "eob", no_argument, NULL, 0 },
+    { "no-eob", no_argument, NULL, 0 },
+    { "eos", no_argument, NULL, 0 },
+    { "no-eos", no_argument, NULL, 0 },
     { "info", no_argument, NULL, 0 },
     { "no-info", no_argument, NULL, 0 },
     { "zones", required_argument, NULL, 0 },
     { "qpfile", required_argument, NULL, 0 },
     { "zonefile", required_argument, NULL, 0 },
+    { "no-zonefile-rc-init", no_argument, NULL, 0 },
     { "lambda-file", required_argument, NULL, 0 },
     { "b-intra", no_argument, NULL, 0 },
     { "no-b-intra", no_argument, NULL, 0 },
@@ -298,8 +306,7 @@
     { "dynamic-refine", no_argument, NULL, 0 },
     { "no-dynamic-refine", no_argument, NULL, 0 },
     { "strict-cbr", no_argument, NULL, 0 },
-    { "temporal-layers", no_argument, NULL, 0 },
-    { "no-temporal-layers", no_argument, NULL, 0 },
+    { "temporal-layers", required_argument, NULL, 0 },
     { "qg-size", required_argument, NULL, 0 },
     { "recon-y4m-exec", required_argument, NULL, 0 },
     { "analyze-src-pics", no_argument, NULL, 0 },
@@ -349,6 +356,8 @@
     { "frame-dup", no_argument, NULL, 0 },
     { "no-frame-dup", no_argument, NULL, 0 },
     { "dup-threshold", required_argument, NULL, 0 },
+    { "mcstf", no_argument, NULL, 0 },
+    { "no-mcstf", no_argument, NULL, 0 },
 #ifdef SVT_HEVC
     { "svt", no_argument, NULL, 0 },
     { "no-svt", no_argument, NULL, 0 },
@@ -373,6 +382,8 @@
     { "abr-ladder", required_argument, NULL, 0 },
     { "min-vbv-fullness", required_argument, NULL, 0 },
     { "max-vbv-fullness", required_argument, NULL, 0 },
+    { "scenecut-qp-config", required_argument, NULL, 0 },
+    { "film-grain", required_argument, NULL, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
@@ -388,6 +399,7 @@
     FILE* qpfile;
     FILE* zoneFile;
     FILE* dolbyVisionRpu; /* File containing Dolby Vision BL RPU metadata */
+    FILE* scenecutAwareQpConfig; /* File containing scenecut aware frame quantization related CLI options */
     const char* reconPlayCmd;
     const x265_api* api;
     x265_param* param;
@@ -425,6 +437,7 @@
         qpfile = NULL;
         zoneFile = NULL;
         dolbyVisionRpu = NULL;
+        scenecutAwareQpConfig = NULL;
         reconPlayCmd = NULL;
         api = NULL;
         param = NULL;
@@ -455,6 +468,8 @@
     bool parseQPFile(x265_picture &pic_org);
     bool parseZoneFile();
     int rpuParser(x265_picture * pic);
+    bool parseScenecutAwareQpConfig();
+    bool parseScenecutAwareQpParam(int argc, char **argv, x265_param* globalParam);
 };
 #ifdef __cplusplus
 }
View file
x265_3.5.tar.gz/x265Version.txt -> x265_3.6.tar.gz/x265Version.txt
Changed
@@ -1,4 +1,4 @@
 #Attribute: Values
-repositorychangeset: f0c1022b6
+repositorychangeset: aa7f602f7
 releasetagdistance: 1
-releasetag: 3.5
+releasetag: 3.6