x265: Changes of Revision 13
x265.changes
Changed
@@ -1,4 +1,40 @@
 -------------------------------------------------------------------
+Sun Aug 28 11:51:23 UTC 2016 - joerg.lorenzen@ki.tng.de
+
+- Update to version 2.0
+  API and Key Behavior Changes
+  * x265_rc_stats added to x265_picture, containing all RC decision
+    points for that frame.
+  * PTL: high tier is now allowed by default, chosen only if
+    necessary.
+  * multi-pass: First pass now uses slow-firstpass by default,
+    enabling better RC decisions in future passes.
+  * pools: fix behaviour on multi-socketed Windows systems, provide
+    more flexibility in determining thread and pool counts.
+  * ABR: improve bits allocation in the first few frames, abr reset,
+    vbv and cutree improved.
+  New Features
+  * uhd-bd: Enforce Ultra-HD Blu-ray Disc parameters
+    (overrides any other settings).
+  * rskip: Enables skipping recursion to analyze lower CU sizes
+    using heuristics at different rd-levels. Provides good visual
+    quality gains at the highest quality presets.
+  * rc-grain: Enables a new rate control mode specifically for
+    grainy content. Strictly prevents QP oscillations within and
+    between frames to avoid grain fluctuations.
+  * tune grain: A fully refactored and improved option to encode
+    film grain content including QP control as well as analysis
+    options.
+  * asm: ARM assembly is now enabled by default, native or cross
+    compiled builds supported on armv6 and later systems.
+  Misc
+  * An SSIM calculation bug was corrected
+- soname bump to 87.
+- Fixed arm.patch.
+- Added libnuma-devel as buildrequires for arch x86_64 (except
+  for openSUSE 13.1 because libnuma-devel >= 2.0.9 is required).
+
+-------------------------------------------------------------------
 Wed Feb 3 13:22:42 UTC 2016 - idonmez@suse.com
 
 - Update to version 1.9
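The features named above surface as command-line options documented further down in this revision (cli.rst). A minimal usage sketch; the file names are placeholders, not taken from the changelog:

    # --tune grain applies the refactored grain tuning and implies --rc-grain
    x265 --input film.y4m --tune grain --output film.hevc

    # enforce Ultra-HD Blu-ray constraints, overriding conflicting settings
    x265 --input uhd.y4m --uhd-bd --output uhd.hevc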
x265.spec
Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name:           x265
-%define soname  79
+%define soname  87
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version:        1.9
+Version:        2.0
 Release:        0
 License:        GPL-2.0+
 Summary:        A free h265/HEVC encoder - encoder binary
@@ -14,6 +14,13 @@
 Patch0:         arm.patch
 BuildRequires:  gcc gcc-c++
 BuildRequires:  cmake >= 2.8.8
+# for openSUSE 13.1 only libnuma-devel = 2.0.8 is available, but version 2.0.9 or higher is required
+# build against version 2.0.8 fails with "error: 'numa_bitmask_weight' was not declared in this scope"
+%if ! ( 0%{?suse_version} == 1310 )
+%ifarch x86_64
+BuildRequires:  libnuma-devel >= 2.0.9
+%endif
+%endif
 BuildRequires:  pkg-config
 BuildRequires:  yasm >= 1.2.0
 BuildRoot:      %{_tmppath}/%{name}-%{version}-build
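The soname bump is the user-visible result of the ABI change (X265_BUILD 79 to 87). One way to verify it on the built library, as a sketch (the install path is hypothetical):

    objdump -p /usr/lib64/libx265.so.87 | grep SONAME
    #   SONAME               libx265.so.87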
arm.patch
Changed
@@ -1,19 +1,25 @@
-Index: x265_11047/source/CMakeLists.txt
+Index: x265_2.0/source/CMakeLists.txt
 ===================================================================
---- x265_11047.orig/source/CMakeLists.txt
-+++ x265_11047/source/CMakeLists.txt
-@@ -56,10 +56,22 @@ elseif(POWERMATCH GREATER "-1")
+--- x265_2.0.orig/source/CMakeLists.txt
++++ x265_2.0/source/CMakeLists.txt
+@@ -60,15 +60,22 @@
     message(STATUS "Detected POWER target processor")
     set(POWER 1)
     add_definitions(-DX265_ARCH_POWER=1)
+-elseif(ARMMATCH GREATER "-1")
+-    if(CROSS_COMPILE_ARM)
+-        message(STATUS "Cross compiling for ARM arch")
+-    else()
+-        set(CROSS_COMPILE_ARM 0)
+-    endif()
+-    message(STATUS "Detected ARM target processor")
+-    set(ARM 1)
+-    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1)
 +elseif(${SYSPROC} MATCHES "armv5.*")
 +    message(STATUS "Detected ARMV5 system processor")
 +    set(ARMV5 1)
 +    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=0 -DHAVE_NEON=0)
- elseif(${SYSPROC} STREQUAL "armv6l")
--    message(STATUS "Detected ARM target processor")
--    set(ARM 1)
--    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1)
++elseif(${SYSPROC} STREQUAL "armv6l")
 +    message(STATUS "Detected ARMV6 system processor")
 +    set(ARMV6 1)
 +    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1 -DHAVE_NEON=0)
@@ -28,21 +34,32 @@
 else()
     message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown")
     message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}")
-@@ -169,8 +181,8 @@ if(GCC)
-    elseif(X86 AND NOT X64)
-        add_definitions(-march=i686)
 endif()
--    if(ARM)
--        add_definitions(-march=armv6 -mfloat-abi=hard -mfpu=vfp)
+@@ -186,18 +193,9 @@
+        add_definitions(-march=i686)
+    endif()
 endif()
+-    if(ARM AND CROSS_COMPILE_ARM)
+-        set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC)
+-    elseif(ARM)
+-        find_package(Neon)
+-        if(CPU_HAS_NEON)
+-            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
+-            add_definitions(-DHAVE_NEON)
+-        else()
+-            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
+-        endif()
+-    endif()
+-    add_definitions(${ARM_ARGS})
 +    if(ARMV7)
 +        add_definitions(-fPIC)
++    endif()
 if(FPROFILE_GENERATE)
     if(INTEL_CXX)
+        add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}")
-Index: x265_11047/source/common/cpu.cpp
+Index: x265_2.0/source/common/cpu.cpp
 ===================================================================
---- x265_11047.orig/source/common/cpu.cpp
-+++ x265_11047/source/common/cpu.cpp
+--- x265_2.0.orig/source/common/cpu.cpp
++++ x265_2.0/source/common/cpu.cpp
 @@ -37,7 +37,7 @@
 #include <machine/cpu.h>
 #endif
@@ -52,3 +69,20 @@
 #include <signal.h>
 #include <setjmp.h>
 static sigjmp_buf jmpbuf;
+@@ -340,7 +340,6 @@
+ }
+
+ canjump = 1;
+- PFX(cpu_neon_test)();
+ canjump = 0;
+ signal(SIGILL, oldsig);
+ #endif // if !HAVE_NEON
+@@ -356,7 +355,7 @@
+ // which may result in incorrect detection and the counters stuck enabled.
+ // right now Apple does not seem to support performance counters for this test
+ #ifndef __MACH__
+- flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0;
++ //flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0;
+ #endif
+ // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc)
+ #endif // if HAVE_ARMV6
x265_1.9.tar.gz/.hg_archival.txt -> x265_2.0.tar.gz/.hg_archival.txt
Changed
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 1d3b6e448e01ec40b392ef78b7e55a86249fbe68
+node: 960c9991d0dcf46559c32e070418d3cbb7e8aa2f
 branch: stable
-tag: 1.9
+tag: 2.0
x265_1.9.tar.gz/.hgtags -> x265_2.0.tar.gz/.hgtags
Changed
@@ -17,3 +17,4 @@
 cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6
 8425278def1edf0931dc33fc518e1950063e76b0 1.7
 e27327f5da35c5feb660360336fdc94bd0afe719 1.8
+1d3b6e448e01ec40b392ef78b7e55a86249fbe68 1.9
x265_2.0.tar.gz/build/arm-linux/crosscompile.cmake
Added
@@ -0,0 +1,15 @@
+# CMake toolchain file for cross compiling x265 for ARM arch
+# This feature is only supported as experimental. Use with caution.
+# Please report bugs on bitbucket
+# Run cmake with: cmake -DCMAKE_TOOLCHAIN_FILE=crosscompile.cmake -G "Unix Makefiles" ../../source && ccmake ../../source
+
+set(CROSS_COMPILE_ARM 1)
+set(CMAKE_SYSTEM_NAME Linux)
+set(CMAKE_SYSTEM_PROCESSOR armv6l)
+
+# specify the cross compiler
+set(CMAKE_C_COMPILER arm-linux-gnueabi-gcc)
+set(CMAKE_CXX_COMPILER arm-linux-gnueabi-g++)
+
+# specify the target environment
+SET(CMAKE_FIND_ROOT_PATH /usr/arm-linux-gnueabi)
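The header comment gives the configure invocation; the complete cross-build flow it implies is roughly the following sketch (the final make step and running from the in-tree build/arm-linux directory are assumptions, not stated in the file):

    cd build/arm-linux
    cmake -DCMAKE_TOOLCHAIN_FILE=crosscompile.cmake -G "Unix Makefiles" ../../source
    make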
x265_2.0.tar.gz/build/arm-linux/make-Makefiles.bash
Added
@@ -0,0 +1,4 @@
+#!/bin/bash
+# Run this from within a bash shell
+
+cmake -G "Unix Makefiles" ../../source && ccmake ../../source
x265_1.9.tar.gz/doc/reST/api.rst -> x265_2.0.tar.gz/doc/reST/api.rst
Changed
@@ -180,7 +180,8 @@
 * used to modify encoder parameters.
 * various parameters from x265_param are copied.
 * this takes effect immediately, on whichever frame is encoded next;
-* returns 0 on success, negative on parameter validation error.
+* returns negative on parameter validation error, 0 on successful reconfigure
+* and 1 when a reconfigure is already in progress.
 *
 * not all parameters can be changed; see the actual function for a
 * detailed breakdown. since not all parameters can be changed, moving
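A minimal C sketch of handling the revised contract, given an open x265_encoder *encoder; the bitrate change is a hypothetical example, and only the negative/0/1 semantics come from the comment above:

    x265_param param;
    x265_encoder_parameters(encoder, &param); /* copy the encoder's current settings */
    param.rc.bitrate = 2000;                  /* hypothetical new target, in kbps */
    int ret = x265_encoder_reconfig(encoder, &param);
    if (ret < 0)
        ;   /* parameter validation error: encoder keeps its previous settings */
    else if (ret == 1)
        ;   /* an earlier reconfigure is still in progress: retry later */
    else
        ;   /* 0: takes effect on whichever frame is encoded next */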
x265_1.9.tar.gz/doc/reST/cli.rst -> x265_2.0.tar.gz/doc/reST/cli.rst
Changed
@@ -376,10 +376,10 @@
 .. option:: --dither
 
-    Enable high quality downscaling. Dithering is based on the diffusion
-    of errors from one row of pixels to the next row of pixels in a
-    picture. Only applicable when the input bit depth is larger than
-    8bits and internal bit depth is 8bits. Default disabled
+    Enable high quality downscaling to the encoder's internal bitdepth.
+    Dithering is based on the diffusion of errors from one row of pixels
+    to the next row of pixels in a picture. Only applicable when the
+    input bit depth is larger than 8bits. Default disabled
 
     **CLI ONLY**
@@ -522,16 +522,14 @@
 .. option:: --high-tier, --no-high-tier
 
-    If :option:`--level-idc` has been specified, the option adds the
-    intention to support the High tier of that level. If your specified
-    level does not support a High tier, a warning is issued and this
-    modifier flag is ignored. If :option:`--level-idc` has been specified,
-    but not --high-tier, then the encoder will attempt to encode at the
-    specified level, main tier first, turning on high tier only if
-    necessary and available at that level.
+    If :option:`--level-idc` has been specified, --high-tier allows the
+    support of high tier at that level. The encoder will first attempt to
+    encode at the specified level, main tier first, turning on high tier
+    only if necessary and available at that level. If your requested level
+    does not support a High tier, high tier will not be supported. If
+    --no-high-tier has been specified, then the encoder will attempt to
+    encode only at the main tier.
 
-    If :option:`--level-idc` has not been specified, this argument is
-    ignored.
+    Default: enabled
 
 .. option:: --ref <1..16>
@@ -564,6 +562,15 @@
 
     Default: disabled
 
+.. option:: --uhd-bd
+
+    Enable Ultra HD Blu-ray format support. If specified with incompatible
+    encoding options, the encoder will attempt to modify/set the right
+    encode specifications. If the encoder is unable to do so, this option
+    will be turned OFF. Highly experimental.
+
+    Default: disabled
+
 .. note::
 
     :option:`--profile`, :option:`--level-idc`, and
@@ -600,7 +607,7 @@
 Mode decision / Analysis
 ========================
 
-.. option:: --rd <0..6>
+.. option:: --rd <1..6>
 
     Level of RDO in mode decision. The higher the value, the more
     exhaustive the analysis and the more rate distortion optimization is
@@ -629,7 +636,7 @@
     | 6 | Currently same as 5 |
     +-------+---------------------------------------------------------------+
 
-    **Range of values:** 0: least .. 6: full RDO analysis
+    **Range of values:** 1: least .. 6: full RDO analysis
 
 Options which affect the coding unit quad-tree, sometimes referred to
 as the prediction quad-tree.
@@ -722,8 +729,18 @@
 .. option:: --early-skip, --no-early-skip
 
-    Measure full CU size (2Nx2N) merge candidates first; if no residual
-    is found the analysis is short circuited. Default disabled
+    Measure 2Nx2N merge candidates first; if no residual is found,
+    additional modes at that depth are not analysed. Default disabled
+
+.. option:: --rskip, --no-rskip
+
+    This option determines early exit from CU depth recursion. When a skip CU is
+    found, additional heuristics (depending on rd-level) are used to decide whether
+    to terminate recursion. In rdlevels 5 and 6, comparison with inter2Nx2N is used,
+    while at rdlevels 4 and below, neighbour costs are used to skip recursion.
+    Provides minimal quality degradation at good performance gains when enabled.
+
+    Default: enabled, disabled for :option:`--tune grain`
 
 .. option:: --fast-intra, --no-fast-intra
@@ -756,6 +773,14 @@
     evaluate if luma used tskip. Inter block tskip analysis is
     unmodified. Default disabled
 
+.. option:: --rd-refine, --no-rd-refine
+
+    For each analysed CU, calculate R-D cost on the best partition mode
+    for a range of QP values, to find the optimal rounding effect.
+    Default disabled.
+
+    Only effective at RD levels 5 and 6
+
 Analysis re-use options, to improve performance when encoding the same
 sequence multiple times (presumably at varying bitrates). The encoder
 will not reuse analysis if the resolution and slice type parameters do
@@ -1039,7 +1064,7 @@
 cause ringing artifacts. psy-rdoq is less accurate than psy-rd, it is
 biasing towards energy in general while psy-rd biases towards the energy
 of the source image. But very large psy-rdoq values can sometimes be
-beneficial, preserving film grain for instance.
+beneficial.
 
 As a general rule, when both psycho-visual features are disabled, the
 encoder will tend to blur blocks in areas of difficult motion. Turning
@@ -1076,8 +1101,8 @@
     energy in the reconstructed image. This generally improves perceived
     visual quality at the cost of lower quality metric scores. It only has
     effect when :option:`--rdoq-level` is 1 or 2. High values can
-    be beneficial in preserving high-frequency detail like film grain.
-    Default: 1.0
+    be beneficial in preserving high-frequency detail.
+    Default: 0.0 (1.0 for presets slow, slower, veryslow)
 
     **Range of values:** 0 .. 50.0
@@ -1336,13 +1361,13 @@
 .. option:: --slow-firstpass, --no-slow-firstpass
 
-    Enable a slow and more detailed first pass encode in multi-pass rate
-    control mode. Speed of the first pass encode is slightly lesser and
-    quality mildly improved when compared to the default settings in a
-    multi-pass encode. Default disabled (turbo mode enabled)
+    Enable first pass encode with the exact settings specified.
+    The quality in subsequent multi-pass encodes is better
+    (compared to first pass) when the settings match across each pass.
+    Default enabled.
 
-    When **turbo** first pass is not disabled, these options are
-    set on the first pass to improve performance:
+    When slow first pass is disabled, a **turbo** encode with the following
+    go-fast options is used to improve performance:
 
     * :option:`--fast-intra`
     * :option:`--no-rect`
@@ -1408,7 +1433,16 @@
 
     The maximum single adjustment in QP allowed to rate control. Default 4
 
-
+
+.. option:: --rc-grain, --no-rc-grain
+
+    Enables a specialised ratecontrol algorithm for film grain content. This
+    parameter strictly minimises QP fluctuations within and across frames
+    and removes pulsing of grain. Default disabled.
+    Enabled when :option:`--tune` grain is applied. It is highly recommended
+    that this option is used through the tune grain feature where a combination
+    of param options are used to improve visual quality.
+
 .. option:: --qblur <float>
 
     Temporally blur quants. Default 0.5
@@ -1660,10 +1694,13 @@
    a string which is parsed when the stream header SEI are emitted. The
    string format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
    where %hu are unsigned 16bit integers and %u are unsigned 32bit
-   integers. The SEI includes X,Y display primaries for RGB channels,
-   white point X,Y and max,min luminance values. (HDR)
+   integers. The SEI includes X,Y display primaries for RGB channels
+   and white point (WP) in units of 0.00002 and max,min luminance (L)
+   values in units of 0.0001 candela per meter square. (HDR)
 
-   Example for D65P3 1000-nits:
+   Example for a P3D65 1000-nits monitor, where G(x=0.265, y=0.690),
+   B(x=0.150, y=0.060), R(x=0.680, y=0.320), WP(x=0.3127, y=0.3290),
+   L(max=1000, min=0.0001):
 
        G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
@@ -1672,8 +1709,9 @@
 
 .. option:: --max-cll <string>
 
-   Maximum content light level and maximum frame average light level as
-   required by the Consumer Electronics Association 861.3 specification.
+   Maximum content light level (MaxCLL) and maximum frame average light
+   level (MaxFALL) as required by the Consumer Electronics Association
+   861.3 specification.
 
    Specified as a string which is parsed when the stream header SEI are
    emitted. The string format is "%hu,%hu" where %hu are unsigned 16bit
@@ -1681,6 +1719,11 @@
    maximum is indicated), the second value is the maximum picture average
    light level (or 0). (HDR)
 
+   Example for MaxCLL=1000 candela per square meter, MaxFALL=400
+   candela per square meter:
+
+       --max-cll "1000,400"
+
    Note that this string value will need to be escaped or quoted to
    protect against shell expansion on many platforms. No default.
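Both SEI strings on one command line, using the example values from the text above — a sketch, assuming the surrounding option is --master-display (the option whose format string this section documents); file names are placeholders, and the quoting protects the parentheses and comma from the shell:

    x265 --input in.y4m --output out.hevc \
         --master-display "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)" \
         --max-cll "1000,400"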
x265_1.9.tar.gz/doc/reST/presets.rst -> x265_2.0.tar.gz/doc/reST/presets.rst
Changed
@@ -21,68 +21,80 @@
 The presets adjust encoder parameters as shown in the following table.
 Any parameters below that are specified in your command-line will be
 changed from the value specified by the preset.
-
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| |ultrafast |superfast |veryfast |faster |fast |medium |slow |slower |veryslow |placebo |
-+=================+==========+==========+=========+=======+=====+=======+=====+=======+=========+========+
-| ctu | 32 | 32 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| min-cu-size | 16 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| bframes | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 8 | 8 | 8 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| b-adapt | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| rc-lookahead | 5 | 10 | 15 | 15 | 15 | 20 | 25 | 30 | 40 | 60 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| lookahead-slices| 8 | 8 | 8 | 8 | 8 | 8 | 4 | 4 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| scenecut | 0 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| ref | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| limit-refs | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 2 | 1 | 0 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| me | dia | hex | hex | hex |hex | hex |star | star | star | star |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| merange | 57 | 57 | 57 | 57 | 57 | 57 | 57 | 57 | 57 | 92 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| subme | 0 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 4 | 5 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| rect | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| amp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| limit-modes | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| max-merge | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 5 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| early-skip | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| fast-intra | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| b-intra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| sao | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| signhide | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| weightp | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| weightb | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| aq-mode | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| cuTree | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| rdLevel | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 6 | 6 | 6 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| rdoq-level | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| tu-intra | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 4 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
-| tu-inter | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 4 |
-+-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+ 0. ultrafast
+ 1. superfast
+ 2. veryfast
+ 3. faster
+ 4. fast
+ 5. medium **(default)**
+ 6. slow
+ 7. slower
+ 8. veryslow
+ 9. placebo
+
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| preset | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
++=================+=====+=====+=====+=====+=====+=====+======+======+======+======+
+| ctu | 32 | 32 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| min-cu-size | 16 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| bframes | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 8 | 8 | 8 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| b-adapt | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| rc-lookahead | 5 | 10 | 15 | 15 | 15 | 20 | 25 | 30 | 40 | 60 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| lookahead-slices| 8 | 8 | 8 | 8 | 8 | 8 | 4 | 4 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| scenecut | 0 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 | 40 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| ref | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| limit-refs | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 2 | 1 | 0 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| me | dia | hex | hex | hex | hex | hex | star | star | star | star |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| merange | 57 | 57 | 57 | 57 | 57 | 57 | 57 | 57 | 57 | 92 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| subme | 0 | 1 | 1 | 2 | 2 | 2 | 3 | 3 | 4 | 5 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| rect | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| amp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| limit-modes | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| max-merge | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 4 | 5 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| early-skip | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| recursion-skip | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| fast-intra | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| b-intra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| sao | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| signhide | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| weightp | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| weightb | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| aq-mode | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| cuTree | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| rdLevel | 2 | 2 | 2 | 2 | 2 | 3 | 4 | 6 | 6 | 6 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| rdoq-level | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| tu-intra | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 4 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
+| tu-inter | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 4 |
++-----------------+-----+-----+-----+-----+-----+-----+------+------+------+------+
 
 .. _tunings:
@@ -117,33 +129,32 @@
 
-Film Grain Retention
-~~~~~~~~~~~~~~~~~~~~
-
-:option:`--tune` *grain* tries to improve the retention of film grain in
-the reconstructed output. It disables rate distortion optimizations in
-quantization, and increases the default psy-rd.
-
- * :option:`--psy-rd` 0.5
- * :option:`--rdoq-level` 0
- * :option:`--psy-rdoq` 0
-
-It lowers the strength of adaptive quantization, so residual energy can
-be more evenly distributed across the (noisy) picture:
+Film Grain
+~~~~~~~~~~
 
- * :option:`--aq-strength` 0.3
-
-And it similarly tunes rate control to prevent the slice QP from
-swinging too wildly from frame to frame:
+:option:`--tune` *grain* aims to encode grainy content with the best
+visual quality. The purpose of this option is neither to retain nor
+eliminate grain, but prevent noticeable artifacts caused by uneven
+distribution of grain. :option:`--tune` *grain* strongly restricts
+algorithms that vary the quantization parameter within and across frames.
+Tune grain also biases towards decisions that retain more high frequency
+components.
 
+ * :option:`--aq-mode` 0
+ * :option:`--cutree` 0
  * :option:`--ipratio` 1.1
- * :option:`--pbratio` 1.1
- * :option:`--qcomp` 0.8
-
-And lastly it reduces the strength of deblocking to prevent grain being
-blurred on block boundaries:
-
- * :option:`--deblock` -2
+ * :option:`--pbratio` 1.0
+ * :option:`--qpstep` 1
+ * :option:`--sao` 0
+ * :option:`--psy-rd` 4.0
+ * :option:`--psy-rdoq` 10.0
+ * :option:`--recursion-skip` 0
+
+It also enables a specialised ratecontrol algorithm :option:`--rc-grain`
+that strictly minimises QP fluctuations across frames, while still allowing
+the encoder to hit bitrate targets and VBV buffer limits (with a slightly
+higher margin of error than normal). It is highly recommended that this
+algorithm is used only through the :option:`--tune` *grain* feature.
 
 Fast Decode
 ~~~~~~~~~~~
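In use, the whole option set above collapses into a single switch; a sketch with placeholder file names:

    # applies the aq/psy/rate-control settings listed above and enables --rc-grain
    x265 --input grainy.y4m --tune grain --output grainy.hevc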
x265_1.9.tar.gz/source/CMakeLists.txt -> x265_2.0.tar.gz/source/CMakeLists.txt
Changed
@@ -30,7 +30,7 @@
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 79)
+set(X265_BUILD 87)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -41,7 +41,9 @@
 # System architecture detection
 string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC)
 set(X86_ALIASES x86 i386 i686 x86_64 amd64)
+set(ARM_ALIASES armv6l armv7l)
 list(FIND X86_ALIASES "${SYSPROC}" X86MATCH)
+list(FIND ARM_ALIASES "${SYSPROC}" ARMMATCH)
 set(POWER_ALIASES ppc64 ppc64le)
 list(FIND POWER_ALIASES "${SYSPROC}" POWERMATCH)
 if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1")
@@ -58,7 +60,12 @@
     message(STATUS "Detected POWER target processor")
     set(POWER 1)
     add_definitions(-DX265_ARCH_POWER=1)
-elseif(${SYSPROC} STREQUAL "armv6l")
+elseif(ARMMATCH GREATER "-1")
+    if(CROSS_COMPILE_ARM)
+        message(STATUS "Cross compiling for ARM arch")
+    else()
+        set(CROSS_COMPILE_ARM 0)
+    endif()
     message(STATUS "Detected ARM target processor")
     set(ARM 1)
     add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1)
@@ -174,11 +181,23 @@
         add_definitions(-march=native)
     endif()
 elseif(X86 AND NOT X64)
-    add_definitions(-march=i686)
+    string(FIND "${CMAKE_CXX_FLAGS}" "-march" marchPos)
+    if(marchPos LESS "0")
+        add_definitions(-march=i686)
+    endif()
 endif()
-    if(ARM)
-        add_definitions(-march=armv6 -mfloat-abi=hard -mfpu=vfp)
+    if(ARM AND CROSS_COMPILE_ARM)
+        set(ARM_ARGS -march=armv6 -mfloat-abi=soft -mfpu=vfp -marm -fPIC)
+    elseif(ARM)
+        find_package(Neon)
+        if(CPU_HAS_NEON)
+            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=neon -marm -fPIC)
+            add_definitions(-DHAVE_NEON)
+        else()
+            set(ARM_ARGS -mcpu=native -mfloat-abi=hard -mfpu=vfp -marm)
+        endif()
     endif()
+    add_definitions(${ARM_ARGS})
     if(FPROFILE_GENERATE)
         if(INTEL_CXX)
             add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}")
@@ -269,7 +288,9 @@
 endif(GCC)
 
 find_package(Yasm)
-if(YASM_FOUND AND X86)
+if(ARM OR CROSS_COMPILE_ARM)
+    option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" ON)
+elseif(YASM_FOUND AND X86)
     if (YASM_VERSION_STRING VERSION_LESS "1.2.0")
         message(STATUS "Yasm version ${YASM_VERSION_STRING} is too old. 1.2.0 or later required")
         option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" OFF)
@@ -409,7 +430,7 @@
 add_subdirectory(encoder)
 add_subdirectory(common)
 
-if((MSVC_IDE OR XCODE) AND ENABLE_ASSEMBLY)
+if((MSVC_IDE OR XCODE OR GCC) AND ENABLE_ASSEMBLY)
     # this is required because of this cmake bug
     # http://www.cmake.org/Bug/print_bug_page.php?bug_id=8170
     if(WIN32)
@@ -417,19 +438,36 @@
     else()
         set(SUFFIX o)
     endif()
-    foreach(ASM ${MSVC_ASMS})
-        set(YASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/x86/${ASM})
-        list(APPEND YASM_SRCS ${YASM_SRC})
-        list(APPEND YASM_OBJS ${ASM}.${SUFFIX})
-        add_custom_command(
-            OUTPUT ${ASM}.${SUFFIX}
-            COMMAND ${YASM_EXECUTABLE} ARGS ${YASM_FLAGS} ${YASM_SRC} -o ${ASM}.${SUFFIX}
-            DEPENDS ${YASM_SRC})
-    endforeach()
+
+    if(ARM OR CROSS_COMPILE_ARM)
+        # compile ARM arch asm files here
+        enable_language(ASM)
+        foreach(ASM ${ARM_ASMS})
+            set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/arm/${ASM})
+            list(APPEND ASM_SRCS ${ASM_SRC})
+            list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
+            add_custom_command(
+                OUTPUT ${ASM}.${SUFFIX}
+                COMMAND ${CMAKE_CXX_COMPILER}
+                ARGS ${ARM_ARGS} -c ${ASM_SRC} -o ${ASM}.${SUFFIX}
+                DEPENDS ${ASM_SRC})
+        endforeach()
+    elseif(X86)
+        # compile X86 arch asm files here
+        foreach(ASM ${MSVC_ASMS})
+            set(ASM_SRC ${CMAKE_CURRENT_SOURCE_DIR}/common/x86/${ASM})
+            list(APPEND ASM_SRCS ${ASM_SRC})
+            list(APPEND ASM_OBJS ${ASM}.${SUFFIX})
+            add_custom_command(
+                OUTPUT ${ASM}.${SUFFIX}
+                COMMAND ${YASM_EXECUTABLE} ARGS ${YASM_FLAGS} ${ASM_SRC} -o ${ASM}.${SUFFIX}
+                DEPENDS ${ASM_SRC})
+        endforeach()
+    endif()
 endif()
 
-source_group(ASM FILES ${YASM_SRCS})
-add_library(x265-static STATIC $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${YASM_OBJS} ${YASM_SRCS})
+source_group(ASM FILES ${ASM_SRCS})
+add_library(x265-static STATIC $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${ASM_OBJS} ${ASM_SRCS})
 if(NOT MSVC)
     set_target_properties(x265-static PROPERTIES OUTPUT_NAME x265)
 endif()
@@ -463,7 +501,7 @@
 option(ENABLE_SHARED "Build shared library" ON)
 if(ENABLE_SHARED)
-    add_library(x265-shared SHARED "${PROJECT_BINARY_DIR}/x265.def" ${YASM_OBJS}
+    add_library(x265-shared SHARED "${PROJECT_BINARY_DIR}/x265.def" ${ASM_OBJS}
                 ${X265_RC_FILE} $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common>)
     if(EXTRA_LIB)
         target_link_libraries(x265-shared ${EXTRA_LIB})
@@ -559,7 +597,7 @@
     # Xcode seems unable to link the CLI with libs, so link as one target
     add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT}
                    x265.cpp x265.h x265cli.h x265-extras.h x265-extras.cpp
-                   $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${YASM_OBJS} ${YASM_SRCS})
+                   $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${ASM_OBJS} ${ASM_SRCS})
 else()
     add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT} ${X265_RC_FILE}
                    ${ExportDefs} x265.cpp x265.h x265cli.h x265-extras.h x265-extras.cpp)
@@ -587,3 +625,11 @@
         add_subdirectory(test)
     endif()
 endif()
+
+get_directory_property(hasParent PARENT_DIRECTORY)
+if(hasParent)
+    if(PLATFORM_LIBS)
+        LIST(REMOVE_DUPLICATES PLATFORM_LIBS)
+        set(PLATFORM_LIBS ${PLATFORM_LIBS} PARENT_SCOPE)
+    endif(PLATFORM_LIBS)
+endif(hasParent)
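With ARM now routed through the same ENABLE_ASSEMBLY option as x86, the assembly primitives can be toggled at configure time on ARM builds as well; a sketch, assuming an out-of-tree build directory:

    cmake ../../source                          # ARM detected, assembly ON by default
    cmake -DENABLE_ASSEMBLY=OFF ../../source    # portable C-only build for comparison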
x265_2.0.tar.gz/source/cmake/FindNeon.cmake
Added
@@ -0,0 +1,10 @@
+include(FindPackageHandleStandardArgs)
+
+# Check the version of neon supported by the ARM CPU
+execute_process(COMMAND cat /proc/cpuinfo | grep Features | grep neon
+                OUTPUT_VARIABLE neon_version
+                ERROR_QUIET
+                OUTPUT_STRIP_TRAILING_WHITESPACE)
+if(neon_version)
+    set(CPU_HAS_NEON 1)
+endif()
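Note that execute_process() runs a single child process per COMMAND clause, so the | tokens here are passed to cat as literal arguments rather than forming a shell pipeline; the output variable therefore ends up non-empty whenever /proc/cpuinfo is readable. The check the file intends is, in shell terms, roughly this sketch:

    grep Features /proc/cpuinfo | grep -q neon && echo "NEON supported"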
x265_1.9.tar.gz/source/cmake/version.cmake -> x265_2.0.tar.gz/source/cmake/version.cmake
Changed
@@ -52,39 +52,55 @@
         )
     execute_process(
         COMMAND
-            ${HG_EXECUTABLE} log -r. --template "{node|short}"
+            ${HG_EXECUTABLE} log -r. --template "{node}"
         WORKING_DIRECTORY
             ${CMAKE_CURRENT_SOURCE_DIR}
-        OUTPUT_VARIABLE HG_REVISION_ID
+        OUTPUT_VARIABLE X265_REVISION_ID
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
         )
+    string(SUBSTRING "${X265_REVISION_ID}" 0 12 X265_REVISION_ID)
 
     if(X265_LATEST_TAG MATCHES "^r")
         string(SUBSTRING ${X265_LATEST_TAG} 1 -1 X265_LATEST_TAG)
     endif()
-    if(X265_TAG_DISTANCE STREQUAL "0")
-        set(X265_VERSION "${X265_LATEST_TAG}")
-    else()
-        set(X265_VERSION "${X265_LATEST_TAG}+${X265_TAG_DISTANCE}-${HG_REVISION_ID}")
-    endif()
 elseif(GIT_EXECUTABLE AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.git)
     execute_process(
         COMMAND
-            ${GIT_EXECUTABLE} describe --tags --abbrev=0
+            ${GIT_EXECUTABLE} rev-list --tags --max-count=1
+        WORKING_DIRECTORY
+            ${CMAKE_CURRENT_SOURCE_DIR}
+        OUTPUT_VARIABLE X265_LATEST_TAG_COMMIT
+        ERROR_QUIET
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        )
+    execute_process(
+        COMMAND
+            ${GIT_EXECUTABLE} describe --tags ${X265_LATEST_TAG_COMMIT}
         WORKING_DIRECTORY
             ${CMAKE_CURRENT_SOURCE_DIR}
         OUTPUT_VARIABLE X265_LATEST_TAG
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
         )
-
     execute_process(
         COMMAND
-            ${GIT_EXECUTABLE} describe --tags
+            ${GIT_EXECUTABLE} rev-list ${X265_LATEST_TAG}.. --count --first-parent
         WORKING_DIRECTORY
             ${CMAKE_CURRENT_SOURCE_DIR}
-        OUTPUT_VARIABLE X265_VERSION
+        OUTPUT_VARIABLE X265_TAG_DISTANCE
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
         )
+    execute_process(
+        COMMAND
+            ${GIT_EXECUTABLE} log -1 --format=g%h
+        WORKING_DIRECTORY
+            ${CMAKE_CURRENT_SOURCE_DIR}
+        OUTPUT_VARIABLE X265_REVISION_ID
+        ERROR_QUIET
+        OUTPUT_STRIP_TRAILING_WHITESPACE
+        )
+endif()
+if(X265_TAG_DISTANCE STREQUAL "0")
+    set(X265_VERSION "${X265_LATEST_TAG}")
+else()
+    set(X265_VERSION "${X265_LATEST_TAG}+${X265_TAG_DISTANCE}-${X265_REVISION_ID}")
+endif()
 
 message(STATUS "x265 version ${X265_VERSION}")
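The new git branch mirrors the hg logic step for step; in shell terms the sequence is roughly the following sketch (the example output assumes a hypothetical checkout a few commits past the 2.0 tag):

    tagcommit=$(git rev-list --tags --max-count=1)
    tag=$(git describe --tags $tagcommit)                   # e.g. 2.0
    distance=$(git rev-list $tag.. --count --first-parent)
    rev=$(git log -1 --format=g%h)
    # X265_VERSION = "$tag" when distance is 0, else "$tag+$distance-$rev", e.g. 2.0+7-g960c999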
x265_1.9.tar.gz/source/common/CMakeLists.txt -> x265_2.0.tar.gz/source/common/CMakeLists.txt
Changed
@@ -16,12 +16,14 @@
 if(ENABLE_ASSEMBLY)
     set_source_files_properties(threading.cpp primitives.cpp PROPERTIES COMPILE_FLAGS -DENABLE_ASSEMBLY=1)
     list(APPEND VFLAGS "-DENABLE_ASSEMBLY=1")
+endif(ENABLE_ASSEMBLY)
 
+if(ENABLE_ASSEMBLY AND X86)
     set(SSE3  vec/dct-sse3.cpp)
     set(SSSE3 vec/dct-ssse3.cpp)
     set(SSE41 vec/dct-sse41.cpp)
 
-    if(MSVC AND X86)
+    if(MSVC)
         set(PRIMITIVES ${SSE3} ${SSSE3} ${SSE41})
         set(WARNDISABLE "/wd4100") # unreferenced formal parameter
         if(INTEL_CXX)
@@ -38,7 +40,7 @@
             set_source_files_properties(${SSE3} ${SSSE3} ${SSE41} PROPERTIES COMPILE_FLAGS "${WARNDISABLE} /arch:SSE2")
         endif()
     endif()
-    if(GCC AND X86)
+    if(GCC)
         if(CLANG)
             # llvm intrinsic headers cause shadow warnings
             set(WARNDISABLE "-Wno-shadow -Wno-unused-parameter")
@@ -81,7 +83,21 @@
         set(ASM_PRIMITIVES ${ASM_PRIMITIVES} x86/${SRC})
     endforeach()
     source_group(Assembly FILES ${ASM_PRIMITIVES})
-endif(ENABLE_ASSEMBLY)
+endif(ENABLE_ASSEMBLY AND X86)
+
+if(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
+    set(C_SRCS asm-primitives.cpp pixel.h mc.h ipfilter8.h blockcopy8.h dct8.h loopfilter.h)
+
+    # add ARM assembly/intrinsic files here
+    set(A_SRCS asm.S cpu-a.S mc-a.S sad-a.S pixel-util.S ssd-a.S blockcopy8.S ipfilter8.S dct-a.S)
+    set(VEC_PRIMITIVES)
+
+    set(ARM_ASMS "${A_SRCS}" CACHE INTERNAL "ARM Assembly Sources")
+    foreach(SRC ${C_SRCS})
+        set(ASM_PRIMITIVES ${ASM_PRIMITIVES} arm/${SRC})
+    endforeach()
+    source_group(Assembly FILES ${ASM_PRIMITIVES})
+endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM))
 
 # set_target_properties can't do list expansion
 string(REPLACE ";" " " VERSION_FLAGS "${VFLAGS}")
x265_2.0.tar.gz/source/common/arm/asm-primitives.cpp
Added
@@ -0,0 +1,1022 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Steve Borho <steve@borho.org> + * Praveen Kumar Tiwari <praveen@multicorewareinc.com> + * Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "common.h" +#include "primitives.h" +#include "x265.h" +#include "cpu.h" + +extern "C" { +#include "blockcopy8.h" +#include "pixel.h" +#include "pixel-util.h" +#include "ipfilter8.h" +#include "dct8.h" +} + +namespace X265_NS { +// private x265 namespace + +void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) +{ + if (cpuMask & X265_CPU_NEON) + { + // ssim_4x4x2_core + p.ssim_4x4x2_core = PFX(ssim_4x4x2_core_neon); + + // addAvg + p.pu[LUMA_4x4].addAvg = PFX(addAvg_4x4_neon); + p.pu[LUMA_4x8].addAvg = PFX(addAvg_4x8_neon); + p.pu[LUMA_4x16].addAvg = PFX(addAvg_4x16_neon); + p.pu[LUMA_8x4].addAvg = PFX(addAvg_8x4_neon); + p.pu[LUMA_8x8].addAvg = PFX(addAvg_8x8_neon); + p.pu[LUMA_8x16].addAvg = PFX(addAvg_8x16_neon); + p.pu[LUMA_8x32].addAvg = PFX(addAvg_8x32_neon); + p.pu[LUMA_12x16].addAvg = PFX(addAvg_12x16_neon); + p.pu[LUMA_16x4].addAvg = PFX(addAvg_16x4_neon); + p.pu[LUMA_16x8].addAvg = PFX(addAvg_16x8_neon); + p.pu[LUMA_16x12].addAvg = PFX(addAvg_16x12_neon); + p.pu[LUMA_16x16].addAvg = PFX(addAvg_16x16_neon); + p.pu[LUMA_16x32].addAvg = PFX(addAvg_16x32_neon); + p.pu[LUMA_16x64].addAvg = PFX(addAvg_16x64_neon); + p.pu[LUMA_24x32].addAvg = PFX(addAvg_24x32_neon); + p.pu[LUMA_32x8].addAvg = PFX(addAvg_32x8_neon); + p.pu[LUMA_32x16].addAvg = PFX(addAvg_32x16_neon); + p.pu[LUMA_32x24].addAvg = PFX(addAvg_32x24_neon); + p.pu[LUMA_32x32].addAvg = PFX(addAvg_32x32_neon); + p.pu[LUMA_32x64].addAvg = PFX(addAvg_32x64_neon); + p.pu[LUMA_48x64].addAvg = PFX(addAvg_48x64_neon); + p.pu[LUMA_64x16].addAvg = PFX(addAvg_64x16_neon); + p.pu[LUMA_64x32].addAvg = PFX(addAvg_64x32_neon); + p.pu[LUMA_64x48].addAvg = PFX(addAvg_64x48_neon); + p.pu[LUMA_64x64].addAvg = PFX(addAvg_64x64_neon); + + // chroma addAvg + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].addAvg = PFX(addAvg_4x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].addAvg = PFX(addAvg_4x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].addAvg = PFX(addAvg_4x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].addAvg = PFX(addAvg_4x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].addAvg = PFX(addAvg_6x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg = PFX(addAvg_8x2_neon); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg = PFX(addAvg_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg = PFX(addAvg_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = PFX(addAvg_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = PFX(addAvg_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = PFX(addAvg_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = PFX(addAvg_12x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = PFX(addAvg_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = PFX(addAvg_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = PFX(addAvg_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = PFX(addAvg_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = PFX(addAvg_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].addAvg = PFX(addAvg_24x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = PFX(addAvg_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = PFX(addAvg_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = PFX(addAvg_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = PFX(addAvg_32x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].addAvg = PFX(addAvg_4x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].addAvg = PFX(addAvg_4x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].addAvg = PFX(addAvg_4x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].addAvg = PFX(addAvg_6x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = PFX(addAvg_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = PFX(addAvg_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = PFX(addAvg_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = PFX(addAvg_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = PFX(addAvg_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = PFX(addAvg_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = PFX(addAvg_12x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = PFX(addAvg_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = PFX(addAvg_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = PFX(addAvg_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = PFX(addAvg_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = PFX(addAvg_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = PFX(addAvg_24x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = PFX(addAvg_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = PFX(addAvg_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = PFX(addAvg_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = PFX(addAvg_32x64_neon); + + // quant + p.quant = PFX(quant_neon); + p.nquant = PFX(nquant_neon); + + // dequant_scaling + p.dequant_scaling = PFX(dequant_scaling_neon); + p.dequant_normal = PFX(dequant_normal_neon); + + // luma satd + p.pu[LUMA_4x4].satd = PFX(pixel_satd_4x4_neon); + p.pu[LUMA_4x8].satd = PFX(pixel_satd_4x8_neon); + p.pu[LUMA_4x16].satd = PFX(pixel_satd_4x16_neon); + p.pu[LUMA_8x4].satd = PFX(pixel_satd_8x4_neon); + p.pu[LUMA_8x8].satd = PFX(pixel_satd_8x8_neon); + p.pu[LUMA_8x16].satd = PFX(pixel_satd_8x16_neon); + p.pu[LUMA_8x32].satd = PFX(pixel_satd_8x32_neon); + p.pu[LUMA_12x16].satd = PFX(pixel_satd_12x16_neon); + p.pu[LUMA_16x4].satd = 
PFX(pixel_satd_16x4_neon); + p.pu[LUMA_16x8].satd = PFX(pixel_satd_16x8_neon); + p.pu[LUMA_16x16].satd = PFX(pixel_satd_16x16_neon); + p.pu[LUMA_16x32].satd = PFX(pixel_satd_16x32_neon); + p.pu[LUMA_16x64].satd = PFX(pixel_satd_16x64_neon); + p.pu[LUMA_24x32].satd = PFX(pixel_satd_24x32_neon); + p.pu[LUMA_32x8].satd = PFX(pixel_satd_32x8_neon); + p.pu[LUMA_32x16].satd = PFX(pixel_satd_32x16_neon); + p.pu[LUMA_32x24].satd = PFX(pixel_satd_32x24_neon); + p.pu[LUMA_32x32].satd = PFX(pixel_satd_32x32_neon); + p.pu[LUMA_32x64].satd = PFX(pixel_satd_32x64_neon); + p.pu[LUMA_48x64].satd = PFX(pixel_satd_48x64_neon); + p.pu[LUMA_64x16].satd = PFX(pixel_satd_64x16_neon); + p.pu[LUMA_64x32].satd = PFX(pixel_satd_64x32_neon); + p.pu[LUMA_64x48].satd = PFX(pixel_satd_64x48_neon); + p.pu[LUMA_64x64].satd = PFX(pixel_satd_64x64_neon); + + // chroma satd + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = PFX(pixel_satd_4x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = PFX(pixel_satd_4x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = PFX(pixel_satd_4x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = PFX(pixel_satd_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = PFX(pixel_satd_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = PFX(pixel_satd_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = PFX(pixel_satd_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = PFX(pixel_satd_12x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd = PFX(pixel_satd_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = PFX(pixel_satd_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = PFX(pixel_satd_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = PFX(pixel_satd_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = PFX(pixel_satd_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = PFX(pixel_satd_24x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = PFX(pixel_satd_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = PFX(pixel_satd_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = PFX(pixel_satd_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = PFX(pixel_satd_32x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = PFX(pixel_satd_4x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = PFX(pixel_satd_4x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = PFX(pixel_satd_4x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = PFX(pixel_satd_4x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = PFX(pixel_satd_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = PFX(pixel_satd_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = PFX(pixel_satd_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = PFX(pixel_satd_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = PFX(pixel_satd_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = PFX(pixel_satd_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = PFX(pixel_satd_12x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = PFX(pixel_satd_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = PFX(pixel_satd_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = PFX(pixel_satd_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = PFX(pixel_satd_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd 
= PFX(pixel_satd_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = PFX(pixel_satd_24x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = PFX(pixel_satd_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = PFX(pixel_satd_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = PFX(pixel_satd_32x64_neon); + + // chroma_hpp + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hpp = PFX(interp_4tap_horiz_pp_4x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hpp = PFX(interp_4tap_horiz_pp_4x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hpp = PFX(interp_4tap_horiz_pp_4x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hpp = PFX(interp_4tap_horiz_pp_4x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hpp = PFX(interp_4tap_horiz_pp_8x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hpp = PFX(interp_4tap_horiz_pp_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_hpp = PFX(interp_4tap_horiz_pp_12x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hpp = PFX(interp_4tap_horiz_pp_4x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hpp = PFX(interp_4tap_horiz_pp_4x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hpp = PFX(interp_4tap_horiz_pp_4x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_hpp = PFX(interp_4tap_horiz_pp_4x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = PFX(interp_4tap_horiz_pp_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = PFX(interp_4tap_horiz_pp_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = PFX(interp_4tap_horiz_pp_12x32_neon); + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = PFX(interp_4tap_horiz_pp_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = PFX(interp_4tap_horiz_pp_24x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = PFX(interp_4tap_horiz_pp_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_neon); + + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hpp = PFX(interp_4tap_horiz_pp_4x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hpp = PFX(interp_4tap_horiz_pp_4x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hpp = PFX(interp_4tap_horiz_pp_4x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = PFX(interp_4tap_horiz_pp_12x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = PFX(interp_4tap_horiz_pp_48x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = PFX(interp_4tap_horiz_pp_64x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = PFX(interp_4tap_horiz_pp_64x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = PFX(interp_4tap_horiz_pp_64x48_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = PFX(interp_4tap_horiz_pp_64x64_neon); + + // chroma_hps + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hps = PFX(interp_4tap_horiz_ps_4x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hps = PFX(interp_4tap_horiz_ps_4x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hps = PFX(interp_4tap_horiz_ps_4x8_neon); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hps = PFX(interp_4tap_horiz_ps_4x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hps = PFX(interp_4tap_horiz_ps_8x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hps = PFX(interp_4tap_horiz_ps_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_hps = PFX(interp_4tap_horiz_ps_12x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = PFX(interp_4tap_horiz_ps_4x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = PFX(interp_4tap_horiz_ps_4x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hps = PFX(interp_4tap_horiz_ps_4x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_hps = PFX(interp_4tap_horiz_ps_4x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = PFX(interp_4tap_horiz_ps_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = PFX(interp_4tap_horiz_ps_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hps = PFX(interp_4tap_horiz_ps_12x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = PFX(interp_4tap_horiz_ps_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = PFX(interp_4tap_horiz_ps_24x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = 
PFX(interp_4tap_horiz_ps_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = PFX(interp_4tap_horiz_ps_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_neon); + + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hps = PFX(interp_4tap_horiz_ps_4x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hps = PFX(interp_4tap_horiz_ps_4x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hps = PFX(interp_4tap_horiz_ps_4x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hps = PFX(interp_4tap_horiz_ps_12x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = PFX(interp_4tap_horiz_ps_48x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = PFX(interp_4tap_horiz_ps_64x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = PFX(interp_4tap_horiz_ps_64x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = PFX(interp_4tap_horiz_ps_64x48_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = PFX(interp_4tap_horiz_ps_64x64_neon); + + // luma_hpp + p.pu[LUMA_4x4].luma_hpp = PFX(interp_horiz_pp_4x4_neon); + p.pu[LUMA_4x8].luma_hpp = PFX(interp_horiz_pp_4x8_neon); + p.pu[LUMA_4x16].luma_hpp = PFX(interp_horiz_pp_4x16_neon); + p.pu[LUMA_8x4].luma_hpp = PFX(interp_horiz_pp_8x4_neon); + p.pu[LUMA_8x8].luma_hpp = PFX(interp_horiz_pp_8x8_neon); + p.pu[LUMA_8x16].luma_hpp = PFX(interp_horiz_pp_8x16_neon); + p.pu[LUMA_8x32].luma_hpp = PFX(interp_horiz_pp_8x32_neon); + p.pu[LUMA_12x16].luma_hpp = PFX(interp_horiz_pp_12x16_neon); + p.pu[LUMA_16x4].luma_hpp = PFX(interp_horiz_pp_16x4_neon); + p.pu[LUMA_16x8].luma_hpp = PFX(interp_horiz_pp_16x8_neon); + p.pu[LUMA_16x12].luma_hpp = PFX(interp_horiz_pp_16x12_neon); + p.pu[LUMA_16x16].luma_hpp = PFX(interp_horiz_pp_16x16_neon); + p.pu[LUMA_16x32].luma_hpp = PFX(interp_horiz_pp_16x32_neon); + p.pu[LUMA_16x64].luma_hpp = PFX(interp_horiz_pp_16x64_neon); + p.pu[LUMA_24x32].luma_hpp = PFX(interp_horiz_pp_24x32_neon); + p.pu[LUMA_32x8].luma_hpp = PFX(interp_horiz_pp_32x8_neon); + p.pu[LUMA_32x16].luma_hpp = 
PFX(interp_horiz_pp_32x16_neon); + p.pu[LUMA_32x24].luma_hpp = PFX(interp_horiz_pp_32x24_neon); + p.pu[LUMA_32x32].luma_hpp = PFX(interp_horiz_pp_32x32_neon); + p.pu[LUMA_32x64].luma_hpp = PFX(interp_horiz_pp_32x64_neon); + p.pu[LUMA_48x64].luma_hpp = PFX(interp_horiz_pp_48x64_neon); + p.pu[LUMA_64x16].luma_hpp = PFX(interp_horiz_pp_64x16_neon); + p.pu[LUMA_64x32].luma_hpp = PFX(interp_horiz_pp_64x32_neon); + p.pu[LUMA_64x48].luma_hpp = PFX(interp_horiz_pp_64x48_neon); + p.pu[LUMA_64x64].luma_hpp = PFX(interp_horiz_pp_64x64_neon); + + // luma_hps + p.pu[LUMA_4x4].luma_hps = PFX(interp_horiz_ps_4x4_neon); + p.pu[LUMA_4x8].luma_hps = PFX(interp_horiz_ps_4x8_neon); + p.pu[LUMA_4x16].luma_hps = PFX(interp_horiz_ps_4x16_neon); + p.pu[LUMA_8x4].luma_hps = PFX(interp_horiz_ps_8x4_neon); + p.pu[LUMA_8x8].luma_hps = PFX(interp_horiz_ps_8x8_neon); + p.pu[LUMA_8x16].luma_hps = PFX(interp_horiz_ps_8x16_neon); + p.pu[LUMA_8x32].luma_hps = PFX(interp_horiz_ps_8x32_neon); + p.pu[LUMA_12x16].luma_hps = PFX(interp_horiz_ps_12x16_neon); + p.pu[LUMA_16x4].luma_hps = PFX(interp_horiz_ps_16x4_neon); + p.pu[LUMA_16x8].luma_hps = PFX(interp_horiz_ps_16x8_neon); + p.pu[LUMA_16x12].luma_hps = PFX(interp_horiz_ps_16x12_neon); + p.pu[LUMA_16x16].luma_hps = PFX(interp_horiz_ps_16x16_neon); + p.pu[LUMA_16x32].luma_hps = PFX(interp_horiz_ps_16x32_neon); + p.pu[LUMA_16x64].luma_hps = PFX(interp_horiz_ps_16x64_neon); + p.pu[LUMA_24x32].luma_hps = PFX(interp_horiz_ps_24x32_neon); + p.pu[LUMA_32x8].luma_hps = PFX(interp_horiz_ps_32x8_neon); + p.pu[LUMA_32x16].luma_hps = PFX(interp_horiz_ps_32x16_neon); + p.pu[LUMA_32x24].luma_hps = PFX(interp_horiz_ps_32x24_neon); + p.pu[LUMA_32x32].luma_hps = PFX(interp_horiz_ps_32x32_neon); + p.pu[LUMA_32x64].luma_hps = PFX(interp_horiz_ps_32x64_neon); + p.pu[LUMA_48x64].luma_hps = PFX(interp_horiz_ps_48x64_neon); + p.pu[LUMA_64x16].luma_hps = PFX(interp_horiz_ps_64x16_neon); + p.pu[LUMA_64x32].luma_hps = PFX(interp_horiz_ps_64x32_neon); + p.pu[LUMA_64x48].luma_hps = PFX(interp_horiz_ps_64x48_neon); + p.pu[LUMA_64x64].luma_hps = PFX(interp_horiz_ps_64x64_neon); + + // count nonzero + p.cu[BLOCK_4x4].count_nonzero = PFX(count_nonzero_4_neon); + p.cu[BLOCK_8x8].count_nonzero = PFX(count_nonzero_8_neon); + p.cu[BLOCK_16x16].count_nonzero = PFX(count_nonzero_16_neon); + p.cu[BLOCK_32x32].count_nonzero = PFX(count_nonzero_32_neon); + + //scale2D_64to32 + p.scale2D_64to32 = PFX(scale2D_64to32_neon); + + // scale1D_128to64 + p.scale1D_128to64 = PFX(scale1D_128to64_neon); + + // copy_count + p.cu[BLOCK_4x4].copy_cnt = PFX(copy_cnt_4_neon); + p.cu[BLOCK_8x8].copy_cnt = PFX(copy_cnt_8_neon); + p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_neon); + p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_neon); + + // filterPixelToShort + p.pu[LUMA_4x4].convert_p2s = PFX(filterPixelToShort_4x4_neon); + p.pu[LUMA_4x8].convert_p2s = PFX(filterPixelToShort_4x8_neon); + p.pu[LUMA_4x16].convert_p2s = PFX(filterPixelToShort_4x16_neon); + p.pu[LUMA_8x4].convert_p2s = PFX(filterPixelToShort_8x4_neon); + p.pu[LUMA_8x8].convert_p2s = PFX(filterPixelToShort_8x8_neon); + p.pu[LUMA_8x16].convert_p2s = PFX(filterPixelToShort_8x16_neon); + p.pu[LUMA_8x32].convert_p2s = PFX(filterPixelToShort_8x32_neon); + p.pu[LUMA_12x16].convert_p2s = PFX(filterPixelToShort_12x16_neon); + p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_neon); + p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_neon); + p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_neon); + p.pu[LUMA_16x16].convert_p2s = 
PFX(filterPixelToShort_16x16_neon); + p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_neon); + p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_neon); + p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_neon); + p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_neon); + p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_neon); + p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_neon); + p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_neon); + p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_neon); + p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_neon); + p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_neon); + p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_neon); + p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_neon); + p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_neon); + + // Block_fill + p.cu[BLOCK_4x4].blockfill_s = PFX(blockfill_s_4x4_neon); + p.cu[BLOCK_8x8].blockfill_s = PFX(blockfill_s_8x8_neon); + p.cu[BLOCK_16x16].blockfill_s = PFX(blockfill_s_16x16_neon); + p.cu[BLOCK_32x32].blockfill_s = PFX(blockfill_s_32x32_neon); + + // Blockcopy_ss + p.cu[BLOCK_4x4].copy_ss = PFX(blockcopy_ss_4x4_neon); + p.cu[BLOCK_8x8].copy_ss = PFX(blockcopy_ss_8x8_neon); + p.cu[BLOCK_16x16].copy_ss = PFX(blockcopy_ss_16x16_neon); + p.cu[BLOCK_32x32].copy_ss = PFX(blockcopy_ss_32x32_neon); + p.cu[BLOCK_64x64].copy_ss = PFX(blockcopy_ss_64x64_neon); + + // Blockcopy_ps + p.cu[BLOCK_4x4].copy_ps = PFX(blockcopy_ps_4x4_neon); + p.cu[BLOCK_8x8].copy_ps = PFX(blockcopy_ps_8x8_neon); + p.cu[BLOCK_16x16].copy_ps = PFX(blockcopy_ps_16x16_neon); + p.cu[BLOCK_32x32].copy_ps = PFX(blockcopy_ps_32x32_neon); + p.cu[BLOCK_64x64].copy_ps = PFX(blockcopy_ps_64x64_neon); + + // Blockcopy_sp + p.cu[BLOCK_4x4].copy_sp = PFX(blockcopy_sp_4x4_neon); + p.cu[BLOCK_8x8].copy_sp = PFX(blockcopy_sp_8x8_neon); + p.cu[BLOCK_16x16].copy_sp = PFX(blockcopy_sp_16x16_neon); + p.cu[BLOCK_32x32].copy_sp = PFX(blockcopy_sp_32x32_neon); + p.cu[BLOCK_64x64].copy_sp = PFX(blockcopy_sp_64x64_neon); + + // chroma blockcopy_ss + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].copy_ss = PFX(blockcopy_ss_4x4_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].copy_ss = PFX(blockcopy_ss_8x8_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ss = PFX(blockcopy_ss_16x16_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = PFX(blockcopy_ss_32x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].copy_ss = PFX(blockcopy_ss_4x8_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].copy_ss = PFX(blockcopy_ss_8x16_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ss = PFX(blockcopy_ss_16x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = PFX(blockcopy_ss_32x64_neon); + + // chroma blockcopy_ps + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].copy_ps = PFX(blockcopy_ps_4x4_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].copy_ps = PFX(blockcopy_ps_8x8_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ps = PFX(blockcopy_ps_16x16_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = PFX(blockcopy_ps_32x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].copy_ps = PFX(blockcopy_ps_4x8_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].copy_ps = PFX(blockcopy_ps_8x16_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ps = PFX(blockcopy_ps_16x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = PFX(blockcopy_ps_32x64_neon); + + // chroma 
blockcopy_sp + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].copy_sp = PFX(blockcopy_sp_4x4_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].copy_sp = PFX(blockcopy_sp_8x8_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = PFX(blockcopy_sp_16x16_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = PFX(blockcopy_sp_32x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].copy_sp = PFX(blockcopy_sp_4x8_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].copy_sp = PFX(blockcopy_sp_8x16_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = PFX(blockcopy_sp_16x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = PFX(blockcopy_sp_32x64_neon); + + // pixel_add_ps + p.cu[BLOCK_4x4].add_ps = PFX(pixel_add_ps_4x4_neon); + p.cu[BLOCK_8x8].add_ps = PFX(pixel_add_ps_8x8_neon); + p.cu[BLOCK_16x16].add_ps = PFX(pixel_add_ps_16x16_neon); + p.cu[BLOCK_32x32].add_ps = PFX(pixel_add_ps_32x32_neon); + p.cu[BLOCK_64x64].add_ps = PFX(pixel_add_ps_64x64_neon); + + // chroma add_ps + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].add_ps = PFX(pixel_add_ps_4x4_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].add_ps = PFX(pixel_add_ps_8x8_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = PFX(pixel_add_ps_16x16_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = PFX(pixel_add_ps_32x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].add_ps = PFX(pixel_add_ps_4x8_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].add_ps = PFX(pixel_add_ps_8x16_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = PFX(pixel_add_ps_16x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = PFX(pixel_add_ps_32x64_neon); + + // cpy2Dto1D_shr + p.cu[BLOCK_4x4].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_4x4_neon); + p.cu[BLOCK_8x8].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_8x8_neon); + p.cu[BLOCK_16x16].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16x16_neon); + p.cu[BLOCK_32x32].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32x32_neon); + + // ssd_s + p.cu[BLOCK_4x4].ssd_s = PFX(pixel_ssd_s_4x4_neon); + p.cu[BLOCK_8x8].ssd_s = PFX(pixel_ssd_s_8x8_neon); + p.cu[BLOCK_16x16].ssd_s = PFX(pixel_ssd_s_16x16_neon); + p.cu[BLOCK_32x32].ssd_s = PFX(pixel_ssd_s_32x32_neon); + + // sse_ss + p.cu[BLOCK_4x4].sse_ss = PFX(pixel_sse_ss_4x4_neon); + p.cu[BLOCK_8x8].sse_ss = PFX(pixel_sse_ss_8x8_neon); + p.cu[BLOCK_16x16].sse_ss = PFX(pixel_sse_ss_16x16_neon); + p.cu[BLOCK_32x32].sse_ss = PFX(pixel_sse_ss_32x32_neon); + p.cu[BLOCK_64x64].sse_ss = PFX(pixel_sse_ss_64x64_neon); + + // pixel_sub_ps + p.cu[BLOCK_4x4].sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.cu[BLOCK_8x8].sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.cu[BLOCK_16x16].sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_neon); + p.cu[BLOCK_64x64].sub_ps = PFX(pixel_sub_ps_64x64_neon); + + // chroma sub_ps + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sub_ps = PFX(pixel_sub_ps_4x4_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sub_ps = PFX(pixel_sub_ps_8x8_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = PFX(pixel_sub_ps_16x16_neon); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = PFX(pixel_sub_ps_32x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sub_ps = PFX(pixel_sub_ps_4x8_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sub_ps = PFX(pixel_sub_ps_8x16_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = PFX(pixel_sub_ps_16x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = PFX(pixel_sub_ps_32x64_neon); + + // calc_Residual + p.cu[BLOCK_4x4].calcresidual = 
PFX(getResidual4_neon); + p.cu[BLOCK_8x8].calcresidual = PFX(getResidual8_neon); + p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_neon); + p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_neon); + + // sse_pp + p.cu[BLOCK_4x4].sse_pp = PFX(pixel_sse_pp_4x4_neon); + p.cu[BLOCK_8x8].sse_pp = PFX(pixel_sse_pp_8x8_neon); + p.cu[BLOCK_16x16].sse_pp = PFX(pixel_sse_pp_16x16_neon); + p.cu[BLOCK_32x32].sse_pp = PFX(pixel_sse_pp_32x32_neon); + p.cu[BLOCK_64x64].sse_pp = PFX(pixel_sse_pp_64x64_neon); + + // pixel_var + p.cu[BLOCK_8x8].var = PFX(pixel_var_8x8_neon); + p.cu[BLOCK_16x16].var = PFX(pixel_var_16x16_neon); + p.cu[BLOCK_32x32].var = PFX(pixel_var_32x32_neon); + p.cu[BLOCK_64x64].var = PFX(pixel_var_64x64_neon); + + // blockcopy + p.pu[LUMA_16x16].copy_pp = PFX(blockcopy_pp_16x16_neon); + p.pu[LUMA_8x4].copy_pp = PFX(blockcopy_pp_8x4_neon); + p.pu[LUMA_8x8].copy_pp = PFX(blockcopy_pp_8x8_neon); + p.pu[LUMA_8x16].copy_pp = PFX(blockcopy_pp_8x16_neon); + p.pu[LUMA_8x32].copy_pp = PFX(blockcopy_pp_8x32_neon); + p.pu[LUMA_12x16].copy_pp = PFX(blockcopy_pp_12x16_neon); + p.pu[LUMA_4x4].copy_pp = PFX(blockcopy_pp_4x4_neon); + p.pu[LUMA_4x8].copy_pp = PFX(blockcopy_pp_4x8_neon); + p.pu[LUMA_4x16].copy_pp = PFX(blockcopy_pp_4x16_neon); + p.pu[LUMA_16x4].copy_pp = PFX(blockcopy_pp_16x4_neon); + p.pu[LUMA_16x8].copy_pp = PFX(blockcopy_pp_16x8_neon); + p.pu[LUMA_16x12].copy_pp = PFX(blockcopy_pp_16x12_neon); + p.pu[LUMA_16x32].copy_pp = PFX(blockcopy_pp_16x32_neon); + p.pu[LUMA_16x64].copy_pp = PFX(blockcopy_pp_16x64_neon); + p.pu[LUMA_24x32].copy_pp = PFX(blockcopy_pp_24x32_neon); + p.pu[LUMA_32x8].copy_pp = PFX(blockcopy_pp_32x8_neon); + p.pu[LUMA_32x16].copy_pp = PFX(blockcopy_pp_32x16_neon); + p.pu[LUMA_32x24].copy_pp = PFX(blockcopy_pp_32x24_neon); + p.pu[LUMA_32x32].copy_pp = PFX(blockcopy_pp_32x32_neon); + p.pu[LUMA_32x64].copy_pp = PFX(blockcopy_pp_32x64_neon); + p.pu[LUMA_48x64].copy_pp = PFX(blockcopy_pp_48x64_neon); + p.pu[LUMA_64x16].copy_pp = PFX(blockcopy_pp_64x16_neon); + p.pu[LUMA_64x32].copy_pp = PFX(blockcopy_pp_64x32_neon); + p.pu[LUMA_64x48].copy_pp = PFX(blockcopy_pp_64x48_neon); + p.pu[LUMA_64x64].copy_pp = PFX(blockcopy_pp_64x64_neon); + + // chroma blockcopy + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].copy_pp = PFX(blockcopy_pp_2x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].copy_pp = PFX(blockcopy_pp_2x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].copy_pp = PFX(blockcopy_pp_4x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].copy_pp = PFX(blockcopy_pp_4x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].copy_pp = PFX(blockcopy_pp_4x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].copy_pp = PFX(blockcopy_pp_4x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].copy_pp = PFX(blockcopy_pp_6x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].copy_pp = PFX(blockcopy_pp_8x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].copy_pp = PFX(blockcopy_pp_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].copy_pp = PFX(blockcopy_pp_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].copy_pp = PFX(blockcopy_pp_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].copy_pp = PFX(blockcopy_pp_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].copy_pp = PFX(blockcopy_pp_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].copy_pp = PFX(blockcopy_pp_12x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].copy_pp = PFX(blockcopy_pp_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].copy_pp = 
PFX(blockcopy_pp_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].copy_pp = PFX(blockcopy_pp_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].copy_pp = PFX(blockcopy_pp_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].copy_pp = PFX(blockcopy_pp_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].copy_pp = PFX(blockcopy_pp_24x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = PFX(blockcopy_pp_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = PFX(blockcopy_pp_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = PFX(blockcopy_pp_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = PFX(blockcopy_pp_32x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].copy_pp = PFX(blockcopy_pp_2x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].copy_pp = PFX(blockcopy_pp_4x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].copy_pp = PFX(blockcopy_pp_4x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].copy_pp = PFX(blockcopy_pp_4x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].copy_pp = PFX(blockcopy_pp_4x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].copy_pp = PFX(blockcopy_pp_6x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].copy_pp = PFX(blockcopy_pp_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].copy_pp = PFX(blockcopy_pp_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].copy_pp = PFX(blockcopy_pp_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].copy_pp = PFX(blockcopy_pp_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].copy_pp = PFX(blockcopy_pp_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].copy_pp = PFX(blockcopy_pp_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].copy_pp = PFX(blockcopy_pp_12x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].copy_pp = PFX(blockcopy_pp_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].copy_pp = PFX(blockcopy_pp_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].copy_pp = PFX(blockcopy_pp_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].copy_pp = PFX(blockcopy_pp_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].copy_pp = PFX(blockcopy_pp_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].copy_pp = PFX(blockcopy_pp_24x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = PFX(blockcopy_pp_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = PFX(blockcopy_pp_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = PFX(blockcopy_pp_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = PFX(blockcopy_pp_32x64_neon); + + // sad + p.pu[LUMA_8x4].sad = PFX(pixel_sad_8x4_neon); + p.pu[LUMA_8x8].sad = PFX(pixel_sad_8x8_neon); + p.pu[LUMA_8x16].sad = PFX(pixel_sad_8x16_neon); + p.pu[LUMA_8x32].sad = PFX(pixel_sad_8x32_neon); + p.pu[LUMA_16x4].sad = PFX(pixel_sad_16x4_neon); + p.pu[LUMA_16x8].sad = PFX(pixel_sad_16x8_neon); + p.pu[LUMA_16x16].sad = PFX(pixel_sad_16x16_neon); + p.pu[LUMA_16x12].sad = PFX(pixel_sad_16x12_neon); + p.pu[LUMA_16x32].sad = PFX(pixel_sad_16x32_neon); + p.pu[LUMA_16x64].sad = PFX(pixel_sad_16x64_neon); + p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_neon); + p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_neon); + p.pu[LUMA_32x32].sad = PFX(pixel_sad_32x32_neon); + p.pu[LUMA_32x64].sad = PFX(pixel_sad_32x64_neon); + p.pu[LUMA_32x24].sad = PFX(pixel_sad_32x24_neon); + p.pu[LUMA_64x16].sad = PFX(pixel_sad_64x16_neon); + p.pu[LUMA_64x32].sad = 
PFX(pixel_sad_64x32_neon); + p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_neon); + p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_neon); + p.pu[LUMA_12x16].sad = PFX(pixel_sad_12x16_neon); + p.pu[LUMA_24x32].sad = PFX(pixel_sad_24x32_neon); + p.pu[LUMA_48x64].sad = PFX(pixel_sad_48x64_neon); + + // sad_x3 + p.pu[LUMA_4x4].sad_x3 = PFX(sad_x3_4x4_neon); + p.pu[LUMA_4x8].sad_x3 = PFX(sad_x3_4x8_neon); + p.pu[LUMA_4x16].sad_x3 = PFX(sad_x3_4x16_neon); + p.pu[LUMA_8x4].sad_x3 = PFX(sad_x3_8x4_neon); + p.pu[LUMA_8x8].sad_x3 = PFX(sad_x3_8x8_neon); + p.pu[LUMA_8x16].sad_x3 = PFX(sad_x3_8x16_neon); + p.pu[LUMA_8x32].sad_x3 = PFX(sad_x3_8x32_neon); + p.pu[LUMA_12x16].sad_x3 = PFX(sad_x3_12x16_neon); + p.pu[LUMA_16x4].sad_x3 = PFX(sad_x3_16x4_neon); + p.pu[LUMA_16x8].sad_x3 = PFX(sad_x3_16x8_neon); + p.pu[LUMA_16x12].sad_x3 = PFX(sad_x3_16x12_neon); + p.pu[LUMA_16x16].sad_x3 = PFX(sad_x3_16x16_neon); + p.pu[LUMA_16x32].sad_x3 = PFX(sad_x3_16x32_neon); + p.pu[LUMA_16x64].sad_x3 = PFX(sad_x3_16x64_neon); + p.pu[LUMA_24x32].sad_x3 = PFX(sad_x3_24x32_neon); + p.pu[LUMA_32x8].sad_x3 = PFX(sad_x3_32x8_neon); + p.pu[LUMA_32x16].sad_x3 = PFX(sad_x3_32x16_neon); + p.pu[LUMA_32x24].sad_x3 = PFX(sad_x3_32x24_neon); + p.pu[LUMA_32x32].sad_x3 = PFX(sad_x3_32x32_neon); + p.pu[LUMA_32x64].sad_x3 = PFX(sad_x3_32x64_neon); + p.pu[LUMA_48x64].sad_x3 = PFX(sad_x3_48x64_neon); + p.pu[LUMA_64x16].sad_x3 = PFX(sad_x3_64x16_neon); + p.pu[LUMA_64x32].sad_x3 = PFX(sad_x3_64x32_neon); + p.pu[LUMA_64x48].sad_x3 = PFX(sad_x3_64x48_neon); + p.pu[LUMA_64x64].sad_x3 = PFX(sad_x3_64x64_neon); + + // sad_x4 + p.pu[LUMA_4x4].sad_x4 = PFX(sad_x4_4x4_neon); + p.pu[LUMA_4x8].sad_x4 = PFX(sad_x4_4x8_neon); + p.pu[LUMA_4x16].sad_x4 = PFX(sad_x4_4x16_neon); + p.pu[LUMA_8x4].sad_x4 = PFX(sad_x4_8x4_neon); + p.pu[LUMA_8x8].sad_x4 = PFX(sad_x4_8x8_neon); + p.pu[LUMA_8x16].sad_x4 = PFX(sad_x4_8x16_neon); + p.pu[LUMA_8x32].sad_x4 = PFX(sad_x4_8x32_neon); + p.pu[LUMA_12x16].sad_x4 = PFX(sad_x4_12x16_neon); + p.pu[LUMA_16x4].sad_x4 = PFX(sad_x4_16x4_neon); + p.pu[LUMA_16x8].sad_x4 = PFX(sad_x4_16x8_neon); + p.pu[LUMA_16x12].sad_x4 = PFX(sad_x4_16x12_neon); + p.pu[LUMA_16x16].sad_x4 = PFX(sad_x4_16x16_neon); + p.pu[LUMA_16x32].sad_x4 = PFX(sad_x4_16x32_neon); + p.pu[LUMA_16x64].sad_x4 = PFX(sad_x4_16x64_neon); + p.pu[LUMA_24x32].sad_x4 = PFX(sad_x4_24x32_neon); + p.pu[LUMA_32x8].sad_x4 = PFX(sad_x4_32x8_neon); + p.pu[LUMA_32x16].sad_x4 = PFX(sad_x4_32x16_neon); + p.pu[LUMA_32x24].sad_x4 = PFX(sad_x4_32x24_neon); + p.pu[LUMA_32x32].sad_x4 = PFX(sad_x4_32x32_neon); + p.pu[LUMA_32x64].sad_x4 = PFX(sad_x4_32x64_neon); + p.pu[LUMA_48x64].sad_x4 = PFX(sad_x4_48x64_neon); + p.pu[LUMA_64x16].sad_x4 = PFX(sad_x4_64x16_neon); + p.pu[LUMA_64x32].sad_x4 = PFX(sad_x4_64x32_neon); + p.pu[LUMA_64x48].sad_x4 = PFX(sad_x4_64x48_neon); + p.pu[LUMA_64x64].sad_x4 = PFX(sad_x4_64x64_neon); + + // pixel_avg_pp + p.pu[LUMA_4x4].pixelavg_pp = PFX(pixel_avg_pp_4x4_neon); + p.pu[LUMA_4x8].pixelavg_pp = PFX(pixel_avg_pp_4x8_neon); + p.pu[LUMA_4x16].pixelavg_pp = PFX(pixel_avg_pp_4x16_neon); + p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_pp_8x4_neon); + p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_pp_8x8_neon); + p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_pp_8x16_neon); + p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_pp_8x32_neon); + p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_pp_12x16_neon); + p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_pp_16x4_neon); + p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_pp_16x8_neon); + p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_pp_16x12_neon); + 
p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_pp_16x16_neon); + p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_pp_16x32_neon); + p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_pp_16x64_neon); + p.pu[LUMA_24x32].pixelavg_pp = PFX(pixel_avg_pp_24x32_neon); + p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_pp_32x8_neon); + p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_pp_32x16_neon); + p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_pp_32x24_neon); + p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_pp_32x32_neon); + p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_pp_32x64_neon); + p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_pp_48x64_neon); + p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_pp_64x16_neon); + p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_pp_64x32_neon); + p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_pp_64x48_neon); + p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_pp_64x64_neon); + + // planecopy + p.planecopy_cp = PFX(pixel_planecopy_cp_neon); + + p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_neon); + p.cu[BLOCK_8x8].sa8d = PFX(pixel_sa8d_8x8_neon); + p.cu[BLOCK_16x16].sa8d = PFX(pixel_sa8d_16x16_neon); + p.cu[BLOCK_32x32].sa8d = PFX(pixel_sa8d_32x32_neon); + p.cu[BLOCK_64x64].sa8d = PFX(pixel_sa8d_64x64_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sa8d = PFX(pixel_sa8d_8x16_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sa8d = PFX(pixel_sa8d_16x32_neon); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sa8d = PFX(pixel_sa8d_32x64_neon); + + // vertical interpolation filters + p.pu[LUMA_4x4].luma_vpp = PFX(interp_8tap_vert_pp_4x4_neon); + p.pu[LUMA_4x8].luma_vpp = PFX(interp_8tap_vert_pp_4x8_neon); + p.pu[LUMA_4x16].luma_vpp = PFX(interp_8tap_vert_pp_4x16_neon); + p.pu[LUMA_8x4].luma_vpp = PFX(interp_8tap_vert_pp_8x4_neon); + p.pu[LUMA_8x8].luma_vpp = PFX(interp_8tap_vert_pp_8x8_neon); + p.pu[LUMA_8x16].luma_vpp = PFX(interp_8tap_vert_pp_8x16_neon); + p.pu[LUMA_8x32].luma_vpp = PFX(interp_8tap_vert_pp_8x32_neon); + p.pu[LUMA_16x4].luma_vpp = PFX(interp_8tap_vert_pp_16x4_neon); + p.pu[LUMA_16x8].luma_vpp = PFX(interp_8tap_vert_pp_16x8_neon); + p.pu[LUMA_16x16].luma_vpp = PFX(interp_8tap_vert_pp_16x16_neon); + p.pu[LUMA_16x32].luma_vpp = PFX(interp_8tap_vert_pp_16x32_neon); + p.pu[LUMA_16x64].luma_vpp = PFX(interp_8tap_vert_pp_16x64_neon); + p.pu[LUMA_16x12].luma_vpp = PFX(interp_8tap_vert_pp_16x12_neon); + p.pu[LUMA_32x8].luma_vpp = PFX(interp_8tap_vert_pp_32x8_neon); + p.pu[LUMA_32x16].luma_vpp = PFX(interp_8tap_vert_pp_32x16_neon); + p.pu[LUMA_32x32].luma_vpp = PFX(interp_8tap_vert_pp_32x32_neon); + p.pu[LUMA_32x64].luma_vpp = PFX(interp_8tap_vert_pp_32x64_neon); + p.pu[LUMA_32x24].luma_vpp = PFX(interp_8tap_vert_pp_32x24_neon); + p.pu[LUMA_64x16].luma_vpp = PFX(interp_8tap_vert_pp_64x16_neon); + p.pu[LUMA_64x32].luma_vpp = PFX(interp_8tap_vert_pp_64x32_neon); + p.pu[LUMA_64x64].luma_vpp = PFX(interp_8tap_vert_pp_64x64_neon); + p.pu[LUMA_64x48].luma_vpp = PFX(interp_8tap_vert_pp_64x48_neon); + p.pu[LUMA_24x32].luma_vpp = PFX(interp_8tap_vert_pp_24x32_neon); + p.pu[LUMA_48x64].luma_vpp = PFX(interp_8tap_vert_pp_48x64_neon); + p.pu[LUMA_12x16].luma_vpp = PFX(interp_8tap_vert_pp_12x16_neon); + + p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_neon); + p.pu[LUMA_4x8].luma_vsp = PFX(interp_8tap_vert_sp_4x8_neon); + p.pu[LUMA_4x16].luma_vsp = PFX(interp_8tap_vert_sp_4x16_neon); + p.pu[LUMA_8x4].luma_vsp = PFX(interp_8tap_vert_sp_8x4_neon); + p.pu[LUMA_8x8].luma_vsp = PFX(interp_8tap_vert_sp_8x8_neon); + p.pu[LUMA_8x16].luma_vsp = PFX(interp_8tap_vert_sp_8x16_neon); + p.pu[LUMA_8x32].luma_vsp = 
PFX(interp_8tap_vert_sp_8x32_neon); + p.pu[LUMA_16x4].luma_vsp = PFX(interp_8tap_vert_sp_16x4_neon); + p.pu[LUMA_16x8].luma_vsp = PFX(interp_8tap_vert_sp_16x8_neon); + p.pu[LUMA_16x16].luma_vsp = PFX(interp_8tap_vert_sp_16x16_neon); + p.pu[LUMA_16x32].luma_vsp = PFX(interp_8tap_vert_sp_16x32_neon); + p.pu[LUMA_16x64].luma_vsp = PFX(interp_8tap_vert_sp_16x64_neon); + p.pu[LUMA_16x12].luma_vsp = PFX(interp_8tap_vert_sp_16x12_neon); + p.pu[LUMA_32x8].luma_vsp = PFX(interp_8tap_vert_sp_32x8_neon); + p.pu[LUMA_32x16].luma_vsp = PFX(interp_8tap_vert_sp_32x16_neon); + p.pu[LUMA_32x32].luma_vsp = PFX(interp_8tap_vert_sp_32x32_neon); + p.pu[LUMA_32x64].luma_vsp = PFX(interp_8tap_vert_sp_32x64_neon); + p.pu[LUMA_32x24].luma_vsp = PFX(interp_8tap_vert_sp_32x24_neon); + p.pu[LUMA_64x16].luma_vsp = PFX(interp_8tap_vert_sp_64x16_neon); + p.pu[LUMA_64x32].luma_vsp = PFX(interp_8tap_vert_sp_64x32_neon); + p.pu[LUMA_64x64].luma_vsp = PFX(interp_8tap_vert_sp_64x64_neon); + p.pu[LUMA_64x48].luma_vsp = PFX(interp_8tap_vert_sp_64x48_neon); + p.pu[LUMA_24x32].luma_vsp = PFX(interp_8tap_vert_sp_24x32_neon); + p.pu[LUMA_48x64].luma_vsp = PFX(interp_8tap_vert_sp_48x64_neon); + p.pu[LUMA_12x16].luma_vsp = PFX(interp_8tap_vert_sp_12x16_neon); + + p.pu[LUMA_4x4].luma_vps = PFX(interp_8tap_vert_ps_4x4_neon); + p.pu[LUMA_4x8].luma_vps = PFX(interp_8tap_vert_ps_4x8_neon); + p.pu[LUMA_4x16].luma_vps = PFX(interp_8tap_vert_ps_4x16_neon); + p.pu[LUMA_8x4].luma_vps = PFX(interp_8tap_vert_ps_8x4_neon); + p.pu[LUMA_8x8].luma_vps = PFX(interp_8tap_vert_ps_8x8_neon); + p.pu[LUMA_8x16].luma_vps = PFX(interp_8tap_vert_ps_8x16_neon); + p.pu[LUMA_8x32].luma_vps = PFX(interp_8tap_vert_ps_8x32_neon); + p.pu[LUMA_16x4].luma_vps = PFX(interp_8tap_vert_ps_16x4_neon); + p.pu[LUMA_16x8].luma_vps = PFX(interp_8tap_vert_ps_16x8_neon); + p.pu[LUMA_16x16].luma_vps = PFX(interp_8tap_vert_ps_16x16_neon); + p.pu[LUMA_16x32].luma_vps = PFX(interp_8tap_vert_ps_16x32_neon); + p.pu[LUMA_16x64].luma_vps = PFX(interp_8tap_vert_ps_16x64_neon); + p.pu[LUMA_16x12].luma_vps = PFX(interp_8tap_vert_ps_16x12_neon); + p.pu[LUMA_32x8].luma_vps = PFX(interp_8tap_vert_ps_32x8_neon); + p.pu[LUMA_32x16].luma_vps = PFX(interp_8tap_vert_ps_32x16_neon); + p.pu[LUMA_32x32].luma_vps = PFX(interp_8tap_vert_ps_32x32_neon); + p.pu[LUMA_32x64].luma_vps = PFX(interp_8tap_vert_ps_32x64_neon); + p.pu[LUMA_32x24].luma_vps = PFX(interp_8tap_vert_ps_32x24_neon); + p.pu[LUMA_64x16].luma_vps = PFX(interp_8tap_vert_ps_64x16_neon); + p.pu[LUMA_64x32].luma_vps = PFX(interp_8tap_vert_ps_64x32_neon); + p.pu[LUMA_64x64].luma_vps = PFX(interp_8tap_vert_ps_64x64_neon); + p.pu[LUMA_64x48].luma_vps = PFX(interp_8tap_vert_ps_64x48_neon); + p.pu[LUMA_24x32].luma_vps = PFX(interp_8tap_vert_ps_24x32_neon); + p.pu[LUMA_48x64].luma_vps = PFX(interp_8tap_vert_ps_48x64_neon); + p.pu[LUMA_12x16].luma_vps = PFX(interp_8tap_vert_ps_12x16_neon); + + //vertical chroma filters + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = PFX(interp_4tap_vert_pp_8x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = PFX(interp_4tap_vert_pp_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vpp = 
PFX(interp_4tap_vert_pp_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = PFX(interp_4tap_vert_pp_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = PFX(interp_4tap_vert_pp_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = PFX(interp_4tap_vert_pp_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = PFX(interp_4tap_vert_pp_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vpp = PFX(interp_4tap_vert_pp_24x64_neon); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_neon); + 
p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = PFX(interp_4tap_vert_pp_64x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vpp = PFX(interp_4tap_vert_pp_64x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vpp = PFX(interp_4tap_vert_pp_64x48_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vpp = PFX(interp_4tap_vert_pp_64x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vpp = PFX(interp_4tap_vert_pp_48x64_neon); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vps = PFX(interp_4tap_vert_ps_8x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vps = PFX(interp_4tap_vert_ps_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = PFX(interp_4tap_vert_ps_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = PFX(interp_4tap_vert_ps_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = PFX(interp_4tap_vert_ps_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_neon); + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = PFX(interp_4tap_vert_ps_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vps = PFX(interp_4tap_vert_ps_24x64_neon); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vps = PFX(interp_4tap_vert_ps_64x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vps = PFX(interp_4tap_vert_ps_64x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vps = PFX(interp_4tap_vert_ps_64x48_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vps = PFX(interp_4tap_vert_ps_64x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vps = PFX(interp_4tap_vert_ps_48x64_neon); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vsp = PFX(interp_4tap_vert_sp_8x2_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vsp = PFX(interp_4tap_vert_sp_8x6_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_neon); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_neon); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_neon); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = PFX(interp_4tap_vert_sp_8x12_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = PFX(interp_4tap_vert_sp_8x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = PFX(interp_4tap_vert_sp_16x24_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = PFX(interp_4tap_vert_sp_32x48_neon); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = PFX(interp_4tap_vert_sp_24x64_neon); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = PFX(interp_4tap_vert_sp_64x16_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = PFX(interp_4tap_vert_sp_64x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = PFX(interp_4tap_vert_sp_64x48_neon); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = PFX(interp_4tap_vert_sp_64x64_neon); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_neon); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = PFX(interp_4tap_vert_sp_48x64_neon); + + p.cu[BLOCK_4x4].dct = PFX(dct_4x4_neon); + p.cu[BLOCK_8x8].dct = PFX(dct_8x8_neon); + p.cu[BLOCK_16x16].dct = 
PFX(dct_16x16_neon); +#if !HIGH_BIT_DEPTH + p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_4x4_neon); +#endif // !HIGH_BIT_DEPTH + } + if (cpuMask & X265_CPU_ARMV6) + { + p.pu[LUMA_4x4].sad = PFX(pixel_sad_4x4_armv6); + p.pu[LUMA_4x8].sad = PFX(pixel_sad_4x8_armv6); + p.pu[LUMA_4x16].sad = PFX(pixel_sad_4x16_armv6); + } +} +} // namespace X265_NS
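The tables above are x265's runtime CPU dispatch for ARM: when `cpuMask` reports NEON (or ARMv6) support, the surrounding setup function overwrites the portable C function pointers in the primitives table with the `PFX(..._neon)` assembly routines registered here, and every call site goes through the table unchanged. Below is a minimal, self-contained C++ sketch of that pattern; the names (`Primitives`, `setupPrimitives`, `sad_c`, `sad_neon_stub`, `CPU_NEON`) are illustrative stand-ins, not x265's actual types or flags.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Illustrative stand-ins for x265's primitives table and X265_CPU_* flags.
enum { CPU_NEON = 1 << 0, CPU_ARMV6 = 1 << 1 };

typedef int (*sad_t)(const uint8_t* pix1, const uint8_t* pix2, int count);

// Portable C reference; every primitive starts out pointing here.
static int sad_c(const uint8_t* pix1, const uint8_t* pix2, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += abs(pix1[i] - pix2[i]);
    return sum;
}

// Stand-in for a routine that would really live in a .S file
// (cf. PFX(pixel_sad_*_neon) above); here it just forwards to the C version.
static int sad_neon_stub(const uint8_t* pix1, const uint8_t* pix2, int count)
{
    return sad_c(pix1, pix2, count);
}

struct Primitives { sad_t sad; };

static void setupPrimitives(Primitives& p, int cpuMask)
{
    p.sad = sad_c;                // C fallback first
    if (cpuMask & CPU_NEON)
        p.sad = sad_neon_stub;    // then overwrite per detected feature, as above
}

int main()
{
    uint8_t a[16] = { 10 }, b[16] = { 3 };
    Primitives p;
    setupPrimitives(p, CPU_NEON);
    printf("sad = %d\n", p.sad(a, b, 16)); // dispatches through the table
    return 0;
}
```

The indirection keeps call sites identical on every CPU; only the one-time table setup differs, which is why the patch can enable or disable whole blocks of NEON assignments with a single `cpuMask` check.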
View file
x265_2.0.tar.gz/source/common/arm/asm.S
Added
@@ -0,0 +1,194 @@ +/***************************************************************************** + * asm.S: arm utility macros + ***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Mans Rullgard <mans@mansr.com> + * David Conrad <lessen42@gmail.com> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +.syntax unified + +#if HAVE_NEON + .arch armv7-a +#elif HAVE_ARMV6T2 + .arch armv6t2 +#elif HAVE_ARMV6 + .arch armv6 +#endif + +.fpu neon + +#ifdef PREFIX +# define EXTERN_ASM _ +#else +# define EXTERN_ASM +#endif + +#ifdef __ELF__ +# define ELF +#else +# define ELF @ +#endif + +#if HAVE_AS_FUNC +# define FUNC +#else +# define FUNC @ +#endif + +.macro require8, val=1 +ELF .eabi_attribute 24, \val +.endm + +.macro preserve8, val=1 +ELF .eabi_attribute 25, \val +.endm + +.macro function name, export=1 + .macro endfunc +ELF .size \name, . - \name
+FUNC .endfunc + .purgem endfunc + .endm + .align 2 +.if \export == 1 + .global EXTERN_ASM\name +ELF .hidden EXTERN_ASM\name +ELF .type EXTERN_ASM\name, %function +FUNC .func EXTERN_ASM\name +EXTERN_ASM\name: +.else +ELF .hidden \name +ELF .type \name, %function +FUNC .func \name +\name: +.endif +.endm + +.macro movrel rd, val +#if HAVE_ARMV6T2 && !defined(PIC) + movw \rd, #:lower16:\val + movt \rd, #:upper16:\val +#else + ldr \rd, =\val +#endif +.endm + +.macro movconst rd, val +#if HAVE_ARMV6T2 + movw \rd, #:lower16:\val +.if \val >> 16 + movt \rd, #:upper16:\val +.endif +#else + ldr \rd, =\val +#endif +.endm + +#define GLUE(a, b) a ## b +#define JOIN(a, b) GLUE(a, b) +#define X(s) JOIN(EXTERN_ASM, s) + +#define FENC_STRIDE 64 +#define FDEC_STRIDE 32 + +.macro HORIZ_ADD dest, a, b +.ifnb \b + vadd.u16 \a, \a, \b +.endif + vpaddl.u16 \a, \a + vpaddl.u32 \dest, \a +.endm + +.macro SUMSUB_AB sum, diff, a, b + vadd.s16 \sum, \a, \b + vsub.s16 \diff, \a, \b +.endm + +.macro SUMSUB_ABCD s1, d1, s2, d2, a, b, c, d + SUMSUB_AB \s1, \d1, \a, \b + SUMSUB_AB \s2, \d2, \c, \d +.endm + +.macro ABS2 a b + vabs.s16 \a, \a + vabs.s16 \b, \b +.endm + +// dist = distance in elements (0 for vertical pass, 1/2 for horizontal passes) +// op = sumsub/amax (sum and diff / maximum of absolutes) +// d1/2 = destination registers +// s1/2 = source registers +.macro HADAMARD dist, op, d1, d2, s1, s2 +.if \dist == 1 + vtrn.16 \s1, \s2 +.else + vtrn.32 \s1, \s2 +.endif +.ifc \op, sumsub + SUMSUB_AB \d1, \d2, \s1, \s2 +.else + vabs.s16 \s1, \s1 + vabs.s16 \s2, \s2 + vmax.s16 \d1, \s1, \s2 +.endif +.endm + +.macro TRANSPOSE8x8 r0 r1 r2 r3 r4 r5 r6 r7 + vtrn.32 \r0, \r4 + vtrn.32 \r1, \r5 + vtrn.32 \r2, \r6 + vtrn.32 \r3, \r7 + vtrn.16 \r0, \r2 + vtrn.16 \r1, \r3 + vtrn.16 \r4, \r6 + vtrn.16 \r5, \r7 + vtrn.8 \r0, \r1 + vtrn.8 \r2, \r3 + vtrn.8 \r4, \r5 + vtrn.8 \r6, \r7 +.endm + +.macro TRANSPOSE4x4 r0 r1 r2 r3 + vtrn.16 \r0, \r2 + vtrn.16 \r1, \r3 + vtrn.8 \r0, \r1 + vtrn.8 \r2, \r3 +.endm + +.macro TRANSPOSE4x4_16 r0, r1, r2, r3 + vtrn.32 \r0, \r2 // r0 = [21 20 01 00], r2 = [23 22 03 02] + vtrn.32 \r1, \r3 // r1 = [31 30 11 10], r3 = [33 32 13 12] + vtrn.16 \r0, \r1 // r0 = [30 20 10 00], r1 = [31 21 11 01] + vtrn.16 \r2, \r3 // r2 = [32 22 12 02], r3 = [33 23 13 03] +.endm + +.macro TRANSPOSE4x4x2_16 rA0, rA1, rA2, rA3, rB0, rB1, rB2, rB3 + vtrn.32 \rA0, \rA2 // r0 = [21 20 01 00], r2 = [23 22 03 02] + vtrn.32 \rA1, \rA3 // r1 = [31 30 11 10], r3 = [33 32 13 12] + vtrn.32 \rB0, \rB2 + vtrn.32 \rB1, \rB3 + vtrn.16 \rA0, \rA1 // r0 = [30 20 10 00], r1 = [31 21 11 01] + vtrn.16 \rA2, \rA3 // r2 = [32 22 12 02], r3 = [33 23 13 03] + vtrn.16 \rB0, \rB1 + vtrn.16 \rB2, \rB3 +.endm
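The next file, blockcopy8.S, implements the sp/ps/ss copy primitives registered in the table earlier. As a reading aid, here is a scalar C++ sketch of the semantics implied by the signatures in that file's own comments: `sp` narrows int16_t coefficients back to pixels (what the `vmovn.u16` instructions do), `ps` widens pixels to int16_t (`vmovl.u8`). Two assumptions are labeled in the code: the `pixel` typedef is for an 8-bit build, and int16_t values are taken to fit the pixel range, since narrowing truncates. Note also that the assembly doubles the int16_t-side stride (`lsl r3, #1` / `lsl r1, #1`) because the interface passes strides in elements, not bytes.

```cpp
#include <cstdint>
#include <cstddef>

typedef uint8_t pixel; // assumption: 8-bit build; HIGH_BIT_DEPTH builds use a wider type

// Scalar reference for blockcopy_sp: copy a WxH block of int16_t
// coefficients into a pixel surface, truncating each value to 8 bits
// (inputs are assumed to already fit the pixel range).
template<int W, int H>
void blockcopy_sp_ref(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb)
{
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            a[x] = (pixel)b[x];   // NEON: vmovn.u16 narrows 8 lanes at once
        a += stridea;             // strides are in elements of each type
        b += strideb;
    }
}

// blockcopy_ps is the opposite direction: widen each pixel to 16 bits.
template<int W, int H>
void blockcopy_ps_ref(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb)
{
    for (int y = 0; y < H; y++)
    {
        for (int x = 0; x < W; x++)
            a[x] = (int16_t)b[x]; // NEON: vmovl.u8 widens 8 lanes at once
        a += stridea;
        b += strideb;
    }
}
```

The NEON versions below process whole rows per iteration (q-register loads of 16 or 32 lanes), but the element-for-element behavior is exactly this.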
View file
x265_2.0.tar.gz/source/common/arm/blockcopy8.S
Added
@@ -0,0 +1,838 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Radhakrishnan VR <radhakrishnan@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata + +.align 4 + +.text + +/* void blockcopy_sp(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb) + * + * r0 - a + * r1 - stridea + * r2 - b + * r3 - strideb */ +function x265_blockcopy_sp_4x4_neon + lsl r3, #1 +.rept 2 + vld1.u16 {q0}, [r2], r3 + vld1.u16 {q1}, [r2], r3 + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vst1.u32 {d0[0]}, [r0], r1 + vst1.u32 {d1[0]}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_sp_8x8_neon + lsl r3, #1 +.rept 4 + vld1.u16 {q0}, [r2], r3 + vld1.u16 {q1}, [r2], r3 + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vst1.u8 {d0}, [r0], r1 + vst1.u8 {d1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_sp_16x16_neon + lsl r3, #1 +.rept 8 + vld1.u16 {q0, q1}, [r2], r3 + vld1.u16 {q2, q3}, [r2], r3 + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vmovn.u16 d2, q2 + vmovn.u16 d3, q3 + vst1.u8 {q0}, [r0], r1 + vst1.u8 {q1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_sp_32x32_neon + mov r12, #4 + lsl r3, #1 + sub r3, #32 +loop_csp32: + subs r12, #1 +.rept 4 + vld1.u16 {q0, q1}, [r2]! + vld1.u16 {q2, q3}, [r2], r3 + vld1.u16 {q8, q9}, [r2]! + vld1.u16 {q10, q11}, [r2], r3 + + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vmovn.u16 d2, q2 + vmovn.u16 d3, q3 + + vmovn.u16 d4, q8 + vmovn.u16 d5, q9 + vmovn.u16 d6, q10 + vmovn.u16 d7, q11 + + vst1.u8 {q0, q1}, [r0], r1 + vst1.u8 {q2, q3}, [r0], r1 +.endr + bne loop_csp32 + bx lr +endfunc + +function x265_blockcopy_sp_64x64_neon + mov r12, #16 + lsl r3, #1 + sub r3, #96 + sub r1, #32 +loop_csp64: + subs r12, #1 +.rept 4 + vld1.u16 {q0, q1}, [r2]! + vld1.u16 {q2, q3}, [r2]! + vld1.u16 {q8, q9}, [r2]! + vld1.u16 {q10, q11}, [r2], r3 + + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vmovn.u16 d2, q2 + vmovn.u16 d3, q3 + + vmovn.u16 d4, q8 + vmovn.u16 d5, q9 + vmovn.u16 d6, q10 + vmovn.u16 d7, q11 + + vst1.u8 {q0, q1}, [r0]! 
+ vst1.u8 {q2, q3}, [r0], r1 +.endr + bne loop_csp64 + bx lr +endfunc + +// void blockcopy_ps(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb) +function x265_blockcopy_ps_4x4_neon + lsl r1, #1 +.rept 2 + vld1.u8 {d0}, [r2], r3 + vld1.u8 {d1}, [r2], r3 + vmovl.u8 q1, d0 + vmovl.u8 q2, d1 + vst1.u16 {d2}, [r0], r1 + vst1.u16 {d4}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ps_8x8_neon + lsl r1, #1 +.rept 4 + vld1.u8 {d0}, [r2], r3 + vld1.u8 {d1}, [r2], r3 + vmovl.u8 q1, d0 + vmovl.u8 q2, d1 + vst1.u16 {q1}, [r0], r1 + vst1.u16 {q2}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ps_16x16_neon + lsl r1, #1 +.rept 8 + vld1.u8 {q0}, [r2], r3 + vld1.u8 {q1}, [r2], r3 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vmovl.u8 q10, d2 + vmovl.u8 q11, d3 + vst1.u16 {q8, q9}, [r0], r1 + vst1.u16 {q10, q11}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ps_32x32_neon + lsl r1, #1 + sub r1, #32 + mov r12, #4 +loop_cps32: + subs r12, #1 +.rept 4 + vld1.u8 {q0, q1}, [r2], r3 + vld1.u8 {q2, q3}, [r2], r3 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vmovl.u8 q10, d2 + vmovl.u8 q11, d3 + + vmovl.u8 q12, d4 + vmovl.u8 q13, d5 + vmovl.u8 q14, d6 + vmovl.u8 q15, d7 + + vst1.u16 {q8, q9}, [r0]! + vst1.u16 {q10, q11}, [r0], r1 + vst1.u16 {q12, q13}, [r0]! + vst1.u16 {q14, q15}, [r0], r1 +.endr + bne loop_cps32 + bx lr +endfunc + +function x265_blockcopy_ps_64x64_neon + lsl r1, #1 + sub r1, #96 + sub r3, #32 + mov r12, #16 +loop_cps64: + subs r12, #1 +.rept 4 + vld1.u8 {q0, q1}, [r2]! + vld1.u8 {q2, q3}, [r2], r3 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vmovl.u8 q10, d2 + vmovl.u8 q11, d3 + + vmovl.u8 q12, d4 + vmovl.u8 q13, d5 + vmovl.u8 q14, d6 + vmovl.u8 q15, d7 + + vst1.u16 {q8, q9}, [r0]! + vst1.u16 {q10, q11}, [r0]! + vst1.u16 {q12, q13}, [r0]! + vst1.u16 {q14, q15}, [r0], r1 +.endr + bne loop_cps64 + bx lr +endfunc + +// void x265_blockcopy_ss(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb) +function x265_blockcopy_ss_4x4_neon + lsl r1, #1 + lsl r3, #1 +.rept 2 + vld1.u16 {d0}, [r2], r3 + vld1.u16 {d1}, [r2], r3 + vst1.u16 {d0}, [r0], r1 + vst1.u16 {d1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ss_8x8_neon + lsl r1, #1 + lsl r3, #1 +.rept 4 + vld1.u16 {q0}, [r2], r3 + vld1.u16 {q1}, [r2], r3 + vst1.u16 {q0}, [r0], r1 + vst1.u16 {q1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ss_16x16_neon + lsl r1, #1 + lsl r3, #1 +.rept 8 + vld1.u16 {q0, q1}, [r2], r3 + vld1.u16 {q2, q3}, [r2], r3 + vst1.u16 {q0, q1}, [r0], r1 + vst1.u16 {q2, q3}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ss_32x32_neon + lsl r1, #1 + lsl r3, #1 + mov r12, #4 + sub r1, #32 + sub r3, #32 +loop_css32: + subs r12, #1 +.rept 8 + vld1.u16 {q0, q1}, [r2]! + vld1.u16 {q2, q3}, [r2], r3 + vst1.u16 {q0, q1}, [r0]! + vst1.u16 {q2, q3}, [r0], r1 +.endr + bne loop_css32 + bx lr +endfunc + +function x265_blockcopy_ss_64x64_neon + lsl r1, #1 + lsl r3, #1 + mov r12, #8 + sub r1, #96 + sub r3, #96 +loop_css64: + subs r12, #1 +.rept 8 + vld1.u16 {q0, q1}, [r2]! + vld1.u16 {q2, q3}, [r2]! + vld1.u16 {q8, q9}, [r2]! + vld1.u16 {q10, q11}, [r2], r3 + + vst1.u16 {q0, q1}, [r0]! + vst1.u16 {q2, q3}, [r0]! + vst1.u16 {q8, q9}, [r0]! 
+ vst1.u16 {q10, q11}, [r0], r1 +.endr + bne loop_css64 + bx lr +endfunc + +/******** Chroma blockcopy********/ +function x265_blockcopy_ss_4x8_neon + lsl r1, #1 + lsl r3, #1 +.rept 4 + vld1.u16 {d0}, [r2], r3 + vld1.u16 {d1}, [r2], r3 + vst1.u16 {d0}, [r0], r1 + vst1.u16 {d1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ss_8x16_neon + lsl r1, #1 + lsl r3, #1 +.rept 8 + vld1.u16 {q0}, [r2], r3 + vld1.u16 {q1}, [r2], r3 + vst1.u16 {q0}, [r0], r1 + vst1.u16 {q1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ss_16x32_neon + lsl r1, #1 + lsl r3, #1 +.rept 16 + vld1.u16 {q0, q1}, [r2], r3 + vld1.u16 {q2, q3}, [r2], r3 + vst1.u16 {q0, q1}, [r0], r1 + vst1.u16 {q2, q3}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ss_32x64_neon + lsl r1, #1 + lsl r3, #1 + mov r12, #8 + sub r1, #32 + sub r3, #32 +loop_css_32x64: + subs r12, #1 +.rept 8 + vld1.u16 {q0, q1}, [r2]! + vld1.u16 {q2, q3}, [r2], r3 + vst1.u16 {q0, q1}, [r0]! + vst1.u16 {q2, q3}, [r0], r1 +.endr + bne loop_css_32x64 + bx lr +endfunc + +// chroma blockcopy_ps +function x265_blockcopy_ps_4x8_neon + lsl r1, #1 +.rept 4 + vld1.u8 {d0}, [r2], r3 + vld1.u8 {d1}, [r2], r3 + vmovl.u8 q1, d0 + vmovl.u8 q2, d1 + vst1.u16 {d2}, [r0], r1 + vst1.u16 {d4}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ps_8x16_neon + lsl r1, #1 +.rept 8 + vld1.u8 {d0}, [r2], r3 + vld1.u8 {d1}, [r2], r3 + vmovl.u8 q1, d0 + vmovl.u8 q2, d1 + vst1.u16 {q1}, [r0], r1 + vst1.u16 {q2}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_ps_16x32_neon + lsl r1, #1 + mov r12, #4 +loop_cps_16x32: + subs r12, #1 +.rept 4 + vld1.u8 {q0}, [r2], r3 + vld1.u8 {q1}, [r2], r3 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vmovl.u8 q10, d2 + vmovl.u8 q11, d3 + vst1.u16 {q8, q9}, [r0], r1 + vst1.u16 {q10, q11}, [r0], r1 +.endr + bne loop_cps_16x32 + bx lr +endfunc + +function x265_blockcopy_ps_32x64_neon + lsl r1, #1 + sub r1, #32 + mov r12, #8 +loop_cps_32x64: + subs r12, #1 +.rept 4 + vld1.u8 {q0, q1}, [r2], r3 + vld1.u8 {q2, q3}, [r2], r3 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vmovl.u8 q10, d2 + vmovl.u8 q11, d3 + + vmovl.u8 q12, d4 + vmovl.u8 q13, d5 + vmovl.u8 q14, d6 + vmovl.u8 q15, d7 + + vst1.u16 {q8, q9}, [r0]! + vst1.u16 {q10, q11}, [r0], r1 + vst1.u16 {q12, q13}, [r0]! + vst1.u16 {q14, q15}, [r0], r1 +.endr + bne loop_cps_32x64 + bx lr +endfunc + +// chroma blockcopy_sp +function x265_blockcopy_sp_4x8_neon + lsl r3, #1 +.rept 4 + vld1.u16 {q0}, [r2], r3 + vld1.u16 {q1}, [r2], r3 + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vst1.u32 {d0[0]}, [r0], r1 + vst1.u32 {d1[0]}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_sp_8x16_neon + lsl r3, #1 +.rept 8 + vld1.u16 {q0}, [r2], r3 + vld1.u16 {q1}, [r2], r3 + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vst1.u8 {d0}, [r0], r1 + vst1.u8 {d1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_sp_16x32_neon + lsl r3, #1 + mov r12, #4 +loop_csp_16x32: + subs r12, #1 +.rept 4 + vld1.u16 {q0, q1}, [r2], r3 + vld1.u16 {q2, q3}, [r2], r3 + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vmovn.u16 d2, q2 + vmovn.u16 d3, q3 + vst1.u8 {q0}, [r0], r1 + vst1.u8 {q1}, [r0], r1 +.endr + bne loop_csp_16x32 + bx lr +endfunc + +function x265_blockcopy_sp_32x64_neon + mov r12, #8 + lsl r3, #1 + sub r3, #32 +loop_csp_32x64: + subs r12, #1 +.rept 4 + vld1.u16 {q0, q1}, [r2]! + vld1.u16 {q2, q3}, [r2], r3 + vld1.u16 {q8, q9}, [r2]! 
+ vld1.u16 {q10, q11}, [r2], r3 + + vmovn.u16 d0, q0 + vmovn.u16 d1, q1 + vmovn.u16 d2, q2 + vmovn.u16 d3, q3 + + vmovn.u16 d4, q8 + vmovn.u16 d5, q9 + vmovn.u16 d6, q10 + vmovn.u16 d7, q11 + + vst1.u8 {q0, q1}, [r0], r1 + vst1.u8 {q2, q3}, [r0], r1 +.endr + bne loop_csp_32x64 + bx lr +endfunc + +// void x265_blockfill_s_neon(int16_t* dst, intptr_t dstride, int16_t val) +function x265_blockfill_s_4x4_neon + vdup.u16 d0, r2 + lsl r1, #1 +.rept 4 + vst1.16 {d0}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockfill_s_8x8_neon + vdup.u16 q0, r2 + lsl r1, #1 +.rept 8 + vst1.16 {q0}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockfill_s_16x16_neon + vdup.u16 q0, r2 + vmov q1, q0 + lsl r1, #1 +.rept 16 + vst1.16 {q0, q1}, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockfill_s_32x32_neon + vdup.u16 q0, r2 + vmov q1, q0 + lsl r1, #1 + sub r1, #32 +.rept 32 + vst1.16 {q0, q1}, [r0]! + vst1.16 {q0, q1}, [r0], r1 +.endr + bx lr +endfunc + +// uint32_t copy_count(int16_t* coeff, const int16_t* residual, intptr_t resiStride) +function x265_copy_cnt_4_neon + lsl r2, #1 + mov r12, #8 + veor d4, d4 +.rept 2 + vld1.s16 {d0}, [r1], r2 + vld1.s16 {d1}, [r1], r2 + vclz.i16 d2, d0 + vclz.i16 d3, d1 + vshr.u16 q1, #4 + vadd.u16 d2, d3 + vadd.u16 d4, d2 + vst1.s16 {d0}, [r0], r12 + vst1.s16 {d1}, [r0], r12 +.endr + vpadd.u16 d4, d4 + vpadd.u16 d4, d4 + vmov.u16 r12, d4[0] + rsb r0, r12, #16 + bx lr +endfunc + +function x265_copy_cnt_8_neon + lsl r2, #1 + mov r12, #16 + veor q8, q8 +.rept 4 + vld1.s16 {q0}, [r1], r2 + vld1.s16 {q1}, [r1], r2 + vclz.i16 q2, q0 + vclz.i16 q3, q1 + vshr.u16 q2, #4 + vshr.u16 q3, #4 + vadd.u16 q2, q3 + vadd.u16 q8, q2 + vst1.s16 {q0}, [r0], r12 + vst1.s16 {q1}, [r0], r12 +.endr + vadd.u16 d16, d17 + vpadd.u16 d16, d16 + vpadd.u16 d16, d16 + vmov.u16 r12, d16[0] + rsb r0, r12, #64 + bx lr +endfunc + +function x265_copy_cnt_16_neon + lsl r2, #1 + mov r12, #32 + veor q2, q2 +.rept 16 + vld1.s16 {q0, q1}, [r1], r2 + vst1.s16 {q0, q1}, [r0], r12 + vclz.i16 q8, q0 + vclz.i16 q9, q1 + vshr.u16 q8, #4 + vshr.u16 q9, #4 + vadd.u16 q8, q9 + vadd.u16 q2, q8 +.endr + vadd.u16 d4, d5 + vpadd.u16 d4, d4 + vpadd.u16 d4, d4 + + vmov.u16 r12, d4[0] + rsb r0, r12, #256 + bx lr +endfunc + +function x265_copy_cnt_32_neon + lsl r2, #1 + sub r2, #32 + mov r12, #32 + veor q12, q12 +.rept 32 + vld1.s16 {q0, q1}, [r1]! + vld1.s16 {q2, q3}, [r1], r2 + vst1.s16 {q0, q1}, [r0]! 
+ vst1.s16 {q2, q3}, [r0], r12 + + vclz.i16 q8, q0 + vclz.i16 q9, q1 + vclz.i16 q10, q2 + vclz.i16 q11, q3 + + vshr.u16 q8, #4 + vshr.u16 q9, #4 + vshr.u16 q10, #4 + vshr.u16 q11, #4 + + vadd.u16 q8, q9 + vadd.u16 q10, q11 + vadd.u16 q8, q10 + vadd.u16 q12, q8 +.endr + vadd.u16 d24, d25 + vpadd.u16 d24, d24 + vpadd.u16 d24, d24 + + vmov.u16 r12, d24[0] + rsb r0, r12, #1024 + bx lr +endfunc + +// int count_nonzero_c(const int16_t* quantCoeff) +function x265_count_nonzero_4_neon + vld1.s16 {d0-d3}, [r0] + vceq.u16 q0, #0 + vceq.u16 q1, #0 + eor r1, r1 + vtrn.8 q0, q1 + + vshr.u8 q0, #7 + + vadd.u8 d0, d1 + vshr.u64 d1, d0, #32 + vadd.u8 d0, d1 + vmov.u32 r0, d0[0] + usad8 r0, r0, r1 + rsb r0, #16 + bx lr +endfunc + +function x265_count_nonzero_8_neon + vldm r0, {q8-q15} + eor r1, r1 + vceq.u16 q8, #0 + vceq.u16 q9, #0 + vceq.u16 q10, #0 + vceq.u16 q11, #0 + vceq.u16 q12, #0 + vceq.u16 q13, #0 + vceq.u16 q14, #0 + vceq.u16 q15, #0 + + vtrn.8 q8, q9 + vtrn.8 q10, q11 + vtrn.8 q12, q13 + vtrn.8 q14, q15 + + vadd.s8 q8, q10 + vadd.s8 q12, q14 + vadd.s8 q8, q12 + + vadd.s8 d16, d17 + vshr.u64 d17, d16, #32 + vadd.s8 d16, d17 + vabs.s8 d16, d16 + + vmov.u32 r0, d16[0] + usad8 r0, r0, r1 + rsb r0, #64 + bx lr +endfunc + +function x265_count_nonzero_16_neon + vldm r0!, {q8-q15} + eor r1, r1 + vceq.u16 q8, #0 + vceq.u16 q9, #0 + vceq.u16 q10, #0 + vceq.u16 q11, #0 + vceq.u16 q12, #0 + vceq.u16 q13, #0 + vceq.u16 q14, #0 + vceq.u16 q15, #0 + + vtrn.8 q8, q9 + vtrn.8 q10, q11 + vtrn.8 q12, q13 + vtrn.8 q14, q15 + + vmov q0, q8 + vmov q1, q10 + vmov q2, q12 + vmov q3, q14 + +.rept 3 + vldm r0!, {q8-q15} + vceq.u16 q8, #0 + vceq.u16 q9, #0 + vceq.u16 q10, #0 + vceq.u16 q11, #0 + vceq.u16 q12, #0 + vceq.u16 q13, #0 + vceq.u16 q14, #0 + vceq.u16 q15, #0 + + vtrn.8 q8, q9 + vtrn.8 q10, q11 + vtrn.8 q12, q13 + vtrn.8 q14, q15 + + vadd.s8 q0, q8 + vadd.s8 q1, q10 + vadd.s8 q2, q12 + vadd.s8 q3, q14 +.endr + + vadd.s8 q0, q1 + vadd.s8 q2, q3 + vadd.s8 q0, q2 // dynamic range is 4+1 bits + + vadd.s8 d0, d1 + vshr.u64 d1, d0, #32 + vadd.s8 d0, d1 + vabs.s8 d0, d0 // maximum value of each element are 64 + + vmov.u32 r0, d0[0] + usad8 r0, r0, r1 + rsb r0, #256 + bx lr +endfunc + +function x265_count_nonzero_32_neon + vldm r0!, {q8-q15} + vceq.u16 q8, #0 + vceq.u16 q9, #0 + vceq.u16 q10, #0 + vceq.u16 q11, #0 + vceq.u16 q12, #0 + vceq.u16 q13, #0 + vceq.u16 q14, #0 + vceq.u16 q15, #0 + + vtrn.8 q8, q9 + vtrn.8 q10, q11 + vtrn.8 q12, q13 + vtrn.8 q14, q15 + + mov r1, #15 + + vmov q0, q8 + vmov q1, q10 + vmov q2, q12 + vmov q3, q14 + +.loop: + vldm r0!, {q8-q15} + subs r1, #1 + + vceq.u16 q8, #0 + vceq.u16 q9, #0 + vceq.u16 q10, #0 + vceq.u16 q11, #0 + vceq.u16 q12, #0 + vceq.u16 q13, #0 + vceq.u16 q14, #0 + vceq.u16 q15, #0 + + vtrn.8 q8, q9 + vtrn.8 q10, q11 + vtrn.8 q12, q13 + vtrn.8 q14, q15 + + vadd.s8 q0, q8 + vadd.s8 q1, q10 + vadd.s8 q2, q12 + vadd.s8 q3, q14 + bgt .loop + + // sum + vadd.s8 q0, q1 + vadd.s8 q2, q3 + vadd.s8 q0, q2 // dynamic range is 6+1 bits + + vaddl.s8 q0, d0, d1 + vadd.s16 d0, d1 + vshr.u64 d1, d0, #32 + vadd.s16 d0, d1 + vabs.s16 d0, d0 // maximum value of each element are 512 + + vmov.u32 r0, d0[0] + uasx r0, r0, r0 + mov r0, r0, lsr 16 + rsb r0, #1024 + bx lr +endfunc
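The kernels above are pure data movement: blockcopy_sp narrows int16_t coefficients back to pixels with vmovn.u16, blockcopy_ps widens pixels to int16_t with vmovl.u8, blockcopy_ss copies int16_t to int16_t, and copy_cnt copies a residual block while returning how many coefficients are nonzero. The counting trick is that vclz on a 16-bit lane yields 16 exactly when the lane is zero, so shifting the clz result right by 4 produces 1 per zero lane; the horizontal adds plus the final rsb turn the zero count into a nonzero count. A minimal C sketch of the scalar behaviour being vectorized (illustrative only; the 8-bit pixel typedef is an assumption matching a main-profile build):

    /* scalar reference for the NEON block-copy and copy_cnt kernels above */
    #include <stdint.h>

    typedef uint8_t pixel;

    /* int16 -> pixel copy: plain truncating narrow, as vmovn.u16 does */
    static void blockcopy_sp_ref(pixel* a, intptr_t stridea,
                                 const int16_t* b, intptr_t strideb,
                                 int bx, int by)
    {
        for (int y = 0; y < by; y++, a += stridea, b += strideb)
            for (int x = 0; x < bx; x++)
                a[x] = (pixel)b[x];
    }

    /* copy coefficients and count nonzeros; the NEON version counts zero
     * lanes via vclz(lane) >> 4 and subtracts from the block area with rsb */
    static uint32_t copy_cnt_ref(int16_t* coeff, const int16_t* residual,
                                 intptr_t resiStride, int size)
    {
        uint32_t numSig = 0;
        for (int y = 0; y < size; y++, residual += resiStride, coeff += size)
            for (int x = 0; x < size; x++)
            {
                coeff[x] = residual[x];
                numSig += (residual[x] != 0);
            }
        return numSig;
    }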
View file
x265_2.0.tar.gz/source/common/arm/blockcopy8.h
Added
@@ -0,0 +1,123 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_BLOCKCOPY8_ARM_H +#define X265_BLOCKCOPY8_ARM_H + +void x265_blockcopy_pp_16x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x4_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x8_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_12x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_4x4_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_4x8_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_4x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_16x4_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_16x8_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_16x12_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_16x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_16x64_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_24x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_32x8_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_32x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_32x24_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_32x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_32x64_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_48x64_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_64x16_neon(pixel* dst, intptr_t 
dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_64x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_64x48_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_64x64_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_2x4_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_2x8_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_2x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_6x8_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_6x16_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x2_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x6_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x12_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_8x64_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_12x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_4x2_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_4x32_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_16x24_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_24x64_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +void x265_blockcopy_pp_32x48_neon(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); + +void x265_cpy2Dto1D_shr_4x4_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +void x265_cpy2Dto1D_shr_8x8_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +void x265_cpy2Dto1D_shr_16x16_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +void x265_cpy2Dto1D_shr_32x32_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); + +void x265_blockcopy_sp_4x4_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_8x8_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_16x16_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_32x32_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_64x64_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); + +void x265_blockcopy_ps_4x4_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_8x8_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_16x16_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_32x32_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_64x64_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); + +void x265_blockcopy_ss_4x4_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_ss_8x8_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void 
x265_blockcopy_ss_16x16_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_ss_32x32_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_ss_64x64_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); + +// chroma blockcopy +void x265_blockcopy_ss_4x8_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_ss_8x16_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_ss_16x32_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_ss_32x64_neon(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); + +void x265_blockcopy_sp_4x8_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_8x16_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_16x32_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +void x265_blockcopy_sp_32x64_neon(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); + +void x265_blockcopy_ps_4x8_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_8x16_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_16x32_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); +void x265_blockcopy_ps_32x64_neon(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); + +void x265_blockfill_s_4x4_neon(int16_t* dst, intptr_t dstride, int16_t val); +void x265_blockfill_s_8x8_neon(int16_t* dst, intptr_t dstride, int16_t val); +void x265_blockfill_s_16x16_neon(int16_t* dst, intptr_t dstride, int16_t val); +void x265_blockfill_s_32x32_neon(int16_t* dst, intptr_t dstride, int16_t val); + +uint32_t x265_copy_cnt_4_neon(int16_t* coeff, const int16_t* residual, intptr_t resiStride); +uint32_t x265_copy_cnt_8_neon(int16_t* coeff, const int16_t* residual, intptr_t resiStride); +uint32_t x265_copy_cnt_16_neon(int16_t* coeff, const int16_t* residual, intptr_t resiStride); +uint32_t x265_copy_cnt_32_neon(int16_t* coeff, const int16_t* residual, intptr_t resiStride); + +int x265_count_nonzero_4_neon(const int16_t* quantCoeff); +int x265_count_nonzero_8_neon(const int16_t* quantCoeff); +int x265_count_nonzero_16_neon(const int16_t* quantCoeff); +int x265_count_nonzero_32_neon(const int16_t* quantCoeff); +#endif // ifndef X265_BLOCKCOPY8_ARM_H
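Nothing in this header registers the functions by itself; these prototypes are presumably wired into the encoder's primitive tables when NEON is detected at runtime. A hedged sketch of that wiring, in the style of x265's asm-primitives setup (the setupAssemblyPrimitives, EncoderPrimitives, BLOCK_* and X265_CPU_NEON names are assumptions here, since that file is not part of this diff):

    /* illustrative wiring only -- structure assumed, not shown in this diff */
    #if defined(X265_ARCH_ARM) && HAVE_NEON
    void setupAssemblyPrimitives(EncoderPrimitives *p, int cpuMask)
    {
        if (cpuMask & X265_CPU_NEON)
        {
            p->cu[BLOCK_4x4].copy_sp         = x265_blockcopy_sp_4x4_neon;
            p->cu[BLOCK_8x8].copy_sp         = x265_blockcopy_sp_8x8_neon;
            p->cu[BLOCK_16x16].copy_ps       = x265_blockcopy_ps_16x16_neon;
            p->cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32_neon;
            /* ...one assignment per declaration above */
        }
    }
    #endif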
View file
x265_2.0.tar.gz/source/common/arm/cpu-a.S
Added
@@ -0,0 +1,109 @@ +/***************************************************************************** + * cpu-a.S: arm cpu detection + ***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: David Conrad <lessen42@gmail.com> + *          Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.align 2 + +// done in gas because .fpu neon overrides the refusal to assemble +// instructions the selected -march/-mcpu doesn't support +function x265_cpu_neon_test + vadd.i16 q0, q0, q0 + bx lr +endfunc + +// return: 0 on success +//         1 if counters were already enabled +//         9 if lo-res counters were already enabled +function x265_cpu_enable_armv7_counter, export=0 + mrc p15, 0, r2, c9, c12, 0 // read PMNC + ands r0, r2, #1 + andne r0, r2, #9 + + orr r2, r2, #1 // enable counters + bic r2, r2, #8 // full resolution + mcreq p15, 0, r2, c9, c12, 0 // write PMNC + mov r2, #1 << 31 // enable cycle counter + mcr p15, 0, r2, c9, c12, 1 // write CNTENS + bx lr +endfunc + +function x265_cpu_disable_armv7_counter, export=0 + mrc p15, 0, r0, c9, c12, 0 // read PMNC + bic r0, r0, #1 // disable counters + mcr p15, 0, r0, c9, c12, 0 // write PMNC + bx lr +endfunc + + +.macro READ_TIME r + mrc p15, 0, \r, c9, c13, 0 +.endm + +// return: 0 if neon -> arm transfers take more than 10 cycles +//         nonzero otherwise +function x265_cpu_fast_neon_mrc_test + // check for user access to performance counters + mrc p15, 0, r0, c9, c14, 0 + cmp r0, #0 + bxeq lr + + push {r4-r6,lr} + bl x265_cpu_enable_armv7_counter + ands r1, r0, #8 + mov r3, #0 + mov ip, #4 + mov r6, #4 + moveq r5, #1 + movne r5, #64 + +average_loop: + mov r4, r5 + READ_TIME r1 +1: subs r4, r4, #1 +.rept 8 + vmov.u32 lr, d0[0] + add lr, lr, lr +.endr + bgt 1b + READ_TIME r2 + + subs r6, r6, #1 + sub r2, r2, r1 + cmpgt r2, #30 << 3 // assume context switch if it took over 30 cycles + addle r3, r3, r2 + subsle ip, ip, #1 + bgt average_loop + + // disable counters if we enabled them + ands r0, r0, #1 + bleq x265_cpu_disable_armv7_counter + + lsr r0, r3, #5 + cmp r0, #10 + movgt r0, #0 + pop {r4-r6,pc} +endfunc
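The mrc test above enables the ARMv7 cycle counter (scaling the repeat count by 64 when only divide-by-64 counters are available), times bursts of eight NEON-to-ARM vmov transfers, discards any round whose delta looks like a context switch (over 30 cycles per transfer), and averages four clean rounds; an average above 10 cycles per transfer returns 0. A rough C outline of that control flow (read_cycle_counter and neon_to_arm_burst8 are hypothetical stand-ins for the READ_TIME macro and the .rept 8 vmov block):

    #include <stdint.h>

    extern uint32_t read_cycle_counter(void);  /* stands in for READ_TIME */
    extern void neon_to_arm_burst8(void);      /* 8x vmov.u32 rX, d0[0] */

    static int fast_neon_mrc_test(void)
    {
        uint32_t total = 0;
        int rounds = 4, attempts = 4;
        while (attempts-- && rounds)
        {
            uint32_t t0 = read_cycle_counter();
            neon_to_arm_burst8();
            uint32_t dt = read_cycle_counter() - t0;
            if (dt > 30 * 8)
                continue;          /* assume a context switch, retry */
            total += dt;
            rounds--;
        }
        return (total / 32) <= 10; /* fast if <= ~10 cycles per transfer */
    }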
View file
x265_2.0.tar.gz/source/common/arm/dct-a.S
Added
@@ -0,0 +1,900 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Min Chen <chenm003@163.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata + +.align 4 + +.text + +.align 4 + +// dst[0 * line] = ((64 * E[0] + 64 * E[1] + add) >> shift); +// dst[2 * line] = ((64 * E[0] - 64 * E[1] + add) >> shift); +// dst[1 * line] = ((83 * O[0] + 36 * O[1] + add) >> shift); +// dst[3 * line] = ((36 * O[0] - 83 * O[1] + add) >> shift); + +/* void dct4_c(const int16_t* src, int16_t* dst, intptr_t srcStride) */ +function x265_dct_4x4_neon + mov r2, r2, lsl #1 + vld1.16 {d0}, [r0, :64], r2 // d0 = [03 02 01 00] + vld1.16 {d1}, [r0, :64], r2 // d1 = [13 12 11 10] + vld1.16 {d2}, [r0, :64], r2 // d2 = [23 22 21 20] + vld1.16 {d3}, [r0, :64] // d3 = [33 32 31 30] + + vtrn.32 q0, q1 // q0 = [31 30 11 10 21 20 01 00], q1 = [33 32 13 12 23 22 03 02] + vrev32.16 q1, q1 // q1 = [32 33 12 13 22 23 02 03] + + movconst r0, 0x00240053 + movconst r2, 0xFFAD0024 + + // DCT-1D + vadd.s16 q2, q0, q1 // q2 = [E31 E30 E11 E10 E21 E20 E01 E00] + vsub.s16 q3, q0, q1 // q3 = [O31 O30 O11 O10 O21 O20 O01 O00] + vdup.32 d16, r0 // d16 = [ 36 83] + vdup.32 d17, r2 // d17 = [-83 36] + vtrn.16 d4, d5 // d4 = [E30 E20 E10 E00], d5 = [E31 E21 E11 E01] + vtrn.32 d6, d7 // q3 = [O31 O30 O21 O20 O11 O10 O01 O00] + + vmull.s16 q9, d6, d16 + vmull.s16 q10, d7, d16 // [q9, q10] = [ 36*O1 83*O0] -> [1] + vmull.s16 q11, d6, d17 + vmull.s16 q12, d7, d17 // [q11,q12] = [-83*O1 36*O0] -> [3] + + vadd.s16 d0, d4, d5 // d0 = [E0 + E1] + vsub.s16 d1, d4, d5 // d1 = [E0 - E1] + + vpadd.s32 d18, d18, d19 // q9 = [1] + vpadd.s32 d19, d20, d21 + vpadd.s32 d20, d22, d23 // q10 = [3] + vpadd.s32 d21, d24, d25 + + vshll.s16 q1, d0, #6 // q1 = 64 * [0] + vshll.s16 q2, d1, #6 // q2 = 64 * [2] + + // TODO: Dynamic Range is 11+6-1 bits + vqrshrn.s32 d25, q9, 1 // d25 = R[13 12 11 10] + vqrshrn.s32 d24, q1, 1 // d24 = R[03 02 01 00] + vqrshrn.s32 d26, q2, 1 // d26 = R[23 22 21 20] + vqrshrn.s32 d27, q10, 1 // d27 = R[33 32 31 30] + + + // DCT-2D + vmovl.s16 q0, d16 // q0 = [ 36 83] + + vtrn.32 q12, q13 // q12 = [31 30 11 10 21 20 01 00], q13 = [33 32 13 12 23 22 03 02] + vrev32.16 q13, q13 // q13 = [32 33 12 13 22 23 02 03] + + vaddl.s16 q1, d24, d26 // q1 = [E21 E20 E01 E00] + vaddl.s16 q2, d25, d27 // q2 = [E31 E30 E11 E10] + vsubl.s16 q3, d24, d26 // q3 = [O21 O20 O01 O00] + vsubl.s16 q8, d25, d27 // q8 = [O31 O30 O11 O10] + + vtrn.32 q1, q2 // q1 = [E30 E20 E10 E00], q2 = [E31 E21 E11 E01] + vtrn.32 q3, q8 // q3 = [O30 O20 O10 O00], q8 = [O31 O21 
O11 O01] + + vmul.s32 q9, q3, d0[0] // q9 = [83*O30 83*O20 83*O10 83*O00] + vmul.s32 q10, q8, d0[1] // q10 = [36*O31 36*O21 36*O11 36*O01] + vmul.s32 q11, q3, d0[1] // q11 = [36*O30 36*O20 36*O10 36*O00] + vmul.s32 q12, q8, d0[0] // q12 = [83*O31 83*O21 83*O11 83*O01] + + vadd.s32 q0, q1, q2 // q0 = [E0 + E1] + vsub.s32 q1, q1, q2 // q1 = [E0 - E1] + + vadd.s32 q9, q9, q10 + vsub.s32 q10, q11, q12 + + vshl.s32 q0, q0, #6 // q0 = 64 * [0] + vshl.s32 q1, q1, #6 // q1 = 64 * [2] + + vqrshrn.s32 d25, q9, 8 // d25 = R[13 12 11 10] + vqrshrn.s32 d27, q10, 8 // d27 = R[33 32 31 30] + + vqrshrn.s32 d24, q0, 8 // d24 = R[03 02 01 00] + vqrshrn.s32 d26, q1, 8 // d26 = R[23 22 21 20] + + vst1.16 {d24-d27}, [r1] + + bx lr +endfunc + +/* uses registers q4 - q7 for temp values */ +.macro tr4 r0, r1, r2, r3 + vsub.s32 q8, \r0, \r3 // EO0 + vadd.s32 q9, \r0, \r3 // EE0 + vadd.s32 q10, \r1, \r2 // EE1 + vsub.s32 q11, \r1, \r2 // EO1 + + vmul.s32 \r1, q8, d0[0] // 83 * EO0 + vmul.s32 \r3, q8, d0[1] // 36 * EO0 + vshl.s32 q9, q9, #6 // 64 * EE0 + vshl.s32 q10, q10, #6 // 64 * EE1 + vmla.s32 \r1, q11, d0[1] // 83 * EO0 + 36 * EO1 + vmls.s32 \r3, q11, d0[0] // 36 * EO0 - 83 * EO1 + vadd.s32 \r0, q9, q10 // 64 * (EE0 + EE1) + vsub.s32 \r2, q9, q10 // 64 * (EE0 - EE1) +.endm + + +.macro tr8 r0, r1, r2, r3 + vmul.s32 q12, \r0, d1[1] // 89 * src1 + vmul.s32 q13, \r0, d1[0] // 75 * src1 + vmul.s32 q14, \r0, d2[1] // 50 * src1 + vmul.s32 q15, \r0, d2[0] // 18 * src1 + + vmla.s32 q12, \r1, d1[0] // 75 * src3 + vmls.s32 q13, \r1, d2[0] // -18 * src3 + vmls.s32 q14, \r1, d1[1] // -89 * src3 + vmls.s32 q15, \r1, d2[1] // -50 * src3 + + vmla.s32 q12, \r2, d2[1] // 50 * src5 + vmls.s32 q13, \r2, d1[1] // -89 * src5 + vmla.s32 q14, \r2, d2[0] // 18 * src5 + vmla.s32 q15, \r2, d1[0] // 75 * src5 + + vmla.s32 q12, \r3, d2[0] // 18 * src7 + vmls.s32 q13, \r3, d2[1] // -50 * src7 + vmla.s32 q14, \r3, d1[0] // 75 * src7 + vmls.s32 q15, \r3, d1[1] // -89 * src7 +.endm + + +// TODO: the DCT-2D stage spends 4x8=32 LD/ST operations because there is no temporary buffer +/* void dct8_c(const int16_t* src, int16_t* dst, intptr_t srcStride) */ +function x265_dct_8x8_neon + vpush {q4-q7} + + mov r2, r2, lsl #1 + + adr r3, ctr4 + vld1.16 {d0-d2}, [r3] + mov r3, r1 + + // DCT-1D + // top half + vld1.16 {q12}, [r0], r2 + vld1.16 {q13}, [r0], r2 + vld1.16 {q14}, [r0], r2 + vld1.16 {q15}, [r0], r2 + + TRANSPOSE4x4x2_16 d24, d26, d28, d30, d25, d27, d29, d31 + + // |--| + // |24| + // |26| + // |28| + // |30| + // |25| + // |27| + // |29| + // |31| + // |--| + + vaddl.s16 q4, d28, d27 + vaddl.s16 q5, d30, d25 + vaddl.s16 q2, d24, d31 + vaddl.s16 q3, d26, d29 + + tr4 q2, q3, q4, q5 + + vqrshrn.s32 d20, q3, 2 + vqrshrn.s32 d16, q2, 2 + vqrshrn.s32 d17, q4, 2 + vqrshrn.s32 d21, q5, 2 + + vsubl.s16 q2, d24, d31 + vsubl.s16 q3, d26, d29 + vsubl.s16 q4, d28, d27 + vsubl.s16 q5, d30, d25 + + tr8 q2, q3, q4, q5 + + vqrshrn.s32 d18, q12, 2 + vqrshrn.s32 d22, q13, 2 + vqrshrn.s32 d19, q14, 2 + vqrshrn.s32 d23, q15, 2 + + vstm r1!, {d16-d23} + + // bottom half + vld1.16 {q12}, [r0], r2 + vld1.16 {q13}, [r0], r2 + vld1.16 {q14}, [r0], r2 + vld1.16 {q15}, [r0], r2 + mov r2, #8*2 + + TRANSPOSE4x4x2_16 d24, d26, d28, d30, d25, d27, d29, d31 + + // |--| + // |24| + // |26| + // |28| + // |30| + // |25| + // |27| + // |29| + // |31| + // |--| + + vaddl.s16 q4, d28, d27 + vaddl.s16 q5, d30, d25 + vaddl.s16 q2, d24, d31 + vaddl.s16 q3, d26, d29 + + tr4 q2, q3, q4, q5 + + vqrshrn.s32 d20, q3, 2 + vqrshrn.s32 d16, q2, 2 + vqrshrn.s32 d17, q4, 2 + vqrshrn.s32 
d21, q5, 2 + + vsubl.s16 q2, d24, d31 + vsubl.s16 q3, d26, d29 + vsubl.s16 q4, d28, d27 + vsubl.s16 q5, d30, d25 + + tr8 q2, q3, q4, q5 + + vqrshrn.s32 d18, q12, 2 + vqrshrn.s32 d22, q13, 2 + vqrshrn.s32 d19, q14, 2 + vqrshrn.s32 d23, q15, 2 + + vstm r1, {d16-d23} + mov r1, r3 + + // DCT-2D + // left half + vld1.16 {d24}, [r1], r2 + vld1.16 {d26}, [r1], r2 + vld1.16 {d28}, [r1], r2 + vld1.16 {d30}, [r1], r2 + vld1.16 {d25}, [r1], r2 + vld1.16 {d27}, [r1], r2 + vld1.16 {d29}, [r1], r2 + vld1.16 {d31}, [r1], r2 + mov r1, r3 + + TRANSPOSE4x4x2_16 d24, d26, d28, d30, d25, d27, d29, d31 + + // |--| + // |24| + // |26| + // |28| + // |30| + // |25| + // |27| + // |29| + // |31| + // |--| + + vaddl.s16 q4, d28, d27 + vaddl.s16 q5, d30, d25 + vaddl.s16 q2, d24, d31 + vaddl.s16 q3, d26, d29 + + tr4 q2, q3, q4, q5 + + vqrshrn.s32 d18, q3, 9 + vqrshrn.s32 d16, q2, 9 + vqrshrn.s32 d20, q4, 9 + vqrshrn.s32 d22, q5, 9 + + vsubl.s16 q2, d24, d31 + vsubl.s16 q3, d26, d29 + vsubl.s16 q4, d28, d27 + vsubl.s16 q5, d30, d25 + + tr8 q2, q3, q4, q5 + + vqrshrn.s32 d17, q12, 9 + vqrshrn.s32 d19, q13, 9 + vqrshrn.s32 d21, q14, 9 + vqrshrn.s32 d23, q15, 9 + + add r3, #8 + vst1.16 {d16}, [r1], r2 + vst1.16 {d17}, [r1], r2 + vst1.16 {d18}, [r1], r2 + vst1.16 {d19}, [r1], r2 + vst1.16 {d20}, [r1], r2 + vst1.16 {d21}, [r1], r2 + vst1.16 {d22}, [r1], r2 + vst1.16 {d23}, [r1], r2 + mov r1, r3 + + + // right half + vld1.16 {d24}, [r1], r2 + vld1.16 {d26}, [r1], r2 + vld1.16 {d28}, [r1], r2 + vld1.16 {d30}, [r1], r2 + vld1.16 {d25}, [r1], r2 + vld1.16 {d27}, [r1], r2 + vld1.16 {d29}, [r1], r2 + vld1.16 {d31}, [r1], r2 + mov r1, r3 + + TRANSPOSE4x4x2_16 d24, d26, d28, d30, d25, d27, d29, d31 + + // |--| + // |24| + // |26| + // |28| + // |30| + // |25| + // |27| + // |29| + // |31| + // |--| + + vaddl.s16 q4, d28, d27 + vaddl.s16 q5, d30, d25 + vaddl.s16 q2, d24, d31 + vaddl.s16 q3, d26, d29 + + tr4 q2, q3, q4, q5 + + vqrshrn.s32 d18, q3, 9 + vqrshrn.s32 d16, q2, 9 + vqrshrn.s32 d20, q4, 9 + vqrshrn.s32 d22, q5, 9 + + vsubl.s16 q2, d24, d31 + vsubl.s16 q3, d26, d29 + vsubl.s16 q4, d28, d27 + vsubl.s16 q5, d30, d25 + + tr8 q2, q3, q4, q5 + + vqrshrn.s32 d17, q12, 9 + vqrshrn.s32 d19, q13, 9 + vqrshrn.s32 d21, q14, 9 + vqrshrn.s32 d23, q15, 9 + + vst1.16 {d16}, [r1], r2 + vst1.16 {d17}, [r1], r2 + vst1.16 {d18}, [r1], r2 + vst1.16 {d19}, [r1], r2 + vst1.16 {d20}, [r1], r2 + vst1.16 {d21}, [r1], r2 + vst1.16 {d22}, [r1], r2 + vst1.16 {d23}, [r1], r2 + + vpop {q4-q7} + bx lr +endfunc + + +.align 8 +pw_tr16: .hword 90, 87, 80, 70, 57, 43, 25, 9 // q0 = [ 9 25 43 57 70 80 87 90] + .hword 83, 36, 75, 89, 18, 50, 00, 00 // q1 = [ x x 50 18 89 75 36 83] + +.align 8 +ctr4: + .word 83 // d0[0] = 83 + .word 36 // d0[1] = 36 +ctr8: + .word 75 // d1[0] = 75 + .word 89 // d1[1] = 89 + .word 18 // d2[0] = 18 + .word 50 // d2[1] = 50 +ctr16: + .word 90, 87 // d0 + .word 80, 70 // d1 + .word 57, 43 // d2 + .word 25, 9 // d3 + +/* void dct16_c(const int16_t* src, int16_t* dst, intptr_t srcStride) */ +function x265_dct_16x16_neon + push {lr} + + // fill 3 pipeline stall cycles (dependency link on SP) + add r2, r2 + adr r3, pw_tr16 + mov r12, #16/4 + + vpush {q4-q7} + + // TODO: 16x16 transpose buffer (may share with input buffer in future) + sub sp, #16*16*2 + + vld1.16 {d0-d3}, [r3] + mov r3, sp + mov lr, #4*16*2 + + // DCT-1D +.loop1: + // Row[0-3] + vld1.16 {q8-q9}, [r0, :64], r2 // q8 = [07 06 05 04 03 02 01 00], q9 = [0F 0E 0D 0C 0B 0A 09 08] + vld1.16 {q10-q11}, [r0, :64], r2 // q10 = [17 16 15 14 13 12 11 10], q11 = [1F 1E 1D 1C 1B 
1A 19 18] + vld1.16 {q12-q13}, [r0, :64], r2 // q12 = [27 26 25 24 23 22 21 20], q13 = [2F 2E 2D 2C 2B 2A 29 28] + vld1.16 {q14-q15}, [r0, :64], r2 // q14 = [37 36 35 34 33 32 31 30], q15 = [3F 3E 3D 3C 3B 3A 39 38] + + // Register map + // | 16 17 18 19 | + // | 20 21 22 23 | + // | 24 25 26 27 | + // | 28 29 30 31 | + + // Transpose 16x4 + vtrn.32 q8, q12 // q8 = [25 24 05 04 21 20 01 00], q12 = [27 26 07 06 23 22 03 02] + vtrn.32 q10, q14 // q10 = [35 34 15 14 31 30 11 10], q14 = [37 36 17 16 33 32 13 12] + vtrn.32 q9, q13 // q9 = [2D 2C 0D 0C 29 28 09 08], q13 = [2F 2E 0F 0E 2B 2A 0B 0A] + vtrn.32 q11, q15 // q11 = [3D 3C 1D 1C 39 38 19 18], q15 = [3F 3E 1F 1E 3B 3A 1B 1A] + + vtrn.16 q8, q10 // q8 = [34 24 14 04 30 20 10 00], q10 = [35 25 15 05 31 21 11 01] + vtrn.16 q12, q14 // q12 = [36 26 16 06 32 22 12 02], q14 = [37 27 17 07 33 23 13 03] + vtrn.16 q13, q15 // q13 = [3E 2E 1E 0E 3A 2A 1A 0A], q15 = [3F 2F 1F 0F 3B 2B 1B 0B] + vtrn.16 q9, q11 // q9 = [3C 2C 1C 0C 38 28 18 08], q11 = [3D 2D 1D 0D 39 29 19 09] + + vswp d26, d27 // q13 = [3A 2A 1A 0A 3E 2E 1E 0E] + vswp d30, d31 // q15 = [3B 2B 1B 0B 3F 2F 1F 0F] + vswp d18, d19 // q9 = [38 28 18 08 3C 2C 1C 0C] + vswp d22, d23 // q11 = [39 29 19 09 3D 2D 1D 0D] + + // E[0-7] - 10 bits + vadd.s16 q4, q8, q15 // q4 = [E4 E0] + vadd.s16 q5, q10, q13 // q5 = [E5 E1] + vadd.s16 q6, q12, q11 // q6 = [E6 E2] + vadd.s16 q7, q14, q9 // q7 = [E7 E3] + + // O[0-7] - 10 bits + vsub.s16 q8, q8, q15 // q8 = [O4 O0] + vsub.s16 q9, q14, q9 // q9 = [O7 O3] + vsub.s16 q10, q10, q13 // q10 = [O5 O1] + vsub.s16 q11, q12, q11 // q11 = [O6 O2] + + // reorder Ex for EE/EO + vswp d9, d14 // q4 = [E3 E0], q7 = [E7 E4] + vswp d11, d12 // q5 = [E2 E1], q6 = [E6 E5] + vswp d14, d15 // q7 = [E4 E7] + vswp d12, d13 // q6 = [E5 E6] + + // EE[0-3] - 11 bits + vadd.s16 q2, q4, q7 // q2 = [EE3 EE0] + vadd.s16 q3, q5, q6 // q3 = [EE2 EE1] + + // EO[0-3] - 11 bits + vsub.s16 q4, q4, q7 // q4 = [EO3 EO0] + vsub.s16 q5, q5, q6 // q5 = [EO2 EO1] + + // EEx[0-1] - 12 bits + vadd.s16 d12, d4, d5 // q6 = [EEE1 EEE0] + vadd.s16 d13, d6, d7 + vsub.s16 d14, d4, d5 // q7 = [EEO1 EEO0] + vsub.s16 d15, d6, d7 + + // NEON Register map + // Ex -> [q4, q5, q6, q7], Ox -> [q8, q9, q10, q11], Const -> [q0, q1], Free -> [q2, q3, q12, q13, q14, q15] + + // ODD[4,12] + vmull.s16 q14, d14, d2[0] // q14 = EEO0 * 83 + vmull.s16 q15, d14, d2[1] // q15 = EEO0 * 36 + vmlal.s16 q14, d15, d2[1] // q14+= EEO1 * 36 + vmlsl.s16 q15, d15, d2[0] // q15+= EEO1 *-83 + + vadd.s16 d4, d12, d13 // d4 = (EEE0 + EEE1) + vsub.s16 d12, d13 // d12 = (EEE0 - EEE1) + + // Row + vmull.s16 q12, d16, d0[0] // q12 = O0 * 90 + vmull.s16 q13, d8, d2[3] // q13 = EO0 * 89 + vqrshrn.s32 d14, q14, 3 + vqrshrn.s32 d15, q15, 3 // q7 = [12 4] -> [12 4] + vmull.s16 q14, d16, d0[1] // q14 = O0 * 87 + vmull.s16 q15, d16, d0[2] // q15 = O0 * 80 + vshll.s16 q2, d4, #6 // q2 = (EEE0 + EEE1) * 64 -> [ 0] + vshll.s16 q6, d12, #6 // q6 = (EEE0 - EEE1) * 64 -> [ 8] + + vmlal.s16 q12, d20, d0[1] // q12+= O1 * 87 + vmlal.s16 q13, d10, d2[2] // q13+= EO1 * 75 + vmlal.s16 q14, d20, d1[0] // q14+= O1 * 57 + vmlal.s16 q15, d20, d1[3] // q15+= O1 * 9 + vqrshrn.s32 d4, q2, 3 // q2 = [- 0] + vqrshrn.s32 d12, q6, 3 // q6 = [- 8] + + vmlal.s16 q12, d22, d0[2] // q12+= O2 * 80 + vmlal.s16 q13, d11, d3[1] // q13+= EO2 * 50 + vmlal.s16 q14, d22, d1[3] // q14+= O2 * 9 + vmlsl.s16 q15, d22, d0[3] // q15+= O2 *-70 + + vmlal.s16 q12, d18, d0[3] // q12+= O3 * 70 + vmlal.s16 q13, d9, d3[0] // q13+= EO3 * 18 -> [ 2] + vmlsl.s16 q14, d18, d1[1] // q14+= O3 
*-43 + vmlsl.s16 q15, d18, d0[1] // q15+= O3 *-87 + + vmlal.s16 q12, d17, d1[0] // q12+= O4 * 57 + vmlsl.s16 q14, d17, d0[2] // q14+= O4 *-80 + vmlsl.s16 q15, d17, d1[2] // q15+= O4 *-25 + vqrshrn.s32 d6, q13, 3 // q3 = [- 2] + vmull.s16 q13, d8, d2[2] // q13 = EO0 * 75 + + vmlal.s16 q12, d21, d1[1] // q12+= O5 * 43 + vmlsl.s16 q13, d10, d3[0] // q13+= EO1 *-18 + vmlsl.s16 q14, d21, d0[0] // q14+= O5 *-90 + vmlal.s16 q15, d21, d1[0] // q15+= O5 * 57 + + vmlal.s16 q12, d23, d1[2] // q12+= O6 * 25 + vmlsl.s16 q13, d11, d2[3] // q13+= EO2 *-89 + vmlsl.s16 q14, d23, d0[3] // q14+= O6 *-70 + vmlal.s16 q15, d23, d0[0] // q15+= O6 * 90 + + vmlal.s16 q12, d19, d1[3] // q12+= O7 * 9 -> [ 1] + vmlsl.s16 q13, d9, d3[1] // q13+= EO3 *-50 -> [ 6] + vmlsl.s16 q14, d19, d1[2] // q14+= O7 *-25 -> [ 3] + vmlal.s16 q15, d19, d1[1] // q15+= O7 * 43 -> [ 5] + vqrshrn.s32 d5, q12, 3 // q2 = [1 0] + + vmull.s16 q12, d16, d0[3] // q12 = O0 * 70 + vqrshrn.s32 d7, q14, 3 // q3 = [3 2] + vmull.s16 q14, d16, d1[0] // q14 = O0 * 57 + + vmlsl.s16 q12, d20, d1[1] // q12+= O1 *-43 + vmlsl.s16 q14, d20, d0[2] // q14+= O1 *-80 + + vmlsl.s16 q12, d22, d0[1] // q12+= O2 *-87 + vmlsl.s16 q14, d22, d1[2] // q14+= O2 *-25 + + vmlal.s16 q12, d18, d1[3] // q12+= O3 * 9 + vmlal.s16 q14, d18, d0[0] // q14+= O3 * 90 + + // Row[0-3] + vst4.16 {d4-d7}, [r3], lr + + vqrshrn.s32 d5, q15, 3 // q2 = [5 -] + vqrshrn.s32 d6, q13, 3 // q3 = [- 6] + vmull.s16 q13, d8, d3[1] // q13 = EO0 * 50 + vmlal.s16 q12, d17, d0[0] // q12+= O4 * 90 + vmlsl.s16 q14, d17, d1[3] // q14+= O4 *-9 + vmull.s16 q15, d16, d1[1] // q15 = O0 * 43 + + vmlsl.s16 q13, d10, d2[3] // q13+= EO1 *-89 + vmlal.s16 q12, d21, d1[2] // q12+= O5 * 25 + vmlsl.s16 q14, d21, d0[1] // q14+= O5 *-87 + vmlsl.s16 q15, d20, d0[0] // q15+= O1 *-90 + + vmlal.s16 q13, d11, d3[0] // q13+= EO2 * 18 + vmlsl.s16 q12, d23, d0[2] // q12+= O6 *-80 + vmlal.s16 q14, d23, d1[1] // q14+= O6 * 43 + vmlal.s16 q15, d22, d1[0] // q15+= O2 * 57 + + vmlal.s16 q13, d9, d2[2] // q13+= EO3 * 75 -> [10] + vmlsl.s16 q12, d19, d1[0] // q12+= O7 *-57 -> [ 7] + vmlal.s16 q14, d19, d0[3] // q14+= O7 * 70 -> [ 9] + vmlal.s16 q15, d18, d1[2] // q15+= O3 * 25 + vmlsl.s16 q15, d17, d0[1] // q15+= O4 *-87 + vmlal.s16 q15, d21, d0[3] // q15+= O5 * 70 + vmlal.s16 q15, d23, d1[3] // q15+= O6 * 9 + vmlsl.s16 q15, d19, d0[2] // q15+= O7 *-80 -> [11] + vmov d4, d14 // q2 = [5 4] + vqrshrn.s32 d14, q13, 3 // q7 = [12 10] + vmull.s16 q13, d8, d3[0] // q13 = EO0 * 18 + vqrshrn.s32 d7, q12, 3 // q3 = [7 6] + vmull.s16 q12, d16, d1[2] // q12 = O0 * 25 + vmlsl.s16 q13, d9, d2[3] // q13 = EO3 *-89 + vmull.s16 q4, d16, d1[3] // q4 = O0 * 9 + vmlsl.s16 q12, d20, d0[3] // q12+= O1 *-70 + vmlsl.s16 q13, d10, d3[1] // q13 = EO1 *-50 + vmlsl.s16 q4, d20, d1[2] // q4 += O1 *-25 + vmlal.s16 q12, d22, d0[0] // q12+= O2 * 90 + vmlal.s16 q13, d11, d2[2] // q13 = EO2 * 75 -> [14] + vmlal.s16 q4, d22, d1[1] // q4 += O2 * 43 + vmlsl.s16 q12, d18, d0[2] // q12+= O3 *-80 + vmlsl.s16 q4, d18, d1[0] // q4 += O3 *-57 + vmlal.s16 q12, d17, d1[1] // q12+= O4 * 43 + vqrshrn.s32 d13, q14, 3 // q6 = [9 8] + vmov d28, d15 // q14 = [- 12] + vqrshrn.s32 d15, q15, 3 // q7 = [11 10] + vqrshrn.s32 d30, q13, 3 // q15 = [- 14] + vmlal.s16 q4, d17, d0[3] // q4 += O4 * 70 + vmlal.s16 q12, d21, d1[3] // q12+= O5 * 9 + vmlsl.s16 q4, d21, d0[2] // q4 += O5 *-80 + vmlsl.s16 q12, d23, d1[0] // q12+= O6 *-57 + vmlal.s16 q4, d23, d0[1] // q4 += O6 * 87 + vmlal.s16 q12, d19, d0[1] // q12+= O7 * 87 -> [13] + vmlsl.s16 q4, d19, d0[0] // q4 += O7 *-90 -> [15] + + // Row[4-7] 
+ vst4.16 {d4-d7}, [r3], lr + vqrshrn.s32 d29, q12, 3 // q14 = [13 12] + vqrshrn.s32 d31, q4, 3 // q15 = [15 14] + + // Row[8-11] + vst4.16 {d12-d15}, [r3], lr + + // Row[12-15] + vst4.16 {d28-d31}, [r3]! + + + // loop into next process group + sub r3, #3*4*16*2 + subs r12, #1 + bgt .loop1 + + + // DCT-2D + // r[0,2,3,12,lr], q[2-15] are free here + mov r2, sp // r3 -> internal temporary buffer + mov r3, #16*2*2 + mov r12, #16/4 // process 4 rows per loop + +.loop2: + vldm r2, {q8-q15} + + // d16 = [30 20 10 00] + // d17 = [31 21 11 01] + // d18 = [32 22 12 02] + // d19 = [33 23 13 03] + // d20 = [34 24 14 04] + // d21 = [35 25 15 05] + // d22 = [36 26 16 06] + // d23 = [37 27 17 07] + // d24 = [38 28 18 08] + // d25 = [39 29 19 09] + // d26 = [3A 2A 1A 0A] + // d27 = [3B 2B 1B 0B] + // d28 = [3C 2C 1C 0C] + // d29 = [3D 2D 1D 0D] + // d30 = [3E 2E 1E 0E] + // d31 = [3F 2F 1F 0F] + + // NOTE: ARM does not have enough SIMD registers, so the Even & Odd parts are processed in series. + + // Process Even + + // E + vaddl.s16 q2, d16, d31 // q2 = [E30 E20 E10 E00] + vaddl.s16 q3, d17, d30 // q3 = [E31 E21 E11 E01] + vaddl.s16 q4, d18, d29 // q4 = [E32 E22 E12 E02] + vaddl.s16 q5, d19, d28 // q5 = [E33 E23 E13 E03] + vaddl.s16 q9, d23, d24 // q9 = [E37 E27 E17 E07] + vaddl.s16 q8, d22, d25 // q8 = [E36 E26 E16 E06] + vaddl.s16 q7, d21, d26 // q7 = [E35 E25 E15 E05] + vaddl.s16 q6, d20, d27 // q6 = [E34 E24 E14 E04] + + // EE & EO + vadd.s32 q13, q2, q9 // q13 = [EE30 EE20 EE10 EE00] + vsub.s32 q9, q2, q9 // q9 = [EO30 EO20 EO10 EO00] + + vadd.s32 q2, q5, q6 // q2 = [EE33 EE23 EE13 EE03] + vsub.s32 q12, q5, q6 // q12 = [EO33 EO23 EO13 EO03] + + vadd.s32 q14, q3, q8 // q14 = [EE31 EE21 EE11 EE01] + vsub.s32 q10, q3, q8 // q10 = [EO31 EO21 EO11 EO01] + + vadd.s32 q15, q4, q7 // q15 = [EE32 EE22 EE12 EE02] + vsub.s32 q11, q4, q7 // q11 = [EO32 EO22 EO12 EO02] + + // Free=[3,4,5,6,7,8] + + // EEE & EEO + vadd.s32 q5, q13, q2 // q5 = [EEE30 EEE20 EEE10 EEE00] + vadd.s32 q6, q14, q15 // q6 = [EEE31 EEE21 EEE11 EEE01] + vsub.s32 q7, q13, q2 // q7 = [EEO30 EEO20 EEO10 EEO00] + vsub.s32 q8, q14, q15 // q8 = [EEO31 EEO21 EEO11 EEO01] + + // Convert Const for Dct EE to 32-bits + adr r0, ctr4 + vld1.32 {d0-d3}, [r0, :64] + + // Register Map (Qx) + // Free=[2,3,4,13,14,15], Const=[0,1], EEEx=[5,6,7,8], EO=[9,10,11,12] + + vadd.s32 q15, q5, q6 // q15 = EEE0 + EEE1 -> 0 + vmul.s32 q2, q9, d1[1] // q2 = EO0 * 89 -> 2 + vmul.s32 q3, q7, d0[0] // q3 = EEO0 * 83 -> 4 + vmul.s32 q4, q9, d1[0] // q4 = EO0 * 75 -> 6 + vmul.s32 q14, q9, d2[1] // q14 = EO0 * 50 -> 10 + + vshl.s32 q15, #6 // q15 -> [ 0]' + vmla.s32 q2, q10, d1[0] // q2 += EO1 * 75 + vmla.s32 q3, q8, d0[1] // q3 += EEO1 * 36 -> [ 4]' + vmls.s32 q4, q10, d2[0] // q4 += EO1 *-18 + vmls.s32 q14, q10, d1[1] // q14+= EO1 *-89 + vmul.s32 q13, q7, d0[1] // q13 = EEO0 * 36 -> 12 + + vqrshrn.s32 d30, q15, 10 // d30 -> [ 0] + vqrshrn.s32 d31, q3, 10 // d31 -> [ 4] + vmls.s32 q4, q11, d1[1] // q4 += EO2 *-89 + vsub.s32 q3, q5, q6 // q3 = EEE0 - EEE1 -> 8 + vmla.s32 q2, q11, d2[1] // q2 += EO2 * 50 + vmla.s32 q14, q11, d2[0] // q14+= EO2 * 18 + vmls.s32 q13, q8, d0[0] // q13+= EEO1 *-83 -> [12]' + vst1.16 {d30}, [r1], r3 // Store [ 0] + + vshl.s32 q3, #6 // q3 -> [ 8]' + vmls.s32 q4, q12, d2[1] // q4 += EO3 *-50 -> [ 6]' + vmla.s32 q2, q12, d2[0] // q2 += EO3 * 18 -> [ 2]' + vqrshrn.s32 d26, q13, 10 // d26 -> [12] + vmla.s32 q14, q12, d1[0] // q14+= EO3 * 75 -> [10]' + + vqrshrn.s32 d30, q3, 10 // d30 -> [ 8] + vmul.s32 q3, q9, d2[0] // q3 = EO0 * 18 -> 14 + vqrshrn.s32 
d4, q2, 10 // d4 -> [ 2] + vmls.s32 q3, q10, d2[1] // q3 += EO1 *-50 + vqrshrn.s32 d5, q4, 10 // d5 -> [ 6] + vmla.s32 q3, q11, d1[0] // q3 += EO2 * 75 + vqrshrn.s32 d27, q14, 10 // d27 -> [10] + vmls.s32 q3, q12, d1[1] // q3 += EO3 *-89 -> [14]' + + vst1.16 {d4 }, [r1], r3 // Store [ 2] + vst1.16 {d31}, [r1], r3 // Store [ 4] + vst1.16 {d5 }, [r1], r3 // Store [ 6] + vst1.16 {d30}, [r1], r3 // Store [ 8] + vqrshrn.s32 d30, q3, 10 // d30 -> [14] + vst1.16 {d27}, [r1], r3 // Store [10] + vst1.16 {d26}, [r1], r3 // Store [12] + vst1.16 {d30}, [r1], r3 // Store [14] + + // Process Odd + sub r1, #(15*16)*2 + vldm r2!, {q8-q15} + + // d16 = [30 20 10 00] + // d17 = [31 21 11 01] + // d18 = [32 22 12 02] + // d19 = [33 23 13 03] + // d20 = [34 24 14 04] + // d21 = [35 25 15 05] + // d22 = [36 26 16 06] + // d23 = [37 27 17 07] + // d24 = [38 28 18 08] + // d25 = [39 29 19 09] + // d26 = [3A 2A 1A 0A] + // d27 = [3B 2B 1B 0B] + // d28 = [3C 2C 1C 0C] + // d29 = [3D 2D 1D 0D] + // d30 = [3E 2E 1E 0E] + // d31 = [3F 2F 1F 0F] + + // O + vsubl.s16 q2, d16, d31 // q2 = [O30 O20 O10 O00] + vsubl.s16 q3, d17, d30 // q3 = [O31 O21 O11 O01] + vsubl.s16 q4, d18, d29 // q4 = [O32 O22 O12 O02] + vsubl.s16 q5, d19, d28 // q5 = [O33 O23 O13 O03] + vsubl.s16 q9, d23, d24 // q9 = [O37 O27 O17 O07] + vsubl.s16 q8, d22, d25 // q8 = [O36 O26 O16 O06] + vsubl.s16 q7, d21, d26 // q7 = [O35 O25 O15 O05] + vsubl.s16 q6, d20, d27 // q6 = [O34 O24 O14 O04] + + // Load DCT Ox Constant + adr r0, ctr16 + vld1.32 {d0-d3}, [r0] + + // Register Map (Qx) + // Free=[10,11,12,13,14,15], Const=[0,1], O=[2,3,4,5,6,7,8,9] + + vmul.s32 q10, q2, d0[0] // q10 = O0 * 90 -> 1 + vmul.s32 q11, q2, d0[1] // q11 = O0 * 87 -> 3 + vmul.s32 q12, q2, d1[0] // q12 = O0 * 80 -> 5 + vmul.s32 q13, q2, d1[1] // q13 = O0 * 70 -> 7 + vmul.s32 q14, q2, d2[0] // q14 = O0 * 57 -> 9 + vmul.s32 q15, q2, d2[1] // q15 = O0 * 43 -> 11 + + vmla.s32 q10, q3, d0[1] // q10+= O1 * 87 + vmla.s32 q11, q3, d2[0] // q11+= O1 * 57 + vmla.s32 q12, q3, d3[1] // q12+= O1 * 9 + vmls.s32 q13, q3, d2[1] // q13+= O1 *-43 + vmls.s32 q14, q3, d1[0] // q14+= O1 *-80 + vmls.s32 q15, q3, d0[0] // q15+= O1 *-90 + + vmla.s32 q10, q4, d1[0] // q10+= O2 * 80 + vmla.s32 q11, q4, d3[1] // q11+= O2 * 9 + vmls.s32 q12, q4, d1[1] // q12+= O2 *-70 + vmls.s32 q13, q4, d0[1] // q13+= O2 *-87 + vmls.s32 q14, q4, d3[0] // q14+= O2 *-25 + vmla.s32 q15, q4, d2[0] // q15+= O2 * 57 + + vmla.s32 q10, q5, d1[1] // q10+= O3 * 70 + vmls.s32 q11, q5, d2[1] // q11+= O3 *-43 + vmls.s32 q12, q5, d0[1] // q12+= O3 *-87 + vmla.s32 q13, q5, d3[1] // q13+= O3 * 9 + vmla.s32 q14, q5, d0[0] // q14+= O3 * 90 + vmla.s32 q15, q5, d3[0] // q15+= O3 * 25 + + vmla.s32 q10, q6, d2[0] // q10+= O4 * 57 + vmls.s32 q11, q6, d1[0] // q11+= O4 *-80 + vmls.s32 q12, q6, d3[0] // q12+= O4 *-25 + vmla.s32 q13, q6, d0[0] // q13+= O4 * 90 + vmls.s32 q14, q6, d3[1] // q14+= O4 *-9 + vmls.s32 q15, q6, d0[1] // q15+= O4 *-87 + + vmla.s32 q10, q7, d2[1] // q10+= O5 * 43 + vmls.s32 q11, q7, d0[0] // q11+= O5 *-90 + vmla.s32 q12, q7, d2[0] // q12+= O5 * 57 + vmla.s32 q13, q7, d3[0] // q13+= O5 * 25 + vmls.s32 q14, q7, d0[1] // q14+= O5 *-87 + vmla.s32 q15, q7, d1[1] // q15+= O5 * 70 + + vmla.s32 q10, q8, d3[0] // q10+= O6 * 25 + vmls.s32 q11, q8, d1[1] // q11+= O6 *-70 + vmla.s32 q12, q8, d0[0] // q12+= O6 * 90 + vmls.s32 q13, q8, d1[0] // q13+= O6 *-80 + vmla.s32 q14, q8, d2[1] // q14+= O6 * 43 + vmla.s32 q15, q8, d3[1] // q15+= O6 * 9 + + vmla.s32 q10, q9, d3[1] // q10+= O7 * 9 -> [ 1]' + vmls.s32 q11, q9, d3[0] // q11+= O7 *-25 -> [ 
3]' + vmla.s32 q12, q9, d2[1] // q12+= O7 * 43 -> [ 5]' + vqrshrn.s32 d20, q10, 10 // d20 -> [ 1] + vmls.s32 q13, q9, d2[0] // q13+= O7 *-57 -> [ 7]' + vqrshrn.s32 d21, q11, 10 // d21 -> [ 3] + + vmul.s32 q11, q2, d3[0] // q11 = O0 * 25 -> 13 + vmul.s32 q2, q2, d3[1] // q2 = O0 * 9 -> 15 + + vst1.16 {d20}, [r1], r3 // Store [ 1] + vst1.16 {d21}, [r1], r3 // Store [ 3] + + vmls.s32 q11, q3, d1[1] // q11+= O1 *-70 + vmls.s32 q2, q3, d3[0] // q2 += O1 *-25 + + vmla.s32 q14, q9, d1[1] // q14+= O7 * 70 -> [ 9]' + vmls.s32 q15, q9, d1[0] // q15+= O7 *-80 -> [11]' + + vqrshrn.s32 d24, q12, 10 // d24 -> [ 5] + + vqrshrn.s32 d25, q13, 10 // d25 -> [ 7] + vqrshrn.s32 d28, q14, 10 // d28 -> [ 9] + vqrshrn.s32 d29, q15, 10 // d29 -> [11] + + vst1.16 {d24}, [r1], r3 // Store [ 5] + vst1.16 {d25}, [r1], r3 // Store [ 7] + vst1.16 {d28}, [r1], r3 // Store [ 9] + vst1.16 {d29}, [r1], r3 // Store [11] + + vmla.s32 q11, q4, d0[0] // q11+= O2 * 90 + vmla.s32 q2, q4, d2[1] // q2 += O2 * 43 + + vmls.s32 q11, q5, d1[0] // q11+= O3 *-80 + vmls.s32 q2, q5, d2[0] // q2 += O3 *-57 + + vmla.s32 q11, q6, d2[1] // q11+= O4 * 43 + vmla.s32 q2, q6, d1[1] // q2 += O4 * 70 + + vmla.s32 q11, q7, d3[1] // q11+= O5 * 9 + vmls.s32 q2, q7, d1[0] // q2 += O5 *-80 + + vmls.s32 q11, q8, d2[0] // q11+= O6 *-57 + vmla.s32 q2, q8, d0[1] // q2 += O6 * 87 + + vmla.s32 q11, q9, d0[1] // q11+= O7 * 87 -> [13]' + vmls.s32 q2, q9, d0[0] // q2 += O7 *-90 -> [15]' + + vqrshrn.s32 d6, q11, 10 // d6 -> [13] + vqrshrn.s32 d7, q2, 10 // d7 -> [15] + vst1.16 {d6}, [r1], r3 // Store [13] + vst1.16 {d7}, [r1], r3 // Store [15] + + sub r1, #(17*16-4)*2 + subs r12, #1 + bgt .loop2 + + add sp, #16*16*2 + vpop {q4-q7} + pop {pc} +endfunc +
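The coefficient pairs used throughout this file (64 and 83/36 for the 4-point stage, 89/75/50/18 for the 8-point odd part, 90/87/80/70/57/43/25/9 for the 16-point odd part, matching the ctr4/ctr8/ctr16 tables) are the standard HEVC forward-DCT factors. As a cross-check for the 4-point kernel, the scalar butterfly spelled out in the comments at the top of the file reads as follows in C (a reference sketch mirroring those comments, not code taken from this diff):

    #include <stdint.h>

    /* reference 4-point forward DCT stage, per the dst[] comments above */
    static void partialButterfly4_ref(const int16_t* src, int16_t* dst,
                                      int shift, int line)
    {
        int add = 1 << (shift - 1);
        for (int j = 0; j < line; j++, src += 4, dst++)
        {
            /* E and O: even/odd decomposition of the 4 inputs */
            int E[2], O[2];
            E[0] = src[0] + src[3];
            O[0] = src[0] - src[3];
            E[1] = src[1] + src[2];
            O[1] = src[1] - src[2];

            dst[0 * line] = (int16_t)((64 * E[0] + 64 * E[1] + add) >> shift);
            dst[2 * line] = (int16_t)((64 * E[0] - 64 * E[1] + add) >> shift);
            dst[1 * line] = (int16_t)((83 * O[0] + 36 * O[1] + add) >> shift);
            dst[3 * line] = (int16_t)((36 * O[0] - 83 * O[1] + add) >> shift);
        }
    }

The NEON version fuses two such passes (rows, then columns) with in-register transposes, which is why the 4x4 function rounds with shift 1 after the first pass and shift 8 after the second.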
View file
x265_2.0.tar.gz/source/common/arm/dct8.h
Added
@@ -0,0 +1,32 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Min Chen <chenm003@163.com> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_DCT8_ARM_H +#define X265_DCT8_ARM_H + +void PFX(dct_4x4_neon)(const int16_t* src, int16_t* dst, intptr_t srcStride); +void PFX(dct_8x8_neon)(const int16_t* src, int16_t* dst, intptr_t srcStride); +void PFX(dct_16x16_neon)(const int16_t* src, int16_t* dst, intptr_t srcStride); + +#endif // ifndef X265_DCT8_ARM_H
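Unlike blockcopy8.h, this header declares the functions through the PFX() macro rather than with literal x265_ prefixes. Given that dct-a.S defines x265_dct_4x4_neon and friends, PFX presumably expands to the x265_ namespace prefix; the macro itself lives in x265's common headers, outside this diff, so the expansion below is an assumption for illustration:

    #include <stdint.h>

    /* assumed expansion, for illustration only */
    #define PFX(name) x265_ ## name

    /* so this declaration ... */
    void PFX(dct_4x4_neon)(const int16_t* src, int16_t* dst, intptr_t srcStride);
    /* ... resolves to the symbol defined in dct-a.S:
       void x265_dct_4x4_neon(const int16_t*, int16_t*, intptr_t); */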
View file
x265_2.0.tar.gz/source/common/arm/intrapred.h
Added
@@ -0,0 +1,31 @@ +/***************************************************************************** + * intrapred.h: Intra Prediction metrics + ***************************************************************************** + * Copyright (C) 2003-2013 x264 project + * + * Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> + * Praveen Kumar Tiwari <praveen@multicorewareinc.com> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_INTRAPRED_ARM_H +#define X265_INTRAPRED_ARM_H + +#endif // ifndef X265_INTRAPRED_ARM_H
View file
x265_2.0.tar.gz/source/common/arm/ipfilter8.S
Added
@@ -0,0 +1,3341 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Dnyaneshwar G <dnyaneshwar@multicorewareinc.com> + * Radhakrishnan VR <radhakrishnan@multicorewareinc.com> + * Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata +.align 4 + +g_lumaFilter: +.word 0,0,0,0,0,0,64,64,0,0,0,0,0,0,0,0 +.word -1,-1,4,4,-10,-10,58,58,17,17,-5,-5,1,1,0,0 +.word -1,-1,4,4,-11,-11,40,40,40,40,-11,-11,4,4,-1,-1 +.word 0,0,1,1,-5,-5,17,17,58,58,-10,-10,4,4,-1,-1 +g_chromaFilter: +.word 0, 0, 64, 64, 0, 0, 0, 0 +.word -2, -2, 58, 58, 10, 10, -2, -2 +.word -4, -4, 54, 54, 16, 16, -2, -2 +.word -6, -6, 46, 46, 28, 28, -4, -4 +.word -4, -4, 36, 36, 36, 36, -4 ,-4 +.word -4, -4, 28, 28, 46, 46, -6, -6 +.word -2, -2, 16, 16, 54, 54, -4 ,-4 +.word -2, -2, 10, 10, 58, 58, -2, -2 + + +.text + +// filterPixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride) +function x265_filterPixelToShort_4x4_neon + vld1.u32 {d0[]}, [r0], r1 + vld1.u32 {d0[1]}, [r0], r1 + vld1.u32 {d1[]}, [r0], r1 + vld1.u32 {d1[1]}, [r0], r1 + + // avoid load pipeline stall + vmov.i16 q1, #0xE000 + + vshll.u8 q2, d0, #6 + vshll.u8 q3, d1, #6 + vadd.i16 q2, q1 + vadd.i16 q3, q1 + + add r3, r3 + vst1.16 {d4}, [r2], r3 + vst1.16 {d5}, [r2], r3 + vst1.16 {d6}, [r2], r3 + vst1.16 {d7}, [r2], r3 + + bx lr +endfunc + +function x265_filterPixelToShort_4x8_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 4 + vld1.u8 {d0}, [r0], r1 + vld1.u8 {d2}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {d4}, [r2], r3 + vst1.16 {d6}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_4x16_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 8 + vld1.u8 {d0}, [r0], r1 + vld1.u8 {d2}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {d4}, [r2], r3 + vst1.16 {d6}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_8x4_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 2 + vld1.u8 {d0}, [r0], r1 + vld1.u8 {d2}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2}, [r2], r3 + vst1.16 {q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_8x8_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, 
q9 +.rept 4 + vld1.u8 {d0}, [r0], r1 + vld1.u8 {d2}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2}, [r2], r3 + vst1.16 {q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_8x16_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 8 + vld1.u8 {d0}, [r0], r1 + vld1.u8 {d2}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2}, [r2], r3 + vst1.16 {q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_8x32_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 16 + vld1.u8 {d0}, [r0], r1 + vld1.u8 {d2}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2}, [r2], r3 + vst1.16 {q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_12x16_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 16 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {d4, d5, d6}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_16x4_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 4 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_16x8_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 8 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_16x12_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 12 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_16x16_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 16 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_16x32_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 32 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_16x64_neon + add r3, r3 + vmov.u16 q8, #64 + vmov.u16 q9, #8192 + vneg.s16 q9, q9 +.rept 64 + vld1.u8 {d2-d3}, [r0], r1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + vmov q2, q9 + vmov q3, q9 + vmla.s16 q2, q0, q8 + vmla.s16 q3, q1, q8 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_24x32_neon + add r3, r3 + sub r3, #32 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 +.rept 32 + vld1.u8 {d18, d19, d20}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! 
+ vmov q2, q1 + vmla.s16 q2, q11, q0 + vst1.16 {q2}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_32x8_neon + add r3, r3 + sub r3, #32 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 +.rept 8 + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bx lr +endfunc + +function x265_filterPixelToShort_32x16_neon + add r3, r3 + sub r3, #32 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #8 +.loop_filterP2S_32x16: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_32x16 + bx lr +endfunc + +function x265_filterPixelToShort_32x24_neon + add r3, r3 + sub r3, #32 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #12 +.loop_filterP2S_32x24: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_32x24 + bx lr +endfunc + +function x265_filterPixelToShort_32x32_neon + add r3, r3 + sub r3, #32 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #16 +.loop_filterP2S_32x32: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_32x32 + bx lr +endfunc + +function x265_filterPixelToShort_32x64_neon + add r3, r3 + sub r3, #32 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #32 +.loop_filterP2S_32x64: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_32x64 + bx lr +endfunc + +function x265_filterPixelToShort_64x16_neon + add r3, r3 + sub r1, #32 + sub r3, #96 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #8 +.loop_filterP2S_64x16: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0]! + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2]! + + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! 
+ vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_64x16 + bx lr +endfunc + +function x265_filterPixelToShort_64x32_neon + add r3, r3 + sub r1, #32 + sub r3, #96 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #16 +.loop_filterP2S_64x32: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0]! + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2]! + + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_64x32 + bx lr +endfunc + +function x265_filterPixelToShort_64x48_neon + add r3, r3 + sub r1, #32 + sub r3, #96 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #24 +.loop_filterP2S_64x48: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0]! + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2]! + + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_64x48 + bx lr +endfunc + +function x265_filterPixelToShort_64x64_neon + add r3, r3 + sub r1, #32 + sub r3, #96 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #32 +.loop_filterP2S_64x64: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0]! + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2]! + + vld1.u8 {q9-q10}, [r0], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_64x64 + bx lr +endfunc + +function x265_filterPixelToShort_48x64_neon + add r3, r3 + sub r1, #32 + sub r3, #64 + vmov.u16 q0, #64 + vmov.u16 q1, #8192 + vneg.s16 q1, q1 + mov r12, #32 +.loop_filterP2S_48x64: + subs r12, #1 +.rept 2 + vld1.u8 {q9-q10}, [r0]! + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmovl.u8 q11, d20 + vmovl.u8 q10, d21 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2]! + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q11, q0 + vmla.s16 q3, q10, q0 + vst1.16 {q2-q3}, [r2]! 
+ + vld1.u8 {q9}, [r6], r1 + vmovl.u8 q8, d18 + vmovl.u8 q9, d19 + vmov q2, q1 + vmov q3, q1 + vmla.s16 q2, q8, q0 + vmla.s16 q3, q9, q0 + vst1.16 {q2-q3}, [r2], r3 +.endr + bgt .loop_filterP2S_48x64 + bx lr +endfunc + +//**************luma_vpp************ +.align 8 +// TODO: I don't like S16 in here, but the VMUL with scalar doesn't support (U8 x U8) +g_luma_s16: +.hword 0, 0, 0, 64, 0, 0, 0, 0 +.hword -1, 4, -10, 58, 17, -5, 1, 0 +.hword -1, 4, -11, 40, 40, -11, 4, -1 +.hword 0, 1, -5, 17, 58, -10, 4, -1 + +.macro LUMA_VPP_4xN h +function x265_interp_8tap_vert_pp_4x\h\()_neon + ldr r12, [sp] + push {lr} + adr lr, g_luma_s16 + sub r0, r1 + sub r0, r0, r1, lsl #1 // src -= 3 * srcStride + add lr, lr, r12, lsl #4 + vld1.16 {q0}, [lr, :64] // q0 = luma interpolation coefficients + vdup.s16 d24, d0[0] + vdup.s16 d25, d0[1] + vdup.s16 d26, d0[2] + vdup.s16 d27, d0[3] + vdup.s16 d28, d1[0] + vdup.s16 d29, d1[1] + vdup.s16 d30, d1[2] + vdup.s16 d31, d1[3] + + mov r12, #\h + + // prepare to load 8 lines + vld1.u32 {d0[0]}, [r0], r1 + vld1.u32 {d0[1]}, [r0], r1 + vld1.u32 {d2[0]}, [r0], r1 + vld1.u32 {d2[1]}, [r0], r1 + vld1.u32 {d4[0]}, [r0], r1 + vld1.u32 {d4[1]}, [r0], r1 + vld1.u32 {d6[0]}, [r0], r1 + vld1.u32 {d6[1]}, [r0], r1 + vmovl.u8 q0, d0 + vmovl.u8 q1, d2 + vmovl.u8 q2, d4 + vmovl.u8 q3, d6 + +.loop_4x\h: + // TODO: reading 1 extra row ahead would be faster, but may crash on OS X! + vld1.u32 {d16[0]}, [r0], r1 + vld1.u32 {d16[1]}, [r0], r1 + vmovl.u8 q8, d16 + + // row[0-1] + vmul.s16 q9, q0, q12 + vext.64 q11, q0, q1, 1 + vmul.s16 q10, q11, q12 + vmov q0, q1 + + // row[2-3] + vmla.s16 q9, q1, q13 + vext.64 q11, q1, q2, 1 + vmla.s16 q10, q11, q13 + vmov q1, q2 + + // row[4-5] + vmla.s16 q9, q2, q14 + vext.64 q11, q2, q3, 1 + vmla.s16 q10, q11, q14 + vmov q2, q3 + + // row[6-7] + vmla.s16 q9, q3, q15 + vext.64 q11, q3, q8, 1 + vmla.s16 q10, q11, q15 + vmov q3, q8 + + // sum row[0-7] + vadd.s16 d18, d18, d19 + vadd.s16 d19, d20, d21 + + vqrshrun.s16 d18, q9, #6 + vst1.u32 {d18[0]}, [r2], r3 + vst1.u32 {d18[1]}, [r2], r3 + + subs r12, #2 + bne .loop_4x\h + + pop {pc} + .ltorg +endfunc +.endm + +LUMA_VPP_4xN 4 +LUMA_VPP_4xN 8 +LUMA_VPP_4xN 16 + +.macro qpel_filter_0_32b + vmov.i16 d17, #64 + vmovl.u8 q11, d3 + vmull.s16 q9, d22, d17 // 64*d0 + vmull.s16 q10, d23, d17 // 64*d1 +.endm + +.macro qpel_filter_1_32b + vmov.i16 d16, #58 + vmovl.u8 q11, d3 + vmull.s16 q9, d22, d16 // 58 * d0 + vmull.s16 q10, d23, d16 // 58 * d1 + + vmov.i16 d17, #10 + vmovl.u8 q13, d2 + vmull.s16 q11, d26, d17 // 10 * c0 + vmull.s16 q12, d27, d17 // 10 * c1 + + vmov.i16 d16, #17 + vmovl.u8 q15, d4 + vmull.s16 q13, d30, d16 // 17 * e0 + vmull.s16 q14, d31, d16 // 17 * e1 + + vmov.i16 d17, #5 + vmovl.u8 q1, d5 + vmull.s16 q15, d2, d17 // 5 * f0 + vmull.s16 q8, d3, d17 // 5 * f1 + + vsub.s32 q9, q11 // 58 * d0 - 10 * c0 + vsub.s32 q10, q12 // 58 * d1 - 10 * c1 + + vmovl.u8 q1, d1 + vshll.s16 q11, d2, #2 // 4 * b0 + vshll.s16 q12, d3, #2 // 4 * b1 + + vadd.s32 q9, q13 // 58 * d0 - 10 * c0 + 17 * e0 + vadd.s32 q10, q14 // 58 * d1 - 10 * c1 + 17 * e1 + + vmovl.u8 q1, d0 + vmovl.u8 q2, d6 + vsubl.s16 q13, d4, d2 // g0 - a0 + vsubl.s16 q14, d5, d3 // g1 - a1 + + vadd.s32 q9, q11 // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + vadd.s32 q10, q12 // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + vsub.s32 q13, q15 // g0 - a0 - 5 * f0 + vsub.s32 q14, q8 // g1 - a1 - 5 * f1 + vadd.s32 q9, q13 // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + g0 - a0 - 5 * f0 + vadd.s32 q10, q14 // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + g1 - a1 - 5 * f1 +.endm + +.macro 
qpel_filter_2_32b + vmov.i32 q8, #11 + vmovl.u8 q11, d3 + vmovl.u8 q12, d4 + vaddl.s16 q9, d22,d24 // d0 + e0 + vaddl.s16 q10, d23, d25 // d1 + e1 + + vmovl.u8 q13, d2 //c + vmovl.u8 q14, d5 //f + vaddl.s16 q11, d26, d28 // c0 + f0 + vaddl.s16 q12, d27, d29 // c1 + f1 + + vmul.s32 q11, q8 // 11 * (c0 + f0) + vmul.s32 q12, q8 // 11 * (c1 + f1) + + vmov.i32 q8, #40 + vmul.s32 q9, q8 // 40 * (d0 + e0) + vmul.s32 q10, q8 // 40 * (d1 + e1) + + vmovl.u8 q13, d1 //b + vmovl.u8 q14, d6 //g + vaddl.s16 q15, d26, d28 // b0 + g0 + vaddl.s16 q8, d27, d29 // b1 + g1 + + vmovl.u8 q1, d0 //a + vmovl.u8 q2, d7 //h + vaddl.s16 q13, d2, d4 // a0 + h0 + vaddl.s16 q14, d3, d5 // a1 + h1 + + vshl.s32 q15, #2 // 4*(b0+g0) + vshl.s32 q8, #2 // 4*(b1+g1) + + vadd.s32 q11, q13 // 11 * (c0 + f0) + a0 + h0 + vadd.s32 q12, q14 // 11 * (c1 + f1) + a1 + h1 + vadd.s32 q9, q15 // 40 * (d0 + e0) + 4*(b0+g0) + vadd.s32 q10, q8 // 40 * (d1 + e1) + 4*(b1+g1) + vsub.s32 q9, q11 // 40 * (d0 + e0) + 4*(b0+g0) - (11 * (c0 + f0) + a0 + h0) + vsub.s32 q10, q12 // 40 * (d1 + e1) + 4*(b1+g1) - (11 * (c1 + f1) + a1 + h1) +.endm + +.macro qpel_filter_3_32b + + vmov.i16 d16, #17 + vmov.i16 d17, #5 + + vmovl.u8 q11, d3 + vmull.s16 q9, d22, d16 // 17 * d0 + vmull.s16 q10, d23, d16 // 17 * d1 + + vmovl.u8 q13, d2 + vmull.s16 q11, d26, d17 // 5 * c0 + vmull.s16 q12, d27, d17 // 5* c1 + + vmov.i16 d16, #58 + vmovl.u8 q15, d4 + vmull.s16 q13, d30, d16 // 58 * e0 + vmull.s16 q14, d31, d16 // 58 * e1 + + vmov.i16 d17, #10 + vmovl.u8 q1, d5 + vmull.s16 q15, d2, d17 // 10 * f0 + vmull.s16 q8, d3, d17 // 10 * f1 + + vsub.s32 q9, q11 // 17 * d0 - 5 * c0 + vsub.s32 q10, q12 // 17 * d1 - 5 * c1 + + vmovl.u8 q1, d6 + vshll.s16 q11, d2, #2 // 4 * g0 + vshll.s16 q12, d3, #2 // 4 * g1 + + vadd.s32 q9, q13 // 17 * d0 - 5 * c0+ 58 * e0 + vadd.s32 q10, q14 // 17 * d1 - 5 * c1 + 58 * e1 + + vmovl.u8 q1, d1 + vmovl.u8 q2, d7 + vsubl.s16 q13, d2, d4 // b0 - h0 + vsubl.s16 q14, d3, d5 // b1 - h1 + + vadd.s32 q9, q11 // 17 * d0 - 5 * c0+ 58 * e0 +4 * g0 + vadd.s32 q10, q12 // 17 * d1 - 5 * c1 + 58 * e1+4 * g1 + vsub.s32 q13, q15 // 17 * d0 - 5 * c0+ 58 * e0 +4 * g0 -10 * f0 + vsub.s32 q14, q8 // 17 * d1 - 5 * c1 + 58 * e1+4 * g1 - 10*f1 + vadd.s32 q9, q13 // 17 * d0 - 5 * c0+ 58 * e0 +4 * g0 -10 * f0 +b0 - h0 + vadd.s32 q10, q14 // 17 * d1 - 5 * c1 + 58 * e1+4 * g1 - 10*f1 + b1 - h1 +.endm + +.macro FILTER_VPP a b filterv + +.loop_\filterv\()_\a\()x\b: + + mov r7, r2 + mov r6, r0 + eor r8, r8 + +.loop_w8_\filterv\()_\a\()x\b: + + add r6, r0, r8 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + pld [r6] + vld1.u8 d4, [r6], r1 + pld [r6] + vld1.u8 d5, [r6], r1 + pld [r6] + vld1.u8 d6, [r6], r1 + pld [r6] + vld1.u8 d7, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + \filterv + + mov r12,#32 + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r7]! 
+ + add r8, #8 + cmp r8, #\a + blt .loop_w8_\filterv\()_\a\()x\b + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_\filterv\()_\a\()x\b + +.endm + +.macro LUMA_VPP w h +function x265_interp_8tap_vert_pp_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r5, [sp, #4 * 5] + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + mov r4, #\h + + cmp r5, #0 + beq 0f + cmp r5, #1 + beq 1f + cmp r5, #2 + beq 2f + cmp r5, #3 + beq 3f +0: + FILTER_VPP \w \h qpel_filter_0_32b + b 5f +1: + FILTER_VPP \w \h qpel_filter_1_32b + b 5f +2: + FILTER_VPP \w \h qpel_filter_2_32b + b 5f +3: + FILTER_VPP \w \h qpel_filter_3_32b + b 5f +5: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +LUMA_VPP 8 4 +LUMA_VPP 8 8 +LUMA_VPP 8 16 +LUMA_VPP 8 32 +LUMA_VPP 16 4 +LUMA_VPP 16 8 +LUMA_VPP 16 16 +LUMA_VPP 16 32 +LUMA_VPP 16 64 +LUMA_VPP 16 12 +LUMA_VPP 32 8 +LUMA_VPP 32 16 +LUMA_VPP 32 32 +LUMA_VPP 32 64 +LUMA_VPP 32 24 +LUMA_VPP 64 16 +LUMA_VPP 64 32 +LUMA_VPP 64 64 +LUMA_VPP 64 48 +LUMA_VPP 24 32 +LUMA_VPP 48 64 + +function x265_interp_8tap_vert_pp_12x16_neon + push {r4, r5, r6, r7} + ldr r5, [sp, #4 * 4] + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + + mov r4, #16 +.loop_vpp_12x16: + + mov r6, r0 + mov r7, r2 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + pld [r6] + vld1.u8 d4, [r6], r1 + pld [r6] + vld1.u8 d5, [r6], r1 + pld [r6] + vld1.u8 d6, [r6], r1 + pld [r6] + vld1.u8 d7, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + cmp r5,#0 + beq 0f + cmp r5,#1 + beq 1f + cmp r5,#2 + beq 2f + cmp r5,#3 + beq 3f +0: + qpel_filter_0_32b + b 5f +1: + qpel_filter_1_32b + b 5f +2: + qpel_filter_2_32b + b 5f +3: + qpel_filter_3_32b + b 5f +5: + mov r12,#32 + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r7]! + + add r6, r0, #8 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + pld [r6] + vld1.u8 d4, [r6], r1 + pld [r6] + vld1.u8 d5, [r6], r1 + pld [r6] + vld1.u8 d6, [r6], r1 + pld [r6] + vld1.u8 d7, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + cmp r5,#0 + beq 0f + cmp r5,#1 + beq 1f + cmp r5,#2 + beq 2f + cmp r5,#3 + beq 3f +0: + qpel_filter_0_32b + b 5f +1: + qpel_filter_1_32b + b 5f +2: + qpel_filter_2_32b + b 5f +3: + qpel_filter_3_32b + b 5f +5: + mov r12,#32 + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u32 d0[0], [r7]! + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_vpp_12x16 + + pop {r4, r5, r6, r7} + bx lr +endfunc +//**************luma_vsp************ +.macro LUMA_VSP_4xN h +function x265_interp_8tap_vert_sp_4x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #4 * 3] + mov r5, r4, lsl #6 + lsl r1, #1 + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + + mov r12, #1 + lsl r12, #19 + add r12, #2048 + vdup.32 q8, r12 + mov r4, #\h +.loop_vsp_4x\h: + movrel r12, g_lumaFilter + add r12, r5 + mov r6, r0 + + pld [r6] + vld1.u16 d0, [r6], r1 + pld [r6] + vld1.u16 d1, [r6], r1 + pld [r6] + vld1.u16 d2, [r6], r1 + pld [r6] + vld1.u16 d3, [r6], r1 + pld [r6] + vld1.u16 d4, [r6], r1 + pld [r6] + vld1.u16 d5, [r6], r1 + pld [r6] + vld1.u16 d6, [r6], r1 + pld [r6] + vld1.u16 d7, [r6], r1 + + veor.u8 q9, q9 + + vmovl.s16 q11, d0 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d1 + vld1.s32 d24, [r12]! 
+ vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d2 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d3 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d4 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d5 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d6 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + vmovl.s16 q11, d7 + vld1.s32 d24, [r12]! + vmov.s32 d25, d24 + vmla.s32 q9, q12, q11 + + + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #12 + vqmovn.u16 d0, q0 + vst1.u32 d0[0], [r2], r3 + + add r0, r1 + subs r4, #1 + bne .loop_vsp_4x\h + pop {r4, r5, r6} + bx lr + .ltorg +endfunc +.endm + +LUMA_VSP_4xN 4 +LUMA_VSP_4xN 8 +LUMA_VSP_4xN 16 + +.macro qpel_filter_0_32b_1 + vmov.i16 d17, #64 + vmull.s16 q9, d6, d17 // 64*d0 + vmull.s16 q10, d7, d17 // 64*d1 +.endm + +.macro qpel_filter_1_32b_1 + vmov.i16 d16, #58 + vmov.i16 d17, #10 + vmull.s16 q9, d6, d16 // 58 * d0 + vmull.s16 q10, d7, d16 // 58 * d1 + vmov.i16 d16, #17 + vmull.s16 q11, d4, d17 // 10 * c0 + vmull.s16 q12, d5, d17 // 10 * c1 + vmov.i16 d17, #5 + vmull.s16 q13, d8, d16 // 17 * e0 + vmull.s16 q14, d9, d16 // 17 * e1 + vmull.s16 q15, d10, d17 // 5 * f0 + vmull.s16 q8, d11, d17 // 5 * f1 + vsub.s32 q9, q11 // 58 * d0 - 10 * c0 + vsub.s32 q10, q12 // 58 * d1 - 10 * c1 + vshll.s16 q11, d2, #2 // 4 * b0 + vshll.s16 q12, d3, #2 // 4 * b1 + vadd.s32 q9, q13 // 58 * d0 - 10 * c0 + 17 * e0 + vadd.s32 q10, q14 // 58 * d1 - 10 * c1 + 17 * e1 + vsubl.s16 q13, d12, d0 // g0 - a0 + vsubl.s16 q14, d13, d1 // g1 - a1 + vadd.s32 q9, q11 // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + vadd.s32 q10, q12 // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + vsub.s32 q13, q15 // g0 - a0 - 5 * f0 + vsub.s32 q14, q8 // g1 - a1 - 5 * f1 + vadd.s32 q9, q13 // 58 * d0 - 10 * c0 + 17 * e0 + 4 * b0 + g0 - a0 - 5 * f0 + vadd.s32 q10, q14 // 58 * d1 - 10 * c1 + 17 * e1 + 4 * b1 + g1 - a1 - 5 * f1 +.endm + +.macro qpel_filter_2_32b_1 + vmov.i32 q8, #11 + vaddl.s16 q9, d6, d8 // d0 + e0 + vaddl.s16 q10, d7, d9 // d1 + e1 + vaddl.s16 q11, d4, d10 // c0 + f0 + vaddl.s16 q12, d5, d11 // c1 + f1 + vmul.s32 q11, q8 // 11 * (c0 + f0) + vmul.s32 q12, q8 // 11 * (c1 + f1) + vmov.i32 q8, #40 + vaddl.s16 q15, d2, d12 // b0 + g0 + vmul.s32 q9, q8 // 40 * (d0 + e0) + vmul.s32 q10, q8 // 40 * (d1 + e1) + vaddl.s16 q8, d3, d13 // b1 + g1 + vaddl.s16 q13, d0, d14 // a0 + h0 + vaddl.s16 q14, d1, d15 // a1 + h1 + vshl.s32 q15, #2 // 4*(b0+g0) + vshl.s32 q8, #2 // 4*(b1+g1) + vadd.s32 q11, q13 // 11 * (c0 + f0) + a0 + h0 + vadd.s32 q12, q14 // 11 * (c1 + f1) + a1 + h1 + vadd.s32 q9, q15 // 40 * (d0 + e0) + 4*(b0+g0) + vadd.s32 q10, q8 // 40 * (d1 + e1) + 4*(b1+g1) + vsub.s32 q9, q11 // 40 * (d0 + e0) + 4*(b0+g0) - (11 * (c0 + f0) + a0 + h0) + vsub.s32 q10, q12 // 40 * (d1 + e1) + 4*(b1+g1) - (11 * (c1 + f1) + a1 + h1) +.endm + +.macro qpel_filter_3_32b_1 + vmov.i16 d16, #17 + vmov.i16 d17, #5 + vmull.s16 q9, d6, d16 // 17 * d0 + vmull.s16 q10, d7, d16 // 17 * d1 + vmull.s16 q11, d4, d17 // 5 * c0 + vmull.s16 q12, d5, d17 // 5* c1 + vmov.i16 d16, #58 + vmull.s16 q13, d8, d16 // 58 * e0 + vmull.s16 q14, d9, d16 // 58 * e1 + vmov.i16 d17, #10 + vmull.s16 q15, d10, d17 // 10 * f0 + vmull.s16 q8, d11, d17 // 10 * f1 + vsub.s32 q9, q11 // 17 * d0 - 5 * c0 + vsub.s32 q10, q12 // 17 * d1 - 5 * c1 + vshll.s16 q11, d12, #2 // 4 * g0 + vshll.s16 q12, d13, #2 // 4 * g1 + vadd.s32 q9, q13 // 17 * d0 - 5 * c0+ 58 * e0 + vadd.s32 
q10, q14 // 17 * d1 - 5 * c1 + 58 * e1 + vsubl.s16 q13, d2, d14 // b0 - h0 + vsubl.s16 q14, d3, d15 // b1 - h1 + vadd.s32 q9, q11 // 17 * d0 - 5 * c0+ 58 * e0 +4 * g0 + vadd.s32 q10, q12 // 17 * d1 - 5 * c1 + 58 * e1+4 * g1 + vsub.s32 q13, q15 // 17 * d0 - 5 * c0+ 58 * e0 +4 * g0 -10 * f0 + vsub.s32 q14, q8 // 17 * d1 - 5 * c1 + 58 * e1+4 * g1 - 10*f1 + vadd.s32 q9, q13 // 17 * d0 - 5 * c0+ 58 * e0 +4 * g0 -10 * f0 +b0 - h0 + vadd.s32 q10, q14 // 17 * d1 - 5 * c1 + 58 * e1+4 * g1 - 10*f1 + b1 - h1 +.endm + +.macro FILTER_VSP a b filterv + + vpush { q4 - q7} +.loop_\filterv\()_\a\()x\b: + + mov r7, r2 + mov r6, r0 + eor r8, r8 + +.loop_w8_\filterv\()_\a\()x\b: + + add r6, r0, r8 + + pld [r6] + vld1.u16 {q0}, [r6], r1 + pld [r6] + vld1.u16 {q1}, [r6], r1 + pld [r6] + vld1.u16 {q2}, [r6], r1 + pld [r6] + vld1.u16 {q3}, [r6], r1 + pld [r6] + vld1.u16 {q4}, [r6], r1 + pld [r6] + vld1.u16 {q5}, [r6], r1 + pld [r6] + vld1.u16 {q6}, [r6], r1 + pld [r6] + vld1.u16 {q7}, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + \filterv + + mov r12,#1 + lsl r12, #19 + add r12, #2048 + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #12 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #12 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r7]! + + + add r8, #16 + mov r12, #\a + lsl r12, #1 + cmp r8, r12 + blt .loop_w8_\filterv\()_\a\()x\b + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_\filterv\()_\a\()x\b + + vpop { q4 - q7} + +.endm + +.macro LUMA_VSP w h +function x265_interp_8tap_vert_sp_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r5, [sp, #4 * 5] + lsl r1, #1 + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + mov r4, #\h + + cmp r5, #0 + beq 0f + cmp r5, #1 + beq 1f + cmp r5, #2 + beq 2f + cmp r5, #3 + beq 3f +0: + FILTER_VSP \w \h qpel_filter_0_32b_1 + b 5f +1: + FILTER_VSP \w \h qpel_filter_1_32b_1 + b 5f +2: + FILTER_VSP \w \h qpel_filter_2_32b_1 + b 5f +3: + FILTER_VSP \w \h qpel_filter_3_32b_1 + b 5f +5: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + + +LUMA_VSP 8 4 +LUMA_VSP 8 8 +LUMA_VSP 8 16 +LUMA_VSP 8 32 +LUMA_VSP 16 4 +LUMA_VSP 16 8 +LUMA_VSP 16 16 +LUMA_VSP 16 32 +LUMA_VSP 16 64 +LUMA_VSP 16 12 +LUMA_VSP 32 8 +LUMA_VSP 32 16 +LUMA_VSP 32 32 +LUMA_VSP 32 64 +LUMA_VSP 32 24 +LUMA_VSP 64 16 +LUMA_VSP 64 32 +LUMA_VSP 64 64 +LUMA_VSP 64 48 +LUMA_VSP 24 32 +LUMA_VSP 48 64 + +function x265_interp_8tap_vert_sp_12x16_neon + push {r4, r5, r6, r7} + ldr r5, [sp, #4 * 4] + lsl r1, #1 + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + + mov r4, #16 + vpush { q4 - q7} +.loop1_12x16: + + mov r6, r0 + mov r7, r2 + + pld [r6] + vld1.u16 {q0}, [r6], r1 + pld [r6] + vld1.u16 {q1}, [r6], r1 + pld [r6] + vld1.u8 {q2}, [r6], r1 + pld [r6] + vld1.u16 {q3}, [r6], r1 + pld [r6] + vld1.u16 {q4}, [r6], r1 + pld [r6] + vld1.u16 {q5}, [r6], r1 + pld [r6] + vld1.u16 {q6}, [r6], r1 + pld [r6] + vld1.u16 {q7}, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + cmp r5,#0 + beq 0f + cmp r5,#1 + beq 1f + cmp r5,#2 + beq 2f + cmp r5,#3 + beq 3f +0: + qpel_filter_0_32b_1 + b 5f +1: + qpel_filter_1_32b_1 + b 5f +2: + qpel_filter_2_32b_1 + b 5f +3: + qpel_filter_3_32b_1 + b 5f +5: + mov r12,#1 + lsl r12, #19 + add r12, #2048 + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #12 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #12 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r7]! 
+ + add r6, r0, #16 + + pld [r6] + vld1.u16 {q0}, [r6], r1 + pld [r6] + vld1.u16 {q1}, [r6], r1 + pld [r6] + vld1.u8 {q2}, [r6], r1 + pld [r6] + vld1.u16 {q3}, [r6], r1 + pld [r6] + vld1.u16 {q4}, [r6], r1 + pld [r6] + vld1.u16 {q5}, [r6], r1 + pld [r6] + vld1.u16 {q6}, [r6], r1 + pld [r6] + vld1.u16 {q7}, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + cmp r5,#0 + beq 0f + cmp r5,#1 + beq 1f + cmp r5,#2 + beq 2f + cmp r5,#3 + beq 3f +0: + qpel_filter_0_32b_1 + b 5f +1: + qpel_filter_1_32b_1 + b 5f +2: + qpel_filter_2_32b_1 + b 5f +3: + qpel_filter_3_32b_1 + b 5f +5: + mov r12,#1 + lsl r12, #19 + add r12, #2048 + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #12 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #12 + vqmovn.u16 d0, q0 + vst1.u32 d0[0], [r7]! + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop1_12x16 + vpop { q4 - q7} + pop {r4, r5, r6, r7} + bx lr +endfunc +//**************luma_vps***************** +.macro LUMA_VPS_4xN h +function x265_interp_8tap_vert_ps_4x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #4 * 3] + lsl r3, #1 + mov r5, r4, lsl #6 + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + + mov r4, #8192 + vdup.32 q8, r4 + mov r4, #\h + +.loop_vps_4x\h: + movrel r12, g_lumaFilter + add r12, r5 + mov r6, r0 + + pld [r6] + vld1.u32 d0[0], [r6], r1 + pld [r6] + vld1.u32 d0[1], [r6], r1 + pld [r6] + vld1.u32 d1[0], [r6], r1 + pld [r6] + vld1.u32 d1[1], [r6], r1 + pld [r6] + vld1.u32 d2[0], [r6], r1 + pld [r6] + vld1.u32 d2[1], [r6], r1 + pld [r6] + vld1.u32 d3[0], [r6], r1 + pld [r6] + vld1.u32 d3[1], [r6], r1 + + veor.u8 q9, q9 + + vmovl.u8 q11, d0 + vmovl.u16 q12, d22 + vmovl.u16 q13, d23 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q12, q10 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q13, q10 + + vmovl.u8 q11, d1 + vmovl.u16 q12, d22 + vmovl.u16 q13, d23 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q12, q10 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q13, q10 + + vmovl.u8 q11, d2 + vmovl.u16 q12, d22 + vmovl.u16 q13, d23 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q12, q10 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q13, q10 + + vmovl.u8 q11, d3 + vmovl.u16 q12, d22 + vmovl.u16 q13, d23 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q12, q10 + vld1.s32 d20, [r12]! + vmov.s32 d21, d20 + vmla.s32 q9, q13, q10 + + vsub.s32 q9, q8 + vqmovn.s32 d0, q9 + vst1.u16 d0, [r2], r3 + + add r0, r1 + subs r4, #1 + bne .loop_vps_4x\h + + pop {r4, r5, r6} + bx lr + .ltorg +endfunc +.endm + +LUMA_VPS_4xN 4 +LUMA_VPS_4xN 8 +LUMA_VPS_4xN 16 + + +.macro FILTER_VPS a b filterv + +.loop_ps_\filterv\()_\a\()x\b: + + mov r7, r2 + mov r6, r0 + eor r8, r8 + +.loop_ps_w8_\filterv\()_\a\()x\b: + + add r6, r0, r8 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + pld [r6] + vld1.u8 d4, [r6], r1 + pld [r6] + vld1.u8 d5, [r6], r1 + pld [r6] + vld1.u8 d6, [r6], r1 + pld [r6] + vld1.u8 d7, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + \filterv + + mov r12,#8192 + vdup.32 q8, r12 + vsub.s32 q9, q8 + vqmovn.s32 d0, q9 + vsub.s32 q10, q8 + vqmovn.s32 d1, q10 + vst1.u16 {q0}, [r7]! 
+ + add r8, #8 + cmp r8, #\a + blt .loop_ps_w8_\filterv\()_\a\()x\b + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_ps_\filterv\()_\a\()x\b + +.endm + +.macro LUMA_VPS w h +function x265_interp_8tap_vert_ps_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r5, [sp, #4 * 5] + lsl r3, #1 + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + mov r4, #\h + + cmp r5, #0 + beq 0f + cmp r5, #1 + beq 1f + cmp r5, #2 + beq 2f + cmp r5, #3 + beq 3f +0: + FILTER_VPS \w \h qpel_filter_0_32b + b 5f +1: + FILTER_VPS \w \h qpel_filter_1_32b + b 5f +2: + FILTER_VPS \w \h qpel_filter_2_32b + b 5f +3: + FILTER_VPS \w \h qpel_filter_3_32b + b 5f +5: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +LUMA_VPS 8 4 +LUMA_VPS 8 8 +LUMA_VPS 8 16 +LUMA_VPS 8 32 +LUMA_VPS 16 4 +LUMA_VPS 16 8 +LUMA_VPS 16 16 +LUMA_VPS 16 32 +LUMA_VPS 16 64 +LUMA_VPS 16 12 +LUMA_VPS 32 8 +LUMA_VPS 32 16 +LUMA_VPS 32 32 +LUMA_VPS 32 64 +LUMA_VPS 32 24 +LUMA_VPS 64 16 +LUMA_VPS 64 32 +LUMA_VPS 64 64 +LUMA_VPS 64 48 +LUMA_VPS 24 32 +LUMA_VPS 48 64 + +function x265_interp_8tap_vert_ps_12x16_neon + push {r4, r5, r6, r7} + lsl r3, #1 + ldr r5, [sp, #4 * 4] + mov r4, r1, lsl #2 + sub r4, r1 + sub r0, r4 + + mov r4, #16 +.loop_vps_12x16: + + mov r6, r0 + mov r7, r2 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + pld [r6] + vld1.u8 d4, [r6], r1 + pld [r6] + vld1.u8 d5, [r6], r1 + pld [r6] + vld1.u8 d6, [r6], r1 + pld [r6] + vld1.u8 d7, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + cmp r5,#0 + beq 0f + cmp r5,#1 + beq 1f + cmp r5,#2 + beq 2f + cmp r5,#3 + beq 3f +0: + qpel_filter_0_32b + b 5f +1: + qpel_filter_1_32b + b 5f +2: + qpel_filter_2_32b + b 5f +3: + qpel_filter_3_32b + b 5f +5: + mov r12,#8192 + vdup.32 q8, r12 + vsub.s32 q9, q8 + vqmovn.s32 d0, q9 + vsub.s32 q10, q8 + vqmovn.s32 d1, q10 + vst1.u8 {q0}, [r7]! + + add r6, r0, #8 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + pld [r6] + vld1.u8 d4, [r6], r1 + pld [r6] + vld1.u8 d5, [r6], r1 + pld [r6] + vld1.u8 d6, [r6], r1 + pld [r6] + vld1.u8 d7, [r6], r1 + + veor.u8 q9, q9 + veor.u8 q10, q10 + + cmp r5,#0 + beq 0f + cmp r5,#1 + beq 1f + cmp r5,#2 + beq 2f + cmp r5,#3 + beq 3f +0: + qpel_filter_0_32b + b 5f +1: + qpel_filter_1_32b + b 5f +2: + qpel_filter_2_32b + b 5f +3: + qpel_filter_3_32b + b 5f +5: + mov r12,#8192 + vdup.32 q8, r12 + vsub.s32 q9, q8 + vqmovn.s32 d0, q9 + vst1.u8 d0, [r7]! 
+ + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_vps_12x16 + + pop {r4, r5, r6, r7} + bx lr +endfunc + +//************chroma_vpp************ + +.macro qpel_filter_chroma_0_32b + vmov.i16 d16, #64 + vmull.s16 q6, d6, d16 // 64*b0 + vmull.s16 q7, d7, d16 // 64*b1 +.endm + +.macro qpel_filter_chroma_1_32b + vmov.i16 d16, #58 + vmov.i16 d17, #10 + vmull.s16 q9, d6, d16 // 58*b0 + vmull.s16 q10, d7, d16 // 58*b1 + vmull.s16 q11, d8, d17 // 10*c0 + vmull.s16 q12, d9, d17 // 10*c1 + vadd.s16 q2, q5 //a +d + vshll.s16 q13, d4, #1 // 2 * (a0+d0) + vshll.s16 q14, d5, #1 // 2 * (a1+d1) + vsub.s32 q9, q13 // 58*b0 - 2 * (a0+d0) + vsub.s32 q10, q14 // 58*b1 - 2 * (a1+d1) + vadd.s32 q6, q9, q11 // 58*b0 - 2 * (a0+d0) +10*c0 + vadd.s32 q7, q10, q12 // 58*b1 - 2 * (a1+d1) +10*c1 +.endm + +.macro qpel_filter_chroma_2_32b + vmov.i16 d16, #54 + vmull.s16 q9, d6, d16 // 54*b0 + vmull.s16 q10, d7, d16 // 54*b1 + vshll.s16 q11, d4, #2 // 4 * a0 + vshll.s16 q12, d5, #2 // 4 * a1 + vshll.s16 q13, d8, #4 // 16 * c0 + vshll.s16 q14, d9, #4 // 16 * c1 + vshll.s16 q15, d10, #1 // 2 * d0 + vshll.s16 q8, d11, #1 // 2 * d1 + + vadd.s32 q9, q13 // 54*b0 + 16 * c0 + vadd.s32 q10, q14 // 54*b1 + 16 * c1 + vadd.s32 q11, q15 // 4 * a0 +2 * d0 + vadd.s32 q12, q8 // 4 * a1 +2 * d1 + vsub.s32 q6, q9, q11 // 54*b0 + 16 * c0 - ( 4 * a0 +2 * d0) + vsub.s32 q7, q10, q12 // 54*b0 + 16 * c0 - ( 4 * a0 +2 * d0) +.endm + +.macro qpel_filter_chroma_3_32b + vmov.i16 d16, #46 + vmov.i16 d17, #28 + vmull.s16 q9, d6, d16 // 46*b0 + vmull.s16 q10, d7, d16 // 46*b1 + vmull.s16 q11, d8, d17 // 28*c0 + vmull.s16 q12, d9, d17 // 28*c1 + vmov.i16 d17, #6 + vshll.s16 q13, d10, #2 // 4 * d0 + vshll.s16 q14, d11, #2 // 4 * d1 + vmull.s16 q15, d4, d17 // 6*a0 + vmull.s16 q8, d5, d17 // 6*a1 + vadd.s32 q9, q11 // 46*b0 + 28*c0 + vadd.s32 q10, q12 // 46*b1 + 28*c1 + vadd.s32 q13, q15 // 4 * d0 + 6*a0 + vadd.s32 q14, q8 // 4 * d1 + 6*a1 + vsub.s32 q6, q9, q13 // 46*b0 + 28*c0 -(4 * d0 + 6*a0) + vsub.s32 q7, q10, q14 // 46*b1 + 28*c1 -(4 * d1 + 6*a1) +.endm + +.macro qpel_filter_chroma_4_32b + vmov.i16 d16, #36 + vadd.s16 q2, q5 // a +d + vadd.s16 q3, q4 // b+c + vmull.s16 q9, d6, d16 // 36*(b0 + c0) + vmull.s16 q10, d7, d16 // 36*(b1 + c1) + vshll.s16 q11, d4, #2 // 4 * (a0+d0) + vshll.s16 q12, d5, #2 // 4 * (a1+d1) + vsub.s32 q6, q9, q11 // 36*(b0 + c0) - ( 4 * (a0+d0)) + vsub.s32 q7, q10, q12 // 36*(b1 + c1) - ( 4 * (a1+d1)) +.endm + +.macro qpel_filter_chroma_5_32b + vmov.i16 d16, #46 + vmov.i16 d17, #28 + vmull.s16 q9, d6, d17 // 28*b0 + vmull.s16 q10, d7, d17 // 28*b1 + vmull.s16 q11, d8, d16 // 46*c0 + vmull.s16 q12, d9, d16 // 46*c1 + vmov.i16 d17, #6 + vshll.s16 q13, d4, #2 // 4 * a0 + vshll.s16 q14, d5, #2 // 4 * a1 + vmull.s16 q15, d10, d17 // 6*d0 + vmull.s16 q8, d11, d17 // 6*d1 + vadd.s32 q9, q11 // 28*b0 + 46*c0 + vadd.s32 q10, q12 // 28*b1 + 46*c1 + vadd.s32 q13, q15 // 4 * a0 + 6*d0 + vadd.s32 q14, q8 // 4 * a1 + 6*d1 + vsub.s32 q6, q9, q13 // 28*b0 + 46*c0- (4 * a0 + 6*d0) + vsub.s32 q7, q10, q14 // 28*b1 + 46*c1- (4 * a1 + 6*d1) +.endm + +.macro qpel_filter_chroma_6_32b + vmov.i16 d16, #54 + vmull.s16 q9, d8, d16 // 54*c0 + vmull.s16 q10, d9, d16 // 54*c1 + vshll.s16 q11, d4, #1 // 2 * a0 + vshll.s16 q12, d5, #1 // 2 * a1 + vshll.s16 q13, d6, #4 // 16 * b0 + vshll.s16 q14, d7, #4 // 16 * b1 + vshll.s16 q15, d10, #2 // 4 * d0 + vshll.s16 q8, d11, #2 // 4 * d1 + vadd.s32 q9, q13 // 54*c0 + 16 * b0 + vadd.s32 q10, q14 // 54*c1 + 16 * b1 + vadd.s32 q11, q15 // 2 * a0 + 4 * d0 + vadd.s32 q12, q8 // 2 * a1 + 4 * d1 + vsub.s32 q6, q9, 
q11 // 54*c0 + 16 * b0 - ( 2 * a0 + 4 * d0) + vsub.s32 q7, q10, q12 // 54*c1 + 16 * b1 - ( 2 * a1 + 4 * d1) +.endm + +.macro qpel_filter_chroma_7_32b + vmov.i16 d16, #10 + vmov.i16 d17, #58 + vmull.s16 q9, d6, d16 // 10*b0 + vmull.s16 q10, d7, d16 // 10*b1 + vmull.s16 q11, d8, d17 // 58*c0 + vmull.s16 q12, d9, d17 // 58*c1 + vadd.s16 q2, q5 //a +d + vshll.s16 q13, d4, #1 // 2 * (a0+d0) + vshll.s16 q14, d5, #1 // 2 * (a1+d1) + vsub.s32 q9, q13 // 58*c0 - 2 * (a0+d0) + vsub.s32 q10, q14 // 58*c1 - 2 * (a1+d1) + vadd.s32 q6, q9, q11 // 58*c0 - 2 * (a0+d0) +10*b0 + vadd.s32 q7, q10, q12 // 58*c1 - 2 * (a1+d1) +10*b1 +.endm + +.macro FILTER_CHROMA_VPP a b filterv + + vpush {q4-q7} + +.loop_\filterv\()_\a\()x\b: + + mov r7, r2 + mov r6, r0 + eor r8, r8 + +.loop_w8_\filterv\()_\a\()x\b: + + add r6, r0, r8 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + + vmovl.u8 q2, d0 + vmovl.u8 q3, d1 + vmovl.u8 q4, d2 + vmovl.u8 q5, d3 + + veor.u8 q6, q6 + veor.u8 q7, q7 + + \filterv + + mov r12,#32 + vdup.32 q8, r12 + vadd.s32 q6, q8 + vqshrun.s32 d0, q6, #6 + vadd.s32 q7, q8 + vqshrun.s32 d1, q7, #6 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r7]! + + add r8, #8 + cmp r8, #\a + blt .loop_w8_\filterv\()_\a\()x\b + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_\filterv\()_\a\()x\b + vpop {q4-q7} +.endm + +.macro CHROMA_VPP w h +function x265_interp_4tap_vert_pp_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r5, [sp, #4 * 5] + sub r0, r1 + mov r4, #\h + + cmp r5, #0 + beq 0f + cmp r5, #1 + beq 1f + cmp r5, #2 + beq 2f + cmp r5, #3 + beq 3f + cmp r5, #4 + beq 4f + cmp r5, #5 + beq 5f + cmp r5, #6 + beq 6f + cmp r5, #7 + beq 7f +0: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_0_32b + b 8f +1: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_1_32b + b 8f +2: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_2_32b + b 8f +3: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_3_32b + b 8f +4: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_4_32b + b 8f +5: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_5_32b + b 8f +6: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_6_32b + b 8f +7: + FILTER_CHROMA_VPP \w \h qpel_filter_chroma_7_32b + b 8f +8: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +CHROMA_VPP 8 2 +CHROMA_VPP 8 4 +CHROMA_VPP 8 6 +CHROMA_VPP 8 8 +CHROMA_VPP 8 16 +CHROMA_VPP 8 32 +CHROMA_VPP 8 12 +CHROMA_VPP 8 64 +CHROMA_VPP 16 4 +CHROMA_VPP 16 8 +CHROMA_VPP 16 12 +CHROMA_VPP 16 16 +CHROMA_VPP 16 32 +CHROMA_VPP 16 64 +CHROMA_VPP 16 24 +CHROMA_VPP 32 8 +CHROMA_VPP 32 16 +CHROMA_VPP 32 24 +CHROMA_VPP 32 32 +CHROMA_VPP 32 64 +CHROMA_VPP 32 48 +CHROMA_VPP 24 32 +CHROMA_VPP 24 64 +CHROMA_VPP 64 16 +CHROMA_VPP 64 32 +CHROMA_VPP 64 48 +CHROMA_VPP 64 64 +CHROMA_VPP 48 64 + +.macro FILTER_CHROMA_VPS a b filterv + + vpush {q4-q7} + +.loop_vps_\filterv\()_\a\()x\b: + + mov r7, r2 + mov r6, r0 + eor r8, r8 + +.loop_vps_w8_\filterv\()_\a\()x\b: + + add r6, r0, r8 + + pld [r6] + vld1.u8 d0, [r6], r1 + pld [r6] + vld1.u8 d1, [r6], r1 + pld [r6] + vld1.u8 d2, [r6], r1 + pld [r6] + vld1.u8 d3, [r6], r1 + + vmovl.u8 q2, d0 + vmovl.u8 q3, d1 + vmovl.u8 q4, d2 + vmovl.u8 q5, d3 + + veor.u8 q6, q6 + veor.u8 q7, q7 + + \filterv + + mov r12,#8192 + vdup.32 q8, r12 + vsub.s32 q6, q8 + vqmovn.s32 d0, q6 + vsub.s32 q7, q8 + vqmovn.s32 d1, q7 + vst1.u16 {q0}, [r7]! 
+ + add r8, #8 + cmp r8, #\a + blt .loop_vps_w8_\filterv\()_\a\()x\b + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_vps_\filterv\()_\a\()x\b + vpop {q4-q7} +.endm + +.macro CHROMA_VPS w h +function x265_interp_4tap_vert_ps_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r5, [sp, #4 * 5] + lsl r3, #1 + sub r0, r1 + mov r4, #\h + + cmp r5, #0 + beq 0f + cmp r5, #1 + beq 1f + cmp r5, #2 + beq 2f + cmp r5, #3 + beq 3f + cmp r5, #4 + beq 4f + cmp r5, #5 + beq 5f + cmp r5, #6 + beq 6f + cmp r5, #7 + beq 7f +0: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_0_32b + b 8f +1: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_1_32b + b 8f +2: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_2_32b + b 8f +3: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_3_32b + b 8f +4: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_4_32b + b 8f +5: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_5_32b + b 8f +6: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_6_32b + b 8f +7: + FILTER_CHROMA_VPS \w \h qpel_filter_chroma_7_32b + b 8f +8: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +CHROMA_VPS 8 2 +CHROMA_VPS 8 4 +CHROMA_VPS 8 6 +CHROMA_VPS 8 8 +CHROMA_VPS 8 16 +CHROMA_VPS 8 32 +CHROMA_VPS 8 12 +CHROMA_VPS 8 64 +CHROMA_VPS 16 4 +CHROMA_VPS 16 8 +CHROMA_VPS 16 12 +CHROMA_VPS 16 16 +CHROMA_VPS 16 32 +CHROMA_VPS 16 64 +CHROMA_VPS 16 24 +CHROMA_VPS 32 8 +CHROMA_VPS 32 16 +CHROMA_VPS 32 24 +CHROMA_VPS 32 32 +CHROMA_VPS 32 64 +CHROMA_VPS 32 48 +CHROMA_VPS 24 32 +CHROMA_VPS 24 64 +CHROMA_VPS 64 16 +CHROMA_VPS 64 32 +CHROMA_VPS 64 48 +CHROMA_VPS 64 64 +CHROMA_VPS 48 64 + +.macro FILTER_CHROMA_VSP a b filterv + + vpush {q4-q7} + +.loop_vsp_\filterv\()_\a\()x\b: + + mov r7, r2 + mov r6, r0 + eor r8, r8 + +.loop_vsp_w8_\filterv\()_\a\()x\b: + + add r6, r0, r8 + + pld [r6] + vld1.u16 {q2}, [r6], r1 + pld [r6] + vld1.u16 {q3}, [r6], r1 + pld [r6] + vld1.u16 {q4}, [r6], r1 + pld [r6] + vld1.u16 {q5}, [r6], r1 + + veor.u8 q6, q6 + veor.u8 q7, q7 + + \filterv + + mov r12,#1 + lsl r12, #19 + add r12, #2048 + vdup.32 q8, r12 + vadd.s32 q6, q8 + vqshrun.s32 d0, q6, #12 + vadd.s32 q7, q8 + vqshrun.s32 d1, q7, #12 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r7]! 
+ + add r8, #16 + mov r12, #\a + lsl r12, #1 + cmp r8, r12 + blt .loop_vsp_w8_\filterv\()_\a\()x\b + + add r0, r1 + add r2, r3 + subs r4, #1 + bne .loop_vsp_\filterv\()_\a\()x\b + vpop {q4-q7} +.endm + +.macro CHROMA_VSP w h +function x265_interp_4tap_vert_sp_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r5, [sp, #4 * 5] + lsl r1, #1 + sub r0, r1 + mov r4, #\h + + cmp r5, #0 + beq 0f + cmp r5, #1 + beq 1f + cmp r5, #2 + beq 2f + cmp r5, #3 + beq 3f + cmp r5, #4 + beq 4f + cmp r5, #5 + beq 5f + cmp r5, #6 + beq 6f + cmp r5, #7 + beq 7f +0: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_0_32b + b 8f +1: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_1_32b + b 8f +2: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_2_32b + b 8f +3: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_3_32b + b 8f +4: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_4_32b + b 8f +5: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_5_32b + b 8f +6: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_6_32b + b 8f +7: + FILTER_CHROMA_VSP \w \h qpel_filter_chroma_7_32b + b 8f +8: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +CHROMA_VSP 8 2 +CHROMA_VSP 8 4 +CHROMA_VSP 8 6 +CHROMA_VSP 8 8 +CHROMA_VSP 8 16 +CHROMA_VSP 8 32 +CHROMA_VSP 8 12 +CHROMA_VSP 8 64 +CHROMA_VSP 16 4 +CHROMA_VSP 16 8 +CHROMA_VSP 16 12 +CHROMA_VSP 16 16 +CHROMA_VSP 16 32 +CHROMA_VSP 16 64 +CHROMA_VSP 16 24 +CHROMA_VSP 32 8 +CHROMA_VSP 32 16 +CHROMA_VSP 32 24 +CHROMA_VSP 32 32 +CHROMA_VSP 32 64 +CHROMA_VSP 32 48 +CHROMA_VSP 24 32 +CHROMA_VSP 24 64 +CHROMA_VSP 64 16 +CHROMA_VSP 64 32 +CHROMA_VSP 64 48 +CHROMA_VSP 64 64 +CHROMA_VSP 48 64 + + // void interp_horiz_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +.macro vextin8 + pld [r5] + vld1.8 {q3}, [r5]! + vext.8 d0, d6, d7, #1 + vext.8 d1, d6, d7, #2 + vext.8 d2, d6, d7, #3 + vext.8 d3, d6, d7, #4 + vext.8 d4, d6, d7, #5 + vext.8 d5, d6, d7, #6 + vext.8 d6, d6, d7, #7 +.endm + +.macro HPP_FILTER a b filterhpp + mov r12,#32 + mov r6, #\b + sub r3, #\a + mov r8, #\a + cmp r8, #4 + beq 4f + cmp r8, #12 + beq 12f + b 6f +4: + HPP_FILTER_4 \a \b \filterhpp + b 5f +12: + HPP_FILTER_12 \a \b \filterhpp + b 5f +6: +loop2_hpp_\filterhpp\()_\a\()x\b: + mov r7, #\a + lsr r7, #3 + mov r5, r0 + sub r5, #4 +loop3_hpp_\filterhpp\()_\a\()x\b: + vextin8 + \filterhpp + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r2]! + subs r7, #1 + sub r5, #8 + bne loop3_hpp_\filterhpp\()_\a\()x\b + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop2_hpp_\filterhpp\()_\a\()x\b +5: +.endm + +.macro HPP_FILTER_4 w h filterhpp +loop4_hpp_\filterhpp\()_\w\()x\h: + mov r5, r0 + sub r5, #4 + vextin8 + \filterhpp + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u32 {d0[0]}, [r2]! + sub r5, #8 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop4_hpp_\filterhpp\()_\w\()x\h +.endm + +.macro HPP_FILTER_12 w h filterhpp +loop12_hpp_\filterhpp\()_\w\()x\h: + mov r5, r0 + sub r5, #4 + vextin8 + \filterhpp + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u8 {d0}, [r2]! + sub r5, #8 + + vextin8 + \filterhpp + vdup.32 q8, r12 + vadd.s32 q9, q8 + vqshrun.s32 d0, q9, #6 + vadd.s32 q10, q8 + vqshrun.s32 d1, q10, #6 + vqmovn.u16 d0, q0 + vst1.u32 {d0[0]}, [r2]! 
+ add r2, r3 + subs r6, #1 + add r0, r1 + bne loop12_hpp_\filterhpp\()_\w\()x\h +.endm + +.macro LUMA_HPP w h +function x265_interp_horiz_pp_\w\()x\h\()_neon + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #20] + cmp r4, #0 + beq 0f + cmp r4, #1 + beq 1f + cmp r4, #2 + beq 2f + cmp r4, #3 + beq 3f +0: + HPP_FILTER \w \h qpel_filter_0_32b + b 5f +1: + HPP_FILTER \w \h qpel_filter_1_32b + b 5f +2: + HPP_FILTER \w \h qpel_filter_2_32b + b 5f +3: + HPP_FILTER \w \h qpel_filter_3_32b + b 5f +5: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +LUMA_HPP 4 4 +LUMA_HPP 4 8 +LUMA_HPP 4 16 +LUMA_HPP 8 4 +LUMA_HPP 8 8 +LUMA_HPP 8 16 +LUMA_HPP 8 32 +LUMA_HPP 12 16 +LUMA_HPP 16 4 +LUMA_HPP 16 8 +LUMA_HPP 16 12 +LUMA_HPP 16 16 +LUMA_HPP 16 32 +LUMA_HPP 16 64 +LUMA_HPP 24 32 +LUMA_HPP 32 8 +LUMA_HPP 32 16 +LUMA_HPP 32 24 +LUMA_HPP 32 32 +LUMA_HPP 32 64 +LUMA_HPP 48 64 +LUMA_HPP 64 16 +LUMA_HPP 64 32 +LUMA_HPP 64 48 +LUMA_HPP 64 64 + +// void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +.macro HPS_FILTER a b filterhps + mov r12, #8192 + mov r6, r10 + sub r3, #\a + lsl r3, #1 + + mov r8, #\a + cmp r8, #4 + beq 14f + cmp r8, #12 + beq 15f + b 7f +14: + HPS_FILTER_4 \a \b \filterhps + b 10f +15: + HPS_FILTER_12 \a \b \filterhps + b 10f +7: + cmp r9, #0 + beq 8f + cmp r9, #1 + beq 9f +8: +loop1_hps_\filterhps\()_\a\()x\b\()_rowext0: + mov r7, #\a + lsr r7, #3 + mov r5, r0 + sub r5, #4 +loop2_hps_\filterhps\()_\a\()x\b\()_rowext0: + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vsub.s32 q10, q8 + vmovn.u32 d0, q9 + vmovn.u32 d1, q10 + vst1.s16 {q0}, [r2]! + subs r7, #1 + sub r5, #8 + bne loop2_hps_\filterhps\()_\a\()x\b\()_rowext0 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop1_hps_\filterhps\()_\a\()x\b\()_rowext0 + b 10f +9: +loop3_hps_\filterhps\()_\a\()x\b\()_rowext1: + mov r7, #\a + lsr r7, #3 + mov r5, r0 + sub r5, #4 +loop4_hps_\filterhps\()_\a\()x\b\()_rowext1: + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vsub.s32 q10, q8 + vmovn.u32 d0, q9 + vmovn.u32 d1, q10 + vst1.s16 {q0}, [r2]! + subs r7, #1 + sub r5, #8 + bne loop4_hps_\filterhps\()_\a\()x\b\()_rowext1 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop3_hps_\filterhps\()_\a\()x\b\()_rowext1 +10: +.endm + +.macro HPS_FILTER_4 w h filterhps + cmp r9, #0 + beq 11f + cmp r9, #1 + beq 12f +11: +loop4_hps_\filterhps\()_\w\()x\h\()_rowext0: + mov r5, r0 + sub r5, #4 + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vmovn.u32 d0, q9 + vst1.s16 {d0}, [r2]! + sub r5, #8 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop4_hps_\filterhps\()_\w\()x\h\()_rowext0 + b 13f +12: +loop5_hps_\filterhps\()_\w\()x\h\()_rowext1: + mov r5, r0 + sub r5, #4 + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vmovn.u32 d0, q9 + vst1.s16 {d0}, [r2]! + sub r5, #8 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop5_hps_\filterhps\()_\w\()x\h\()_rowext1 +13: +.endm + +.macro HPS_FILTER_12 w h filterhps + cmp r9, #0 + beq 14f + cmp r9, #1 + beq 15f +14: +loop12_hps_\filterhps\()_\w\()x\h\()_rowext0: + mov r5, r0 + sub r5, #4 + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vsub.s32 q10, q8 + vmovn.u32 d0, q9 + vmovn.u32 d1, q10 + vst1.s16 {q0}, [r2]! + sub r5, #8 + + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vmovn.u32 d0, q9 + vst1.s16 {d0}, [r2]! 
+ add r2, r3 + subs r6, #1 + add r0, r1 + bne loop12_hps_\filterhps\()_\w\()x\h\()_rowext0 + b 16f +15: +loop12_hps_\filterhps\()_\w\()x\h\()_rowext1: + mov r5, r0 + sub r5, #4 + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vsub.s32 q10, q8 + vmovn.u32 d0, q9 + vmovn.u32 d1, q10 + vst1.s16 {q0}, [r2]! + sub r5, #8 + + vextin8 + \filterhps + vdup.32 q8, r12 + vsub.s32 q9, q8 + vmovn.u32 d0, q9 + vst1.s16 {d0}, [r2]! + add r2, r3 + subs r6, #1 + add r0, r1 + bne loop12_hps_\filterhps\()_\w\()x\h\()_rowext1 +16: +.endm + +.macro LUMA_HPS w h +function x265_interp_horiz_ps_\w\()x\h\()_neon + push {r4, r5, r6, r7, r8, r9, r10} + ldr r4, [sp, #28] + ldr r9, [sp, #32] + mov r10, #\h + cmp r9, #0 + beq 6f + sub r0, r0, r1, lsl #2 + add r0, r1 + add r10, #7 +6: + cmp r4, #0 + beq 0f + cmp r4, #1 + beq 1f + cmp r4, #2 + beq 2f + cmp r4, #3 + beq 3f +0: + HPS_FILTER \w \h qpel_filter_0_32b + b 5f +1: + HPS_FILTER \w \h qpel_filter_1_32b + b 5f +2: + HPS_FILTER \w \h qpel_filter_2_32b + b 5f +3: + HPS_FILTER \w \h qpel_filter_3_32b + b 5f +5: + pop {r4, r5, r6, r7, r8, r9, r10} + bx lr +endfunc +.endm + +LUMA_HPS 4 4 +LUMA_HPS 4 8 +LUMA_HPS 4 16 +LUMA_HPS 8 4 +LUMA_HPS 8 8 +LUMA_HPS 8 16 +LUMA_HPS 8 32 +LUMA_HPS 12 16 +LUMA_HPS 16 4 +LUMA_HPS 16 8 +LUMA_HPS 16 12 +LUMA_HPS 16 16 +LUMA_HPS 16 32 +LUMA_HPS 16 64 +LUMA_HPS 24 32 +LUMA_HPS 32 8 +LUMA_HPS 32 16 +LUMA_HPS 32 24 +LUMA_HPS 32 32 +LUMA_HPS 32 64 +LUMA_HPS 48 64 +LUMA_HPS 64 16 +LUMA_HPS 64 32 +LUMA_HPS 64 48 +LUMA_HPS 64 64 + +// ******* Chroma_hpp ******* +.macro vextin8_chroma + pld [r5] + vld1.8 {q3}, [r5]! + vext.8 d0, d6, d7, #1 + vext.8 d1, d6, d7, #2 + vext.8 d2, d6, d7, #3 + vext.8 d3, d6, d7, #4 + + vmovl.u8 q2, d0 + vmovl.u8 q3, d1 + vmovl.u8 q4, d2 + vmovl.u8 q5, d3 +.endm + +.macro FILTER_CHROMA_HPP a b filterhpp + vpush {q4-q7} + mov r12,#32 + mov r6, #\b + sub r3, #\a + mov r8, #\a + cmp r8, #4 + beq 11f + cmp r8, #12 + beq 12f + b 13f +11: + FILTER_CHROMA_HPP_4 \a \b \filterhpp + b 14f +12: + FILTER_CHROMA_HPP_12 \a \b \filterhpp + b 14f +13: + veor q6, q6 + veor q7, q7 + +loop2_hpp_\filterhpp\()_\a\()x\b: + mov r7, #\a + lsr r7, #3 + mov r5, r0 + sub r5, #2 +loop3_hpp_\filterhpp\()_\a\()x\b: + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vadd.s32 q6, q8 + vqshrun.s32 d0, q6, #6 + vadd.s32 q7, q8 + vqshrun.s32 d1, q7, #6 + vqmovn.u16 d0, q0 + vst1.u8 d0, [r2]! + subs r7, #1 + sub r5, #8 + bne loop3_hpp_\filterhpp\()_\a\()x\b + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop2_hpp_\filterhpp\()_\a\()x\b +14: + vpop {q4-q7} +.endm + +.macro FILTER_CHROMA_HPP_4 w h filterhpp +loop4_hpp_\filterhpp\()_\w\()x\h: + mov r5, r0 + sub r5, #2 + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vadd.s32 q6, q8 + vqshrun.s32 d0, q6, #6 + vadd.s32 q7, q8 + vqshrun.s32 d1, q7, #6 + vqmovn.u16 d0, q0 + vst1.u32 {d0[0]}, [r2]! + sub r5, #8 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop4_hpp_\filterhpp\()_\w\()x\h +.endm + +.macro FILTER_CHROMA_HPP_12 w h filterhpp +loop12_hpp_\filterhpp\()_\w\()x\h: + mov r5, r0 + sub r5, #2 + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vadd.s32 q6, q8 + vqshrun.s32 d0, q6, #6 + vadd.s32 q7, q8 + vqshrun.s32 d1, q7, #6 + vqmovn.u16 d0, q0 + vst1.u8 {d0}, [r2]! + sub r5, #8 + + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vadd.s32 q6, q8 + vqshrun.s32 d0, q6, #6 + vadd.s32 q7, q8 + vqshrun.s32 d1, q7, #6 + vqmovn.u16 d0, q0 + vst1.u32 {d0[0]}, [r2]! 
+ add r2, r3 + subs r6, #1 + add r0, r1 + bne loop12_hpp_\filterhpp\()_\w\()x\h +.endm + +.macro CHROMA_HPP w h +function x265_interp_4tap_horiz_pp_\w\()x\h\()_neon + + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #4 * 5] + + cmp r4, #0 + beq 0f + cmp r4, #1 + beq 1f + cmp r4, #2 + beq 2f + cmp r4, #3 + beq 3f + cmp r4, #4 + beq 4f + cmp r4, #5 + beq 5f + cmp r4, #6 + beq 6f + cmp r4, #7 + beq 7f +0: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_0_32b + b 8f +1: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_1_32b + b 8f +2: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_2_32b + b 8f +3: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_3_32b + b 8f +4: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_4_32b + b 8f +5: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_5_32b + b 8f +6: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_6_32b + b 8f +7: + FILTER_CHROMA_HPP \w \h qpel_filter_chroma_7_32b + +8: + pop {r4, r5, r6, r7, r8} + bx lr +endfunc +.endm + +CHROMA_HPP 4 2 +CHROMA_HPP 4 4 +CHROMA_HPP 4 8 +CHROMA_HPP 4 16 +CHROMA_HPP 4 32 +CHROMA_HPP 8 2 +CHROMA_HPP 8 4 +CHROMA_HPP 8 6 +CHROMA_HPP 8 8 +CHROMA_HPP 8 12 +CHROMA_HPP 8 16 +CHROMA_HPP 8 32 +CHROMA_HPP 8 64 +CHROMA_HPP 12 16 +CHROMA_HPP 12 32 +CHROMA_HPP 16 4 +CHROMA_HPP 16 8 +CHROMA_HPP 16 12 +CHROMA_HPP 16 16 +CHROMA_HPP 16 24 +CHROMA_HPP 16 32 +CHROMA_HPP 16 64 +CHROMA_HPP 24 32 +CHROMA_HPP 24 64 +CHROMA_HPP 32 8 +CHROMA_HPP 32 16 +CHROMA_HPP 32 24 +CHROMA_HPP 32 32 +CHROMA_HPP 32 48 +CHROMA_HPP 32 64 +CHROMA_HPP 48 64 +CHROMA_HPP 64 16 +CHROMA_HPP 64 32 +CHROMA_HPP 64 48 +CHROMA_HPP 64 64 +// ***** Chroma_hps ***** +.macro FILTER_CHROMA_HPS a b filterhps + vpush {q4-q7} + mov r12, #8192 + mov r6, r10 + sub r3, #\a + lsl r3, #1 + + mov r8, #\a + cmp r8, #4 + beq 14f + cmp r8, #12 + beq 15f + b 16f +14: + FILTER_CHROMA_HPS_4 \a \b \filterhps + b 10f +15: + FILTER_CHROMA_HPS_12 \a \b \filterhps + b 10f +16: + cmp r9, #0 + beq 17f + cmp r9, #1 + beq 18f +17: +loop1_hps_\filterhps\()_\a\()x\b\()_rowext0: + mov r7, #\a + lsr r7, #3 + mov r5, r0 + sub r5, #2 +loop2_hps_\filterhps\()_\a\()x\b\()_rowext0: + vextin8_chroma + \filterhps + vdup.32 q8, r12 + vsub.s32 q6, q8 + vsub.s32 q7, q8 + vmovn.u32 d0, q6 + vmovn.u32 d1, q7 + vst1.s16 {q0}, [r2]! + subs r7, #1 + sub r5, #8 + bne loop2_hps_\filterhps\()_\a\()x\b\()_rowext0 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop1_hps_\filterhps\()_\a\()x\b\()_rowext0 + b 10f +18: +loop3_hps_\filterhps\()_\a\()x\b\()_rowext1: + mov r7, #\a + lsr r7, #3 + mov r5, r0 + sub r5, #2 +loop4_hps_\filterhps\()_\a\()x\b\()_rowext1: + vextin8_chroma + \filterhps + vdup.32 q8, r12 + vsub.s32 q6, q8 + vsub.s32 q7, q8 + vmovn.u32 d0, q6 + vmovn.u32 d1, q7 + vst1.s16 {q0}, [r2]! + subs r7, #1 + sub r5, #8 + bne loop4_hps_\filterhps\()_\a\()x\b\()_rowext1 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop3_hps_\filterhps\()_\a\()x\b\()_rowext1 +10: + vpop {q4-q7} +.endm + +.macro FILTER_CHROMA_HPS_4 w h filterhps + cmp r9, #0 + beq 19f + cmp r9, #1 + beq 20f +19: +loop4_hps_\filterhps\()_\w\()x\h\()_rowext0: + mov r5, r0 + sub r5, #2 + vextin8_chroma + \filterhps + vdup.32 q8, r12 + vsub.s32 q6, q8 + vmovn.u32 d0, q6 + vst1.s16 {d0}, [r2]! + sub r5, #8 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop4_hps_\filterhps\()_\w\()x\h\()_rowext0 + b 21f +20: +loop5_hps_\filterhps\()_\w\()x\h\()_rowext1: + mov r5, r0 + sub r5, #2 + vextin8_chroma + \filterhps + vdup.32 q8, r12 + vsub.s32 q6, q8 + vmovn.u32 d0, q6 + vst1.s16 {d0}, [r2]! 
+ sub r5, #8 + subs r6, #1 + add r0, r1 + add r2, r3 + bne loop5_hps_\filterhps\()_\w\()x\h\()_rowext1 +21: +.endm + +.macro FILTER_CHROMA_HPS_12 w h filterhpp + cmp r9, #0 + beq 22f + cmp r9, #1 + beq 23f +22: +loop12_hps_\filterhpp\()_\w\()x\h\()_rowext0: + mov r5, r0 + sub r5, #2 + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vsub.s32 q6, q8 + vsub.s32 q7, q8 + vmovn.u32 d0, q6 + vmovn.u32 d1, q7 + vst1.s16 {q0}, [r2]! + sub r5, #8 + + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vsub.s32 q6, q8 + vmovn.u32 d0, q6 + vst1.s16 {d0}, [r2]! + add r2, r3 + subs r6, #1 + add r0, r1 + bne loop12_hps_\filterhpp\()_\w\()x\h\()_rowext0 + b 24f +23: +loop12_hps_\filterhpp\()_\w\()x\h\()_rowext1: + mov r5, r0 + sub r5, #2 + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vsub.s32 q6, q8 + vsub.s32 q7, q8 + vmovn.u32 d0, q6 + vmovn.u32 d1, q7 + vst1.s16 {q0}, [r2]! + sub r5, #8 + + vextin8_chroma + \filterhpp + vdup.32 q8, r12 + vsub.s32 q6, q8 + vmovn.u32 d0, q6 + vst1.s16 {d0}, [r2]! + add r2, r3 + subs r6, #1 + add r0, r1 + bne loop12_hps_\filterhpp\()_\w\()x\h\()_rowext1 +24: +.endm + +.macro CHROMA_HPS w h +function x265_interp_4tap_horiz_ps_\w\()x\h\()_neon + push {r4, r5, r6, r7, r8, r9, r10} + ldr r4, [sp, #28] + ldr r9, [sp, #32] + mov r10, #\h + cmp r9, #0 + beq 9f + sub r0, r1 + add r10, #3 +9: + cmp r4, #0 + beq 0f + cmp r4, #1 + beq 1f + cmp r4, #2 + beq 2f + cmp r4, #3 + beq 3f + cmp r4, #4 + beq 4f + cmp r4, #5 + beq 5f + cmp r4, #6 + beq 6f + cmp r4, #7 + beq 7f +0: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_0_32b + b 8f +1: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_1_32b + b 8f +2: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_2_32b + b 8f +3: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_3_32b + b 8f +4: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_4_32b + b 8f +5: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_5_32b + b 8f +6: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_6_32b + b 8f +7: + FILTER_CHROMA_HPS \w \h qpel_filter_chroma_7_32b + +8: + pop {r4, r5, r6, r7, r8, r9, r10} + bx lr +endfunc +.endm + +CHROMA_HPS 4 2 +CHROMA_HPS 4 4 +CHROMA_HPS 4 8 +CHROMA_HPS 4 16 +CHROMA_HPS 4 32 +CHROMA_HPS 8 2 +CHROMA_HPS 8 4 +CHROMA_HPS 8 6 +CHROMA_HPS 8 8 +CHROMA_HPS 8 12 +CHROMA_HPS 8 16 +CHROMA_HPS 8 32 +CHROMA_HPS 8 64 +CHROMA_HPS 12 16 +CHROMA_HPS 12 32 +CHROMA_HPS 16 4 +CHROMA_HPS 16 8 +CHROMA_HPS 16 12 +CHROMA_HPS 16 16 +CHROMA_HPS 16 24 +CHROMA_HPS 16 32 +CHROMA_HPS 16 64 +CHROMA_HPS 24 32 +CHROMA_HPS 24 64 +CHROMA_HPS 32 8 +CHROMA_HPS 32 16 +CHROMA_HPS 32 24 +CHROMA_HPS 32 32 +CHROMA_HPS 32 48 +CHROMA_HPS 32 64 +CHROMA_HPS 48 64 +CHROMA_HPS 64 16 +CHROMA_HPS 64 32 +CHROMA_HPS 64 48 +CHROMA_HPS 64 64
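For orientation, the arithmetic behind the NEON kernels in this file can be summarized in a few lines of C. This is only an illustrative sketch, not x265's reference code (the names p2s_ref, vert_pp_ref and luma_coeff are ours); the coefficients and constants come straight from the assembly above: the g_luma_s16 table, the "src -= 3 * srcStride" prologue, the <<6 / -8192 pixel-to-short offset, and the +32 / >>6 rounding pair.

#include <stddef.h>
#include <stdint.h>

/* Coefficient rows copied from the g_luma_s16 table in the patch. */
static const int16_t luma_coeff[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

static uint8_t clip8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* filterPixelToShort: widen pixels to the 14-bit intermediate domain,
 * dst = (src << 6) - 8192, matching the vshll #6 / -8192 sequences. */
static void p2s_ref(const uint8_t* src, ptrdiff_t srcStride,
                    int16_t* dst, ptrdiff_t dstStride, int w, int h)
{
    for (int y = 0; y < h; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < w; x++)
            dst[x] = (int16_t)((src[x] << 6) - 8192);
}

/* interp_8tap_vert_pp: 8-tap weighted sum down a column, rounded with
 * +32 and >> 6, then clipped (the vdup #32 / vqshrun #6 pair above). */
static void vert_pp_ref(const uint8_t* src, ptrdiff_t srcStride,
                        uint8_t* dst, ptrdiff_t dstStride,
                        int w, int h, int coeffIdx)
{
    src -= 3 * srcStride;  /* same as the "src -= 3 * srcStride" prologue */
    for (int y = 0; y < h; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < w; x++)
        {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += luma_coeff[coeffIdx][k] * src[k * srcStride + x];
            dst[x] = clip8((sum + 32) >> 6);
        }
}

The "ps" variants store the unclipped intermediate (sum - 8192) as int16_t via a saturating narrow with no shift, and the "sp" variants consume such intermediates, rounding with (1 << 19) + 2048 before a saturating shift by 12 — the same immediates visible in the macros above. The 4-tap chroma paths follow the identical pattern with the g_chromaFilter coefficients.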
View file
x265_2.0.tar.gz/source/common/arm/ipfilter8.h
Added
@@ -0,0 +1,342 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Steve Borho <steve@borho.org> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_IPFILTER8_ARM_H +#define X265_IPFILTER8_ARM_H + +void x265_filterPixelToShort_4x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_4x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_4x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_12x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void 
x265_filterPixelToShort_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); +void x265_filterPixelToShort_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); + +void x265_interp_8tap_vert_pp_4x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_4x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_4x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_8x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_8x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_8x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_8x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_16x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_16x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_16x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_16x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_16x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_16x12_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_32x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_32x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_32x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_32x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_32x24_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_64x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_64x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_64x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_64x48_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_24x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_48x64_neon(const pixel* src, intptr_t srcStride, 
pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_pp_12x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_8tap_vert_sp_4x4_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_4x8_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_4x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_8x4_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_8x8_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_8x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_8x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_16x4_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_16x8_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_16x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_16x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_16x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_16x12_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_32x8_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_32x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_32x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_32x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_32x24_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_64x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_64x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_64x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_64x48_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_24x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_48x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_sp_12x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_8tap_vert_ps_4x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void 
x265_interp_8tap_vert_ps_4x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_4x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_32x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_8tap_vert_ps_12x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_4tap_vert_pp_8x2_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x6_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t 
dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_8x12_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x12_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_16x24_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_32x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_32x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_32x24_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_32x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_32x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_32x48_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_24x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_24x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_48x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_64x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_64x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_64x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_pp_64x48_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_4tap_vert_ps_8x2_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x6_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t 
dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_8x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_16x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_32x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_32x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_24x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_ps_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_4tap_vert_sp_8x2_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x4_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x6_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x8_neon(const int16_t* 
src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_8x12_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x4_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x8_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x12_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_16x24_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_32x8_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_32x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_32x24_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_32x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_32x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_32x48_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_24x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_24x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_48x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_64x16_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_64x32_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_64x64_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_vert_sp_64x48_neon(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_horiz_pp_4x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_4x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_4x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void 
x265_interp_horiz_pp_8x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_8x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_8x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_8x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_12x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_16x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_16x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_16x12_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_16x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_16x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_16x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_24x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_32x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_32x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_32x24_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_32x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_32x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_48x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_64x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_64x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_64x48_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_horiz_pp_64x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); + +void x265_interp_horiz_ps_4x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_4x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_4x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void 
x265_interp_horiz_ps_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_12x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_32x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_horiz_ps_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); + +void x265_interp_4tap_horiz_pp_4x2_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_4x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x2_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void 
x265_interp_4tap_horiz_pp_8x6_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x12_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_8x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_12x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_12x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x4_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x12_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x24_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_16x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_24x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_24x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x8_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x24_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x48_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_32x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_48x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x16_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x32_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x48_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); +void x265_interp_4tap_horiz_pp_64x64_neon(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, 
int coeffIdx); + +void x265_interp_4tap_horiz_ps_4x2_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_4x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_4x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_4x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_4x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x2_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x6_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_8x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_12x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_12x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x4_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x12_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_16x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_24x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_24x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_32x8_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t 
dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_32x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_32x24_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_32x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_32x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_32x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_48x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_64x16_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_64x32_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_64x48_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +void x265_interp_4tap_horiz_ps_64x64_neon(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); +#endif // ifndef X265_IPFILTER8_ARM_H
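The prototypes above follow x265's ARM primitive naming scheme: _pp kernels map pixel to pixel, _ps pixel to 16-bit intermediate, _sp 16-bit intermediate back to pixel; the WxH suffix is the partition size, and coeffIdx selects the interpolation filter phase. As a rough guide to what the filterPixelToShort kernels compute, the following is a minimal scalar C sketch, assuming an 8-bit build and x265's usual 14-bit intermediate convention (IF_INTERNAL_PREC = 14, IF_INTERNAL_OFFS = 8192); those constants are an assumption, not taken from this patch:

#include <stdint.h>

typedef uint8_t pixel;  /* 8-bit build assumed */

/* Scalar sketch: widen pixels into the signed 16-bit intermediate domain.
 * Each WxH NEON variant declared above fixes width/height at its block size. */
static void filterPixelToShort_c(const pixel* src, intptr_t srcStride,
                                 int16_t* dst, intptr_t dstStride,
                                 int width, int height)
{
    const int shift  = 14 - 8; /* IF_INTERNAL_PREC - bit depth (assumed) */
    const int offset = 8192;   /* IF_INTERNAL_OFFS (assumed) */
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) - offset);
        src += srcStride;
        dst += dstStride;
    }
}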
View file
x265_2.0.tar.gz/source/common/arm/loopfilter.h
Added
@@ -0,0 +1,29 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * Praveen Kumar Tiwari <praveen@multicorewareinc.com> + * Min Chen <chenm003@163.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_LOOPFILTER_ARM_H +#define X265_LOOPFILTER_ARM_H + +#endif // ifndef X265_LOOPFILTER_ARM_H
View file
x265_2.0.tar.gz/source/common/arm/mc-a.S
Added
@@ -0,0 +1,1172 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * Radhakrishnan <radhakrishnan@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata + +.align 4 + +.text + +/* blockcopy_pp_16x16(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) + * + * r0 - dst + * r1 - dstStride + * r2 - src + * r3 - srcStride */ +function x265_blockcopy_pp_16x16_neon +.rept 16 + vld1.8 {q0}, [r2] + vst1.8 {q0}, [r0] + add r2, r2, r3 + add r0, r0, r1 +.endr + bx lr +endfunc + +.macro blockcopy_pp_4xN_neon h +function x265_blockcopy_pp_4x\h\()_neon +.rept \h + ldr r12, [r2], r3 + str r12, [r0], r1 +.endr + bx lr +endfunc +.endm + +blockcopy_pp_4xN_neon 4 +blockcopy_pp_4xN_neon 8 +blockcopy_pp_4xN_neon 16 +blockcopy_pp_4xN_neon 2 +blockcopy_pp_4xN_neon 32 + +.macro blockcopy_pp_16xN_neon h +function x265_blockcopy_pp_16x\h\()_neon +.rept \h + vld1.8 {q0}, [r2], r3 + vst1.8 {q0}, [r0], r1 +.endr + bx lr +endfunc +.endm + +blockcopy_pp_16xN_neon 4 +blockcopy_pp_16xN_neon 8 +blockcopy_pp_16xN_neon 12 +blockcopy_pp_16xN_neon 24 + +.macro blockcopy_pp_16xN1_neon h i +function x265_blockcopy_pp_16x\h\()_neon + mov r12, #\i +loop_16x\h\(): +.rept 8 + vld1.8 {q0}, [r2], r3 + vst1.8 {q0}, [r0], r1 +.endr + subs r12, r12, #1 + bne loop_16x\h + bx lr +endfunc +.endm + +blockcopy_pp_16xN1_neon 32 4 +blockcopy_pp_16xN1_neon 64 8 + +.macro blockcopy_pp_8xN_neon h +function x265_blockcopy_pp_8x\h\()_neon +.rept \h + vld1.8 {d0}, [r2], r3 + vst1.8 {d0}, [r0], r1 +.endr + bx lr +endfunc +.endm + +blockcopy_pp_8xN_neon 4 +blockcopy_pp_8xN_neon 8 +blockcopy_pp_8xN_neon 16 +blockcopy_pp_8xN_neon 32 +blockcopy_pp_8xN_neon 2 +blockcopy_pp_8xN_neon 6 +blockcopy_pp_8xN_neon 12 + +function x265_blockcopy_pp_12x16_neon + sub r3, #8 + sub r1, #8 +.rept 16 + vld1.8 {d0}, [r2]! + ldr r12, [r2], r3 + vst1.8 {d0}, [r0]! 
+ str r12, [r0], r1 +.endr + bx lr +endfunc + +function x265_blockcopy_pp_24x32_neon + mov r12, #4 +loop_24x32: +.rept 8 + vld1.8 {d0, d1, d2}, [r2], r3 + vst1.8 {d0, d1, d2}, [r0], r1 +.endr + subs r12, r12, #1 + bne loop_24x32 + bx lr +endfunc + +function x265_blockcopy_pp_32x8_neon +.rept 8 + vld1.8 {q0, q1}, [r2], r3 + vst1.8 {q0, q1}, [r0], r1 +.endr + bx lr +endfunc + +.macro blockcopy_pp_32xN_neon h i +function x265_blockcopy_pp_32x\h\()_neon + mov r12, #\i +loop_32x\h\(): +.rept 8 + vld1.8 {q0, q1}, [r2], r3 + vst1.8 {q0, q1}, [r0], r1 +.endr + subs r12, r12, #1 + bne loop_32x\h + bx lr +endfunc +.endm + +blockcopy_pp_32xN_neon 16 2 +blockcopy_pp_32xN_neon 24 3 +blockcopy_pp_32xN_neon 32 4 +blockcopy_pp_32xN_neon 64 8 +blockcopy_pp_32xN_neon 48 6 + +function x265_blockcopy_pp_48x64_neon + mov r12, #8 + sub r3, #32 + sub r1, #32 +loop_48x64: +.rept 8 + vld1.8 {q0, q1}, [r2]! + vld1.8 {q2}, [r2], r3 + vst1.8 {q0, q1}, [r0]! + vst1.8 {q2}, [r0], r1 +.endr + subs r12, r12, #1 + bne loop_48x64 + bx lr +endfunc + +.macro blockcopy_pp_64xN_neon h i +function x265_blockcopy_pp_64x\h\()_neon + mov r12, #\i + sub r3, #32 + sub r1, #32 +loop_64x\h\(): +.rept 4 + vld1.8 {q0, q1}, [r2]! + vld1.8 {q2, q3}, [r2], r3 + vst1.8 {q0, q1}, [r0]! + vst1.8 {q2, q3}, [r0], r1 +.endr + subs r12, r12, #1 + bne loop_64x\h + bx lr +endfunc +.endm + +blockcopy_pp_64xN_neon 16 4 +blockcopy_pp_64xN_neon 32 8 +blockcopy_pp_64xN_neon 48 12 +blockcopy_pp_64xN_neon 64 16 + +.macro blockcopy_pp_2xN_neon h +function x265_blockcopy_pp_2x\h\()_neon +.rept \h + ldrh r12, [r2], r3 + strh r12, [r0], r1 +.endr + bx lr +endfunc +.endm + +blockcopy_pp_2xN_neon 4 +blockcopy_pp_2xN_neon 8 +blockcopy_pp_2xN_neon 16 + +.macro blockcopy_pp_6xN_neon h i +function x265_blockcopy_pp_6x\h\()_neon + sub r1, #4 +.rept \i + vld1.8 {d0}, [r2], r3 + vld1.8 {d1}, [r2], r3 + vst1.32 {d0[0]}, [r0]! + vst1.16 {d0[2]}, [r0], r1 + vst1.32 {d1[0]}, [r0]! + vst1.16 {d1[2]}, [r0], r1 +.endr + bx lr +endfunc +.endm +blockcopy_pp_6xN_neon 8 4 +blockcopy_pp_6xN_neon 16 8 + +function x265_blockcopy_pp_8x64_neon + mov r12, #4 +loop_pp_8x64: + subs r12, #1 +.rept 16 + vld1.8 {d0}, [r2], r3 + vst1.8 {d0}, [r0], r1 +.endr + bne loop_pp_8x64 + bx lr +endfunc + +function x265_blockcopy_pp_12x32_neon + push {r4} + sub r3, #8 + sub r1, #8 + mov r12, #4 +loop_pp_12x32: + subs r12, #1 +.rept 8 + vld1.8 {d0}, [r2]! + ldr r4, [r2], r3 + vst1.8 {d0}, [r0]! 
+ str r4, [r0], r1 +.endr + bne loop_pp_12x32 + pop {r4} + bx lr +endfunc + +function x265_blockcopy_pp_24x64_neon + mov r12, #4 +loop_24x64: +.rept 16 + vld1.8 {d0, d1, d2}, [r2], r3 + vst1.8 {d0, d1, d2}, [r0], r1 +.endr + subs r12, r12, #1 + bne loop_24x64 + bx lr +endfunc + +// void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +.macro pixel_avg_pp_4xN_neon h +function x265_pixel_avg_pp_4x\h\()_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept \h + vld1.32 {d0[]}, [r2], r3 + vld1.32 {d1[]}, [r4], r12 + vrhadd.u8 d2, d0, d1 + vst1.32 {d2[0]}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc +.endm + +pixel_avg_pp_4xN_neon 4 +pixel_avg_pp_4xN_neon 8 +pixel_avg_pp_4xN_neon 16 + +.macro pixel_avg_pp_8xN_neon h +function x265_pixel_avg_pp_8x\h\()_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept \h + vld1.8 {d0}, [r2], r3 + vld1.8 {d1}, [r4], r12 + vrhadd.u8 d2, d0, d1 + vst1.8 {d2}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc +.endm + +pixel_avg_pp_8xN_neon 4 +pixel_avg_pp_8xN_neon 8 +pixel_avg_pp_8xN_neon 16 +pixel_avg_pp_8xN_neon 32 + +function x265_pixel_avg_pp_12x16_neon + push {r4, r6} + mov r6, #8 + ldr r4, [sp, #8] + ldr r12, [sp, #12] + sub r1, r6 + sub r3, r6 + sub r12, r6 +.rept 16 + vld1.32 {d0}, [r2]! + vld1.32 {d1[0]}, [r2], r3 + vld1.32 {d2}, [r4]! + vld1.32 {d3[0]}, [r4], r12 + vrhadd.u8 d0, d0, d2 + vrhadd.u8 d1, d1, d3 + vst1.8 {d0}, [r0]! + vst1.32 {d1[0]}, [r0], r1 +.endr + pop {r4, r6} + bx lr +endfunc + +.macro pixel_avg_pp_16xN_neon h +function x265_pixel_avg_pp_16x\h\()_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept \h + vld1.8 {q0}, [r2], r3 + vld1.8 {q1}, [r4], r12 + vrhadd.u8 q2, q0, q1 + vst1.8 {q2}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc +.endm + +pixel_avg_pp_16xN_neon 4 +pixel_avg_pp_16xN_neon 8 +pixel_avg_pp_16xN_neon 12 +pixel_avg_pp_16xN_neon 16 +pixel_avg_pp_16xN_neon 32 + +function x265_pixel_avg_pp_16x64_neon + push {r4, r6} + ldr r4, [sp, #8] + ldr r12, [sp, #12] + mov r6, #8 +lpavg_16x64: +.rept 8 + vld1.8 {q0}, [r2], r3 + vld1.8 {q1}, [r4], r12 + vrhadd.u8 q2, q0, q1 + vst1.8 {q2}, [r0], r1 +.endr + subs r6, r6, #1 + bne lpavg_16x64 + pop {r4 , r6} + bx lr +endfunc + +function x265_pixel_avg_pp_24x32_neon + push {r4, r6} + ldr r4, [sp, #8] + ldr r12, [sp, #12] + mov r6, #4 +lpavg_24x32: +.rept 8 + vld1.8 {d0, d1, d2}, [r2], r3 + vld1.8 {d3, d4, d5}, [r4], r12 + vrhadd.u8 d0, d0, d3 + vrhadd.u8 d1, d1, d4 + vrhadd.u8 d2, d2, d5 + vst1.8 {d0, d1, d2}, [r0], r1 +.endr + subs r6, r6, #1 + bne lpavg_24x32 + pop {r4, r6} + bx lr +endfunc + +.macro pixel_avg_pp_32xN_neon h +function x265_pixel_avg_pp_32x\h\()_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept \h + vld1.8 {q0, q1}, [r2], r3 + vld1.8 {q2, q3}, [r4], r12 + vrhadd.u8 q0, q0, q2 + vrhadd.u8 q1, q1, q3 + vst1.8 {q0, q1}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc +.endm + +pixel_avg_pp_32xN_neon 8 +pixel_avg_pp_32xN_neon 16 +pixel_avg_pp_32xN_neon 24 + +.macro pixel_avg_pp_32xN1_neon h i +function x265_pixel_avg_pp_32x\h\()_neon + push {r4, r6} + ldr r4, [sp, #8] + ldr r12, [sp, #12] + mov r6, #\i +lpavg_32x\h\(): +.rept 8 + vld1.8 {q0, q1}, [r2], r3 + vld1.8 {q2, q3}, [r4], r12 + vrhadd.u8 q0, q0, q2 + vrhadd.u8 q1, q1, q3 + vst1.8 {q0, q1}, [r0], r1 +.endr + subs r6, r6, #1 + bne lpavg_32x\h + pop {r4, r6} + bx lr +endfunc +.endm + +pixel_avg_pp_32xN1_neon 32 4 +pixel_avg_pp_32xN1_neon 64 8 + +function x265_pixel_avg_pp_48x64_neon + push {r4, r6, r7} + ldr r4, [sp, 
#12] + ldr r12, [sp, #16] + mov r6, #8 + mov r7, #32 + sub r1, r7 + sub r3, r7 + sub r12, r7 +lpavg_48x64: +.rept 8 + vld1.8 {q0, q1}, [r2]! + vld1.8 {q2}, [r2], r3 + vld1.8 {q8, q9}, [r4]! + vld1.8 {q10}, [r4], r12 + vrhadd.u8 q0, q0, q8 + vrhadd.u8 q1, q1, q9 + vrhadd.u8 q2, q2, q10 + vst1.8 {q0, q1}, [r0]! + vst1.8 {q2}, [r0], r1 +.endr + subs r6, r6, #1 + bne lpavg_48x64 + pop {r4, r6, r7} + bx lr +endfunc + +.macro pixel_avg_pp_64xN_neon h i +function x265_pixel_avg_pp_64x\h\()_neon + push {r4, r6, r7} + ldr r4, [sp, #12] + ldr r12, [sp, #16] + mov r7, #32 + mov r6, #\i + sub r3, r7 + sub r12, r7 + sub r1, r7 +lpavg_64x\h\(): +.rept 4 + vld1.8 {q0, q1}, [r2]! + vld1.8 {q2, q3}, [r2], r3 + vld1.8 {q8, q9}, [r4]! + vld1.8 {q10, q11}, [r4], r12 + vrhadd.u8 q0, q0, q8 + vrhadd.u8 q1, q1, q9 + vrhadd.u8 q2, q2, q10 + vrhadd.u8 q3, q3, q11 + vst1.8 {q0, q1}, [r0]! + vst1.8 {q2, q3}, [r0], r1 +.endr + subs r6, r6, #1 + bne lpavg_64x\h + pop {r4, r6, r7} + bx lr +endfunc +.endm + +pixel_avg_pp_64xN_neon 16 4 +pixel_avg_pp_64xN_neon 32 8 +pixel_avg_pp_64xN_neon 48 12 +pixel_avg_pp_64xN_neon 64 16 + +// void x265_cpy2Dto1D_shr_4x4_neon(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift) +function x265_cpy2Dto1D_shr_4x4_neon + add r2, r2 + vdup.16 q0, r3 + vceq.s16 q1, q1 + vshl.s16 q1, q0 + vsri.s16 q1, #1 + vneg.s16 q0, q0 + vld1.s16 {d4}, [r1], r2 + vld1.s16 {d5}, [r1], r2 + vld1.s16 {d6}, [r1], r2 + vld1.s16 {d7}, [r1], r2 + vsub.s16 q2, q1 + vsub.s16 q3, q1 + vshl.s16 q2, q0 + vshl.s16 q3, q0 + vst1.16 {q2-q3}, [r0] + bx lr +endfunc + +function x265_cpy2Dto1D_shr_8x8_neon + add r2, r2 + vdup.16 q0, r3 + vceq.s16 q1, q1 + vshl.s16 q1, q0 + vsri.s16 q1, #1 + vneg.s16 q0, q0 +.rept 4 + vld1.s16 {q2}, [r1], r2 + vld1.s16 {q3}, [r1], r2 + vsub.s16 q2, q1 + vsub.s16 q3, q1 + vshl.s16 q2, q0 + vshl.s16 q3, q0 + vst1.16 {q2-q3}, [r0]! +.endr + bx lr +endfunc + +function x265_cpy2Dto1D_shr_16x16_neon + add r2, r2 + vdup.16 q0, r3 + vceq.s16 q1, q1 + vshl.s16 q1, q0 + vsri.s16 q1, #1 + vneg.s16 q0, q0 + mov r3, #4 +.loop_cpy2Dto1D_shr_16: + subs r3, #1 +.rept 4 + vld1.s16 {q2-q3}, [r1], r2 + vsub.s16 q2, q1 + vsub.s16 q3, q1 + vshl.s16 q2, q0 + vshl.s16 q3, q0 + vst1.16 {q2-q3}, [r0]! +.endr + bgt .loop_cpy2Dto1D_shr_16 + bx lr +endfunc + +function x265_cpy2Dto1D_shr_32x32_neon + add r2, r2 + sub r2, #32 + vdup.16 q0, r3 + vceq.s16 q1, q1 + vshl.s16 q1, q0 + vsri.s16 q1, #1 + vneg.s16 q0, q0 + mov r3, 16 +.loop_cpy2Dto1D_shr_32: + subs r3, #1 +.rept 2 + vld1.s16 {q2-q3}, [r1]! + vld1.s16 {q8-q9}, [r1], r2 + vsub.s16 q2, q1 + vsub.s16 q3, q1 + vsub.s16 q8, q1 + vsub.s16 q9, q1 + vshl.s16 q2, q0 + vshl.s16 q3, q0 + vshl.s16 q8, q0 + vshl.s16 q9, q0 + vst1.16 {q2-q3}, [r0]! + vst1.16 {q8-q9}, [r0]! 
+.endr + bgt .loop_cpy2Dto1D_shr_32 + bx lr +endfunc + +// void addAvg(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) +.macro addAvg_8xN h i +function x265_addAvg_8x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + mov r12, #\i + vmov.i16 d0, #16448 + +loop_addavg_8x\h: + subs r12, #1 + vld1.16 {q1}, [r0], r3 // src1 + vld1.16 {q2}, [r1], r4 // src2 + vld1.16 {q10}, [r0], r3 // src1 + vld1.16 {q11}, [r1], r4 // src2 + + vadd.s16 q1, q2 + vaddl.s16 q8, d2, d0 + vaddl.s16 q9, d3, d0 + vadd.s16 q10, q11 + vaddl.s16 q1, d20, d0 + vaddl.s16 q2, d21, d0 + + vshrn.s32 d20, q8, #7 + vshrn.s32 d21, q9, #7 + vshrn.s32 d22, q1, #7 + vshrn.s32 d23, q2, #7 + + vqmovun.s16 d2, q10 + vqmovun.s16 d3, q11 + vst1.8 {d2}, [r2], r5 + vst1.8 {d3}, [r2], r5 + + bne loop_addavg_8x\h + pop {r4, r5, r6} + bx lr +endfunc +.endm + +addAvg_8xN 4 2 +addAvg_8xN 8 4 +addAvg_8xN 16 8 +addAvg_8xN 32 16 +addAvg_8xN 2 1 +addAvg_8xN 6 3 +addAvg_8xN 12 6 +addAvg_8xN 64 32 + +function x265_addAvg_4x4_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + vmov.i16 d0, #16448 + +.rept 2 + vld1.16 {d2}, [r0], r3 // src1 + vld1.16 {d4}, [r0], r3 + vld1.16 {d3}, [r1], r4 // src2 + vld1.16 {d5}, [r1], r4 + + vadd.s16 d2, d3 + vadd.s16 d4, d5 + vaddl.s16 q8, d2, d0 + vaddl.s16 q9, d4, d0 + vshrn.s32 d20, q8, #7 + vshrn.s32 d21, q9, #7 + vqmovun.s16 d2, q10 + + vst1.32 {d2[0]}, [r2], r5 + vst1.32 {d2[1]}, [r2], r5 +.endr + pop {r4, r5, r6} + bx lr +endfunc + +.macro addAvg_4xN h i +function x265_addAvg_4x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + mov r12, #\i + vmov.i16 d0, #16448 + +loop_addavg_4x\h\(): + subs r12, #1 + vld1.16 {d2}, [r0], r3 // src1 + vld1.16 {d4}, [r0], r3 + vld1.16 {d3}, [r1], r4 // src2 + vld1.16 {d5}, [r1], r4 + + vadd.s16 d2, d3 + vadd.s16 d4, d5 + vaddl.s16 q8, d2, d0 + vaddl.s16 q9, d4, d0 + vshrn.s32 d20, q8, #7 + vshrn.s32 d21, q9, #7 + vqmovun.s16 d2, q10 + + vst1.32 {d2[0]}, [r2], r5 + vst1.32 {d2[1]}, [r2], r5 + bne loop_addavg_4x\h + pop {r4, r5, r6} + bx lr +endfunc +.endm + +addAvg_4xN 8 4 +addAvg_4xN 16 8 +addAvg_4xN 2 1 +addAvg_4xN 32 16 + +.macro addAvg_6xN h i +function x265_addAvg_6x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + sub r5, #4 + mov r12, #\i + vmov.i16 d0, #16448 + +loop_addavg_6x\h: + subs r12, #1 + vld1.16 {q1}, [r0], r3 // src1 + vld1.16 {q2}, [r1], r4 // src2 + vld1.16 {q10}, [r0], r3 // src1 + vld1.16 {q11}, [r1], r4 // src2 + + vadd.s16 q1, q2 + vaddl.s16 q8, d2, d0 + vaddl.s16 q9, d3, d0 + vadd.s16 q10, q11 + vaddl.s16 q1, d20, d0 + vaddl.s16 q2, d21, d0 + + vshrn.s32 d20, q8, #7 + vshrn.s32 d21, q9, #7 + vshrn.s32 d22, q1, #7 + vshrn.s32 d23, q2, #7 + + vqmovun.s16 d2, q10 + vqmovun.s16 d3, q11 + vst1.32 {d2[0]}, [r2]! + vst1.16 {d2[2]}, [r2], r5 + vst1.32 {d3[0]}, [r2]! 
+ vst1.16 {d3[2]}, [r2], r5 + + bne loop_addavg_6x\h + pop {r4, r5, r6} + bx lr +endfunc +.endm + +addAvg_6xN 8 4 +addAvg_6xN 16 8 + +function x265_addAvg_12x16_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + sub r5, #8 + mov r12, #16 + vmov.i16 d0, #16448 + +loop_addAvg_12X16: + subs r12, #1 + vld1.16 {d2, d3, d4}, [r0], r3 + vld1.16 {d16, d17, d18}, [r1], r4 + + vadd.s16 q1, q8 + vaddl.s16 q11, d2, d0 + vaddl.s16 q10, d3, d0 + vadd.s16 d4, d18 + vaddl.s16 q9, d0, d4 + + vshrn.s32 d2, q11, #7 + vshrn.s32 d3, q10, #7 + vshrn.s32 d4, q9, #7 + veor d5, d5 + + vqmovun.s16 d6, q1 + vqmovun.s16 d7, q2 + vst1.8 {d6}, [r2]! + vst1.32 {d7[0]}, [r2], r5 + + bne loop_addAvg_12X16 + pop {r4, r5, r6} + bx lr +endfunc + +function x265_addAvg_12x32_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + sub r5, #8 + mov r12, #32 + vmov.i16 d0, #16448 + +loop_addAvg_12X32: + subs r12, #1 + vld1.16 {d2, d3, d4}, [r0], r3 + vld1.16 {d16, d17, d18}, [r1], r4 + + vadd.s16 q1, q8 + vaddl.s16 q11, d2, d0 + vaddl.s16 q10, d3, d0 + vadd.s16 d4, d18 + vaddl.s16 q9, d0, d4 + + vshrn.s32 d2, q11, #7 + vshrn.s32 d3, q10, #7 + vshrn.s32 d4, q9, #7 + veor d5, d5 + + vqmovun.s16 d6, q1 + vqmovun.s16 d7, q2 + vst1.8 {d6}, [r2]! + vst1.32 {d7[0]}, [r2], r5 + + bne loop_addAvg_12X32 + pop {r4, r5, r6} + bx lr +endfunc + +.macro addAvg_16xN h +function x265_addAvg_16x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + mov r12, #\h + vmov.i16 d0, #16448 + +loop_addavg_16x\h: + subs r12, #1 + vld1.16 {q1, q2}, [r0], r3 // src1 + vld1.16 {q8, q9}, [r1], r4 // src2 + + vadd.s16 q1, q8 + vaddl.s16 q10, d2, d0 + vaddl.s16 q11, d3, d0 + vadd.s16 q2, q9 + vaddl.s16 q8, d4, d0 + vaddl.s16 q9, d5, d0 + + vshrn.s32 d2, q10, #7 + vshrn.s32 d3, q11, #7 + vshrn.s32 d4, q8, #7 + vshrn.s32 d5, q9, #7 + + vqmovun.s16 d6, q1 + vqmovun.s16 d7, q2 + vst1.8 {q3}, [r2], r5 + + bne loop_addavg_16x\h + pop {r4, r5, r6} + bx lr +endfunc +.endm + +addAvg_16xN 4 +addAvg_16xN 8 +addAvg_16xN 12 +addAvg_16xN 16 +addAvg_16xN 32 +addAvg_16xN 64 +addAvg_16xN 24 + +function x265_addAvg_24x32_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + mov r12, #32 + vmov.i16 d0, #16448 + +loop_addavg_24x32: + subs r12, #1 + vld1.16 {q1, q2}, [r0]! // src1 + vld1.16 {q3}, [r0], r3 + vld1.16 {q8, q9}, [r1]! // src2 + vld1.16 {q10}, [r1], r4 + + vadd.s16 q1, q8 + vaddl.s16 q12, d2, d0 + vaddl.s16 q13, d3, d0 + vadd.s16 q2, q9 + vaddl.s16 q8, d4, d0 + vaddl.s16 q9, d5, d0 + vadd.s16 q3, q10 + vaddl.s16 q10, d6, d0 + vaddl.s16 q11, d7, d0 + + vshrn.s32 d2, q12, #7 + vshrn.s32 d3, q13, #7 + vshrn.s32 d4, q8, #7 + vshrn.s32 d5, q9, #7 + vshrn.s32 d6, q10, #7 + vshrn.s32 d7, q11, #7 + + vqmovun.s16 d16, q1 + vqmovun.s16 d17, q2 + vqmovun.s16 d18, q3 + vst1.8 {d16, d17, d18}, [r2], r5 + bne loop_addavg_24x32 + + pop {r4, r5, r6} + bx lr +endfunc + +function x265_addAvg_24x64_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + mov r12, #64 + vmov.i16 d0, #16448 + +loop_addavg_24x64: + subs r12, #1 + vld1.16 {q1, q2}, [r0]! // src1 + vld1.16 {q3}, [r0], r3 + vld1.16 {q8, q9}, [r1]! 
// src2 + vld1.16 {q10}, [r1], r4 + + vadd.s16 q1, q8 + vaddl.s16 q12, d2, d0 + vaddl.s16 q13, d3, d0 + vadd.s16 q2, q9 + vaddl.s16 q8, d4, d0 + vaddl.s16 q9, d5, d0 + vadd.s16 q3, q10 + vaddl.s16 q10, d6, d0 + vaddl.s16 q11, d7, d0 + + vshrn.s32 d2, q12, #7 + vshrn.s32 d3, q13, #7 + vshrn.s32 d4, q8, #7 + vshrn.s32 d5, q9, #7 + vshrn.s32 d6, q10, #7 + vshrn.s32 d7, q11, #7 + + vqmovun.s16 d16, q1 + vqmovun.s16 d17, q2 + vqmovun.s16 d18, q3 + vst1.8 {d16, d17, d18}, [r2], r5 + bne loop_addavg_24x64 + + pop {r4, r5, r6} + bx lr +endfunc + +.macro addAvg32 x y z + mov r12, #\y +loop_addavg_\x\()x\y\()_\z: + subs r12, #1 + vld1.16 {q8, q9}, [r0]! // src1 + vld1.16 {q10, q11}, [r0], r3 + vld1.16 {q12, q13}, [r1]! // src2 + vld1.16 {q14, q15}, [r1], r4 + + vadd.s16 q8, q12 + vaddl.s16 q1, d16, d0 + vaddl.s16 q2, d17, d0 + vadd.s16 q9, q13 + vaddl.s16 q12, d18, d0 + vaddl.s16 q13, d19, d0 + + vshrn.s32 d6, q1, #7 + vshrn.s32 d7, q2, #7 + vshrn.s32 d2, q12, #7 + vshrn.s32 d3, q13, #7 + vqmovun.s16 d16, q3 + vqmovun.s16 d17, q1 + + vadd.s16 q10, q14 + vaddl.s16 q1, d20, d0 + vaddl.s16 q2, d21, d0 + vadd.s16 q11, q15 + vaddl.s16 q12, d22, d0 + vaddl.s16 q13, d23, d0 + + vshrn.s32 d6, q1, #7 + vshrn.s32 d7, q2, #7 + vshrn.s32 d2, q12, #7 + vshrn.s32 d3, q13, #7 + vqmovun.s16 d18, q3 + vqmovun.s16 d19, q1 + vst1.8 {q8, q9}, [r2], r5 + bne loop_addavg_\x\()x\y\()_\z +.endm + +.macro addAvg_32xN h +function x265_addAvg_32x\h\()_neon + push {r4, r5, r6} + ldr r4, [sp, #12] + ldr r5, [sp, #16] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + vmov.i16 d0, #16448 + + addAvg32 32 \h 1 + pop {r4, r5, r6} + bx lr +endfunc +.endm + +addAvg_32xN 8 +addAvg_32xN 16 +addAvg_32xN 24 +addAvg_32xN 32 +addAvg_32xN 64 +addAvg_32xN 48 + +function x265_addAvg_48x64_neon + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #20] + ldr r5, [sp, #24] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + vmov.i16 d0, #16448 + mov r7, r0 + mov r8, r1 + + addAvg32 48 64 1 // 32x64 + add r0, r7, #64 + add r1, r8, #64 + sub r2, r2, r5, lsl #6 + add r2, #32 + add r3, #32 + add r4, #32 + + mov r12, #64 +loop_addavg_16x64_2: // 16x64 + subs r12, #1 + vld1.16 {q1, q2}, [r0], r3 // src1 + vld1.16 {q8, q9}, [r1], r4 // src2 + + vadd.s16 q1, q8 + vaddl.s16 q10, d2, d0 + vaddl.s16 q11, d3, d0 + vadd.s16 q2, q9 + vaddl.s16 q8, d4, d0 + vaddl.s16 q9, d5, d0 + + vshrn.s32 d2, q10, #7 + vshrn.s32 d3, q11, #7 + vshrn.s32 d4, q8, #7 + vshrn.s32 d5, q9, #7 + + vqmovun.s16 d6, q1 + vqmovun.s16 d7, q2 + vst1.8 {q3}, [r2], r5 + bne loop_addavg_16x64_2 + + pop {r4, r5, r6, r7, r8} + bx lr +endfunc + +function x265_addAvg_64x16_neon + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #20] + ldr r5, [sp, #24] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + vmov.i16 d0, #16448 + mov r7, r0 + mov r8, r1 + + addAvg32 64 16 1 + add r0, r7, #64 + add r1, r8, #64 + sub r2, r2, r5, lsl #4 + add r2, #32 + addAvg32 64 16 2 + + pop {r4, r5, r6, r7, r8} + bx lr +endfunc + +function x265_addAvg_64x32_neon + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #20] + ldr r5, [sp, #24] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + vmov.i16 d0, #16448 + mov r7, r0 + mov r8, r1 + + addAvg32 64 32 1 + add r0, r7, #64 + add r1, r8, #64 + sub r2, r2, r5, lsl #5 + add r2, #32 + addAvg32 64 32 2 + + pop {r4, r5, r6, r7, r8} + bx lr +endfunc + +function x265_addAvg_64x48_neon + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #20] + ldr r5, [sp, #24] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + vmov.i16 d0, #16448 + mov r7, r0 + mov r8, r1 + + addAvg32 64 48 1 + add r0, 
r7, #64 + add r1, r8, #64 + sub r2, r2, r5, lsl #5 + sub r2, r2, r5, lsl #4 + add r2, #32 + addAvg32 64 48 2 + + pop {r4, r5, r6, r7, r8} + bx lr +endfunc + +function x265_addAvg_64x64_neon + push {r4, r5, r6, r7, r8} + ldr r4, [sp, #20] + ldr r5, [sp, #24] + lsl r3, #1 + lsl r4, #1 + sub r3, #32 + sub r4, #32 + vmov.i16 d0, #16448 + mov r7, r0 + mov r8, r1 + + addAvg32 64 64 1 + add r0, r7, #64 + add r1, r8, #64 + sub r2, r2, r5, lsl #6 + add r2, #32 + addAvg32 64 64 2 + + pop {r4, r5, r6, r7, r8} + bx lr +endfunc
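A note for readers following the arithmetic above: every addAvg variant loads the halfword constant 16448 into d0 and narrows with a 7-bit shift (vshrn ... #7). At 8-bit depth, x265 stores motion-compensated intermediates at 14-bit precision with an offset of 8192 subtracted per sample, so the constant adds back the two offsets plus the rounding term for the shift: 16448 = 2 * 8192 + (1 << 6). A minimal C sketch of the operation these routines implement (the function name, stride conventions, and spelled-out clamping are illustrative, not the exact x265 reference code):

    #include <stdint.h>

    typedef uint8_t pixel;   /* 8-bit build assumed */

    /* dst[x] = clip(0..255, (src0[x] + src1[x] + 16448) >> 7) */
    static void addAvg_ref(const int16_t* src0, const int16_t* src1, pixel* dst,
                           intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                           int bx, int by)
    {
        for (int y = 0; y < by; y++)
        {
            for (int x = 0; x < bx; x++)
            {
                int v = (src0[x] + src1[x] + 16448) >> 7;          /* vaddl.s16 + vshrn.s32 #7 */
                dst[x] = (pixel)(v < 0 ? 0 : (v > 255 ? 255 : v)); /* vqmovun.s16 */
            }
            src0 += src0Stride;
            src1 += src1Stride;
            dst  += dstStride;
        }
    }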
View file
x265_2.0.tar.gz/source/common/arm/mc.h
Added
@@ -0,0 +1,27 @@
+/*****************************************************************************
+ * Copyright (C) 2016 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_MC_ARM_H
+#define X265_MC_ARM_H
+
+#endif // ifndef X265_MC_ARM_H
View file
x265_2.0.tar.gz/source/common/arm/pixel-util.S
Added
@@ -0,0 +1,2451 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Dnyaneshwar G <dnyaneshwar@multicorewareinc.com> + * Radhakrishnan VR <radhakrishnan@multicorewareinc.com> + * Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata + +.align 4 + + +.text + +.macro VAR_SQR_SUM qsqr_sum, qsqr_last, qsqr_temp, dsrc, num=0, vpadal=vpadal.u16 + vmull.u8 \qsqr_temp, \dsrc, \dsrc + vaddw.u8 q\num, q\num, \dsrc + \vpadal \qsqr_sum, \qsqr_last +.endm + +function x265_pixel_var_8x8_neon + vld1.u8 {d16}, [r0], r1 + vmull.u8 q1, d16, d16 + vmovl.u8 q0, d16 + vld1.u8 {d18}, [r0], r1 + vmull.u8 q2, d18, d18 + vaddw.u8 q0, q0, d18 + + vld1.u8 {d20}, [r0], r1 + VAR_SQR_SUM q1, q1, q3, d20, 0, vpaddl.u16 + vld1.u8 {d22}, [r0], r1 + VAR_SQR_SUM q2, q2, q8, d22, 0, vpaddl.u16 + + vld1.u8 {d24}, [r0], r1 + VAR_SQR_SUM q1, q3, q9, d24 + vld1.u8 {d26}, [r0], r1 + VAR_SQR_SUM q2, q8, q10, d26 + vld1.u8 {d24}, [r0], r1 + VAR_SQR_SUM q1, q9, q14, d24 + vld1.u8 {d26}, [r0], r1 + VAR_SQR_SUM q2, q10, q15, d26 + + vpaddl.u16 q8, q14 + vpaddl.u16 q9, q15 + vadd.u32 q1, q1, q8 + vadd.u16 d0, d0, d1 + vadd.u32 q1, q1, q9 + vadd.u32 q1, q1, q2 + vpaddl.u16 d0, d0 + vadd.u32 d2, d2, d3 + vpadd.u32 d0, d0, d2 + + vmov r0, r1, d0 + bx lr +endfunc + +function x265_pixel_var_16x16_neon + veor.u8 q0, q0 + veor.u8 q1, q1 + veor.u8 q2, q2 + veor.u8 q14, q14 + veor.u8 q15, q15 + mov ip, #4 + +.var16_loop: + subs ip, ip, #1 + vld1.u8 {q8}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + + vld1.u8 {q9}, [r0], r1 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + + vld1.u8 {q9}, [r0], r1 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + bgt .var16_loop + + vpaddl.u16 q8, q14 + vpaddl.u16 q9, q15 + vadd.u32 q1, q1, q8 + vadd.u16 d0, d0, d1 + vadd.u32 q1, q1, q9 + vadd.u32 q1, q1, q2 + vpaddl.u16 d0, d0 + vadd.u32 d2, d2, d3 + vpadd.u32 d0, d0, d2 + + vmov r0, r1, d0 + bx lr +endfunc + +function x265_pixel_var_32x32_neon + veor.u8 q0, q0 + veor.u8 q1, q1 + veor.u8 q2, q2 + veor.u8 q14, q14 + veor.u8 q15, q15 + mov ip, #8 + +.var32_loop: + subs ip, ip, #1 + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, 
q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + bgt .var32_loop + + vpaddl.u16 q8, q14 + vpaddl.u16 q9, q15 + vadd.u32 q1, q1, q8 + vadd.u16 d0, d0, d1 + vadd.u32 q1, q1, q9 + vadd.u32 q1, q1, q2 + vpaddl.u16 d0, d0 + vadd.u32 d2, d2, d3 + vpadd.u32 d0, d0, d2 + + vmov r0, r1, d0 + bx lr +endfunc + +function x265_pixel_var_64x64_neon + sub r1, #32 + veor.u8 q0, q0 + veor.u8 q1, q1 + veor.u8 q2, q2 + veor.u8 q3, q3 + veor.u8 q14, q14 + veor.u8 q15, q15 + mov ip, #16 + +.var64_loop: + subs ip, ip, #1 + vld1.u8 {q8-q9}, [r0]! + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16, 3 + VAR_SQR_SUM q2, q15, q13, d17, 3 + VAR_SQR_SUM q1, q12, q14, d18, 3 + VAR_SQR_SUM q2, q13, q15, d19, 3 + + vld1.u8 {q8-q9}, [r0]! + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16, 3 + VAR_SQR_SUM q2, q15, q13, d17, 3 + VAR_SQR_SUM q1, q12, q14, d18, 3 + VAR_SQR_SUM q2, q13, q15, d19, 3 + + vld1.u8 {q8-q9}, [r0]! + VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16, 3 + VAR_SQR_SUM q2, q15, q13, d17, 3 + VAR_SQR_SUM q1, q12, q14, d18, 3 + VAR_SQR_SUM q2, q13, q15, d19, 3 + + vld1.u8 {q8-q9}, [r0]! 
+ VAR_SQR_SUM q1, q14, q12, d16 + VAR_SQR_SUM q2, q15, q13, d17 + VAR_SQR_SUM q1, q12, q14, d18 + VAR_SQR_SUM q2, q13, q15, d19 + + vld1.u8 {q8-q9}, [r0], r1 + VAR_SQR_SUM q1, q14, q12, d16, 3 + VAR_SQR_SUM q2, q15, q13, d17, 3 + VAR_SQR_SUM q1, q12, q14, d18, 3 + VAR_SQR_SUM q2, q13, q15, d19, 3 + bgt .var64_loop + + vpaddl.u16 q8, q14 + vpaddl.u16 q9, q15 + vadd.u32 q1, q1, q8 + vadd.u32 q1, q1, q9 + vadd.u32 q1, q1, q2 + vpaddl.u16 d0, d0 + vpaddl.u16 d1, d1 + vpaddl.u16 d6, d6 + vpaddl.u16 d7, d7 + vadd.u32 d0, d1 + vadd.u32 d6, d7 + vadd.u32 d0, d6 + vadd.u32 d2, d2, d3 + vpadd.u32 d0, d0, d2 + + vmov r0, r1, d0 + bx lr +endfunc + +/* void getResidual4_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); + * r0 - fenc + * r1 - pred + * r2 - residual + * r3 - Stride */ +function x265_getResidual4_neon + lsl r12, r3, #1 +.rept 2 + vld1.u8 {d0}, [r0], r3 + vld1.u8 {d1}, [r1], r3 + vld1.u8 {d2}, [r0], r3 + vld1.u8 {d3}, [r1], r3 + vsubl.u8 q2, d0, d1 + vsubl.u8 q3, d2, d3 + vst1.s16 {d4}, [r2], r12 + vst1.s16 {d6}, [r2], r12 +.endr + bx lr +endfunc + +function x265_getResidual8_neon + lsl r12, r3, #1 +.rept 4 + vld1.u8 {d0}, [r0], r3 + vld1.u8 {d1}, [r1], r3 + vld1.u8 {d2}, [r0], r3 + vld1.u8 {d3}, [r1], r3 + vsubl.u8 q2, d0, d1 + vsubl.u8 q3, d2, d3 + vst1.s16 {q2}, [r2], r12 + vst1.s16 {q3}, [r2], r12 +.endr + bx lr +endfunc + +function x265_getResidual16_neon + lsl r12, r3, #1 +.rept 8 + vld1.u8 {d0, d1}, [r0], r3 + vld1.u8 {d2, d3}, [r1], r3 + vld1.u8 {d4, d5}, [r0], r3 + vld1.u8 {d6, d7}, [r1], r3 + vsubl.u8 q8, d0, d2 + vsubl.u8 q9, d1, d3 + vsubl.u8 q10, d4, d6 + vsubl.u8 q11, d5, d7 + vst1.s16 {q8, q9}, [r2], r12 + vst1.s16 {q10, q11}, [r2], r12 +.endr + bx lr +endfunc + +function x265_getResidual32_neon + push {r4} + lsl r12, r3, #1 + sub r12, #32 + mov r4, #4 +loop_res32: + subs r4, r4, #1 +.rept 8 + vld1.u8 {q0, q1}, [r0], r3 + vld1.u8 {q2, q3}, [r1], r3 + vsubl.u8 q8, d0, d4 + vsubl.u8 q9, d1, d5 + vsubl.u8 q10, d2, d6 + vsubl.u8 q11, d3, d7 + vst1.s16 {q8, q9}, [r2]! 
+ vst1.s16 {q10, q11}, [r2], r12 +.endr + bne loop_res32 + pop {r4} + bx lr +endfunc + +// void pixel_sub_ps_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1) +function x265_pixel_sub_ps_4x4_neon + push {r4} + lsl r1, r1, #1 + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept 2 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r3], r12 + vld1.u8 {d2}, [r2], r4 + vld1.u8 {d3}, [r3], r12 + vsubl.u8 q2, d0, d1 + vsubl.u8 q3, d2, d3 + vst1.s16 {d4}, [r0], r1 + vst1.s16 {d6}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +function x265_pixel_sub_ps_8x8_neon + push {r4} + lsl r1, r1, #1 + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept 4 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r3], r12 + vld1.u8 {d2}, [r2], r4 + vld1.u8 {d3}, [r3], r12 + vsubl.u8 q2, d0, d1 + vsubl.u8 q3, d2, d3 + vst1.s16 {q2}, [r0], r1 + vst1.s16 {q3}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +function x265_pixel_sub_ps_16x16_neon + push {r4, r5} + lsl r1, r1, #1 + ldr r4, [sp, #8] + ldr r12, [sp, #12] + mov r5, #2 +loop_sub16: + subs r5, r5, #1 +.rept 4 + vld1.u8 {q0}, [r2], r4 + vld1.u8 {q1}, [r3], r12 + vld1.u8 {q2}, [r2], r4 + vld1.u8 {q3}, [r3], r12 + vsubl.u8 q8, d0, d2 + vsubl.u8 q9, d1, d3 + vsubl.u8 q10, d4, d6 + vsubl.u8 q11, d5, d7 + vst1.s16 {q8, q9}, [r0], r1 + vst1.s16 {q10, q11}, [r0], r1 +.endr + bne loop_sub16 + pop {r4, r5} + bx lr +endfunc + +function x265_pixel_sub_ps_32x32_neon + push {r4, r5} + lsl r1, r1, #1 + ldr r4, [sp, #8] + ldr r12, [sp, #12] + sub r1, #32 + mov r5, #8 +loop_sub32: + subs r5, r5, #1 +.rept 4 + vld1.u8 {q0, q1}, [r2], r4 + vld1.u8 {q2, q3}, [r3], r12 + vsubl.u8 q8, d0, d4 + vsubl.u8 q9, d1, d5 + vsubl.u8 q10, d2, d6 + vsubl.u8 q11, d3, d7 + vst1.s16 {q8, q9}, [r0]! + vst1.s16 {q10, q11}, [r0], r1 +.endr + bne loop_sub32 + pop {r4, r5} + bx lr +endfunc + +function x265_pixel_sub_ps_64x64_neon + push {r4, r5} + lsl r1, r1, #1 + ldr r4, [sp, #8] + ldr r12, [sp, #12] + sub r1, #96 + sub r4, #32 + sub r12, #32 + mov r5, #32 +loop_sub64: + subs r5, r5, #1 +.rept 2 + vld1.u8 {q0, q1}, [r2]! + vld1.u8 {q2, q3}, [r2], r4 + vld1.u8 {q8, q9}, [r3]! + vld1.u8 {q10, q11}, [r3], r12 + vsubl.u8 q12, d0, d16 + vsubl.u8 q13, d1, d17 + vsubl.u8 q14, d2, d18 + vsubl.u8 q15, d3, d19 + vsubl.u8 q0, d4, d20 + vsubl.u8 q1, d5, d21 + vsubl.u8 q2, d6, d22 + vsubl.u8 q3, d7, d23 + vst1.s16 {q12, q13}, [r0]! + vst1.s16 {q14, q15}, [r0]! + vst1.s16 {q0, q1}, [r0]! 
+ vst1.s16 {q2, q3}, [r0], r1 +.endr + bne loop_sub64 + pop {r4, r5} + bx lr +endfunc + +// chroma sub_ps +function x265_pixel_sub_ps_4x8_neon + push {r4} + lsl r1, r1, #1 + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept 4 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r3], r12 + vld1.u8 {d2}, [r2], r4 + vld1.u8 {d3}, [r3], r12 + vsubl.u8 q2, d0, d1 + vsubl.u8 q3, d2, d3 + vst1.s16 {d4}, [r0], r1 + vst1.s16 {d6}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +function x265_pixel_sub_ps_8x16_neon + push {r4} + lsl r1, r1, #1 + ldr r4, [sp, #4] + ldr r12, [sp, #8] +.rept 8 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r3], r12 + vld1.u8 {d2}, [r2], r4 + vld1.u8 {d3}, [r3], r12 + vsubl.u8 q2, d0, d1 + vsubl.u8 q3, d2, d3 + vst1.s16 {q2}, [r0], r1 + vst1.s16 {q3}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +function x265_pixel_sub_ps_16x32_neon + push {r4, r5} + lsl r1, r1, #1 + ldr r4, [sp, #8] + ldr r12, [sp, #12] + mov r5, #4 +loop_sub_16x32: + subs r5, r5, #1 +.rept 4 + vld1.u8 {q0}, [r2], r4 + vld1.u8 {q1}, [r3], r12 + vld1.u8 {q2}, [r2], r4 + vld1.u8 {q3}, [r3], r12 + vsubl.u8 q8, d0, d2 + vsubl.u8 q9, d1, d3 + vsubl.u8 q10, d4, d6 + vsubl.u8 q11, d5, d7 + vst1.s16 {q8, q9}, [r0], r1 + vst1.s16 {q10, q11}, [r0], r1 +.endr + bne loop_sub_16x32 + pop {r4, r5} + bx lr +endfunc + +function x265_pixel_sub_ps_32x64_neon + push {r4, r5} + lsl r1, r1, #1 + ldr r4, [sp, #8] + ldr r12, [sp, #12] + sub r1, #32 + mov r5, #16 +loop_sub_32x64: + subs r5, r5, #1 +.rept 4 + vld1.u8 {q0, q1}, [r2], r4 + vld1.u8 {q2, q3}, [r3], r12 + vsubl.u8 q8, d0, d4 + vsubl.u8 q9, d1, d5 + vsubl.u8 q10, d2, d6 + vsubl.u8 q11, d3, d7 + vst1.s16 {q8, q9}, [r0]! + vst1.s16 {q10, q11}, [r0], r1 +.endr + bne loop_sub_32x64 + pop {r4, r5} + bx lr +endfunc + +// void x265_pixel_add_ps_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +function x265_pixel_add_ps_4x4_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] + lsl r12, #1 + vmov.u16 q10, #255 + veor.u16 q11, q11 + veor.u16 d3, d3 + veor.u16 d5, d5 +.rept 2 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r2], r4 + vld1.s16 {d2}, [r3], r12 + vld1.s16 {d4}, [r3], r12 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vadd.s16 q1, q1, q8 + vadd.s16 q2, q2, q9 + vqmovun.s16 d0, q1 + vqmovun.s16 d1, q2 + vst1.32 {d0[0]}, [r0], r1 + vst1.32 {d1[0]}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +function x265_pixel_add_ps_8x8_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] + lsl r12, #1 + vmov.u16 q10, #255 + veor.u16 q11, q11 +.rept 4 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r2], r4 + vld1.s16 {q8}, [r3], r12 + vld1.s16 {q9}, [r3], r12 + vmovl.u8 q1, d0 + vmovl.u8 q2, d1 + vadd.s16 q1, q1, q8 + vadd.s16 q2, q2, q9 + vqmovun.s16 d0, q1 + vqmovun.s16 d1, q2 + vst1.8 {d0}, [r0], r1 + vst1.8 {d1}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +.macro pixel_add_ps_16xN_neon h i +function x265_pixel_add_ps_16x\h\()_neon + push {r4, r5} + ldr r4, [sp, #8] + ldr r12, [sp, #12] + lsl r12, #1 + vmov.u16 q10, #255 + veor.u16 q11, q11 + mov r5, #\i +loop_addps_16x\h\(): + subs r5, #1 +.rept 4 + vld1.u8 {q0}, [r2], r4 + vld1.u8 {q1}, [r2], r4 + vld1.s16 {q8, q9}, [r3], r12 + vld1.s16 {q12, q13}, [r3], r12 + + vmovl.u8 q2, d0 + vmovl.u8 q3, d1 + vmovl.u8 q0, d2 + vmovl.u8 q1, d3 + + vadd.s16 q2, q2, q8 + vadd.s16 q3, q3, q9 + vadd.s16 q0, q0, q12 + vadd.s16 q1, q1, q13 + + vqmovun.s16 d4, q2 + vqmovun.s16 d5, q3 + vqmovun.s16 d0, q0 + vqmovun.s16 d1, q1 + vst1.8 {d4, d5}, [r0], r1 + vst1.8 {d0, d1}, [r0], r1 +.endr + bne loop_addps_16x\h + pop 
{r4, r5} + bx lr +endfunc +.endm + +pixel_add_ps_16xN_neon 16 2 +pixel_add_ps_16xN_neon 32 4 + +.macro pixel_add_ps_32xN_neon h i + function x265_pixel_add_ps_32x\h\()_neon + push {r4, r5} + ldr r4, [sp, #8] + ldr r12, [sp, #12] + lsl r12, #1 + vmov.u16 q10, #255 + veor.u16 q11, q11 + mov r5, #\i + sub r12, #32 +loop_addps_32x\h\(): + subs r5, #1 +.rept 4 + vld1.u8 {q0, q1}, [r2], r4 + vld1.s16 {q8, q9}, [r3]! + vld1.s16 {q12, q13}, [r3], r12 + + vmovl.u8 q2, d0 + vmovl.u8 q3, d1 + vmovl.u8 q14, d2 + vmovl.u8 q15, d3 + + vadd.s16 q2, q2, q8 + vadd.s16 q3, q3, q9 + vadd.s16 q14, q14, q12 + vadd.s16 q15, q15, q13 + + vqmovun.s16 d0, q2 + vqmovun.s16 d1, q3 + vqmovun.s16 d2, q14 + vqmovun.s16 d3, q15 + vst1.8 {q0, q1}, [r0], r1 +.endr + bne loop_addps_32x\h + pop {r4, r5} + bx lr +endfunc +.endm + +pixel_add_ps_32xN_neon 32 8 +pixel_add_ps_32xN_neon 64 16 + +function x265_pixel_add_ps_64x64_neon + push {r4, r5} + vpush {q4, q5, q6, q7} + ldr r4, [sp, #72] + ldr r12, [sp, #76] + lsl r12, #1 + vmov.u16 q2, #255 + veor.u16 q3, q3 + mov r5, #32 + sub r1, #32 + sub r4, #32 + sub r12, #96 +loop_addps64: + subs r5, #1 +.rept 2 + vld1.u8 {q0, q1}, [r2]! + vld1.s16 {q8, q9}, [r3]! + vld1.s16 {q10, q11}, [r3]! + vld1.s16 {q12, q13}, [r3]! + vld1.s16 {q14, q15}, [r3], r12 + + vmovl.u8 q4, d0 + vmovl.u8 q5, d1 + vmovl.u8 q6, d2 + vmovl.u8 q7, d3 + + vadd.s16 q4, q4, q8 + vadd.s16 q5, q5, q9 + vadd.s16 q6, q6, q10 + vadd.s16 q7, q7, q11 + + vqmovun.s16 d0, q4 + vqmovun.s16 d1, q5 + vqmovun.s16 d2, q6 + vqmovun.s16 d3, q7 + + vst1.u8 {q0, q1}, [r0]! + vld1.u8 {q0, q1}, [r2], r4 + vmovl.u8 q4, d0 + vmovl.u8 q5, d1 + vmovl.u8 q6, d2 + vmovl.u8 q7, d3 + + vadd.s16 q4, q4, q12 + vadd.s16 q5, q5, q13 + vadd.s16 q6, q6, q14 + vadd.s16 q7, q7, q15 + + vqmovun.s16 d0, q4 + vqmovun.s16 d1, q5 + vqmovun.s16 d2, q6 + vqmovun.s16 d3, q7 + vst1.u8 {q0, q1}, [r0], r1 +.endr + bne loop_addps64 + vpop {q4, q5, q6, q7} + pop {r4, r5} + bx lr +endfunc + +// Chroma add_ps +function x265_pixel_add_ps_4x8_neon + push {r4} + ldr r4, [sp, #4] + ldr r12, [sp, #8] + lsl r12, #1 + vmov.u16 q10, #255 + veor.u16 q11, q11 + veor.u16 d3, d3 + veor.u16 d5, d5 +.rept 4 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r2], r4 + vld1.s16 {d2}, [r3], r12 + vld1.s16 {d4}, [r3], r12 + vmovl.u8 q8, d0 + vmovl.u8 q9, d1 + vadd.s16 q1, q1, q8 + vadd.s16 q2, q2, q9 + vqmovun.s16 d0, q1 + vqmovun.s16 d1, q2 + vst1.32 {d0[0]}, [r0], r1 + vst1.32 {d1[0]}, [r0], r1 +.endr + pop {r4} + bx lr +endfunc + +function x265_pixel_add_ps_8x16_neon + push {r4, r5} + ldr r4, [sp, #8] + ldr r12, [sp, #12] + lsl r12, #1 + vmov.u16 q10, #255 + veor.u16 q11, q11 + mov r5, #2 +loop_add_8x16: + subs r5, #1 +.rept 4 + vld1.u8 {d0}, [r2], r4 + vld1.u8 {d1}, [r2], r4 + vld1.s16 {q8}, [r3], r12 + vld1.s16 {q9}, [r3], r12 + vmovl.u8 q1, d0 + vmovl.u8 q2, d1 + vadd.s16 q1, q1, q8 + vadd.s16 q2, q2, q9 + vqmovun.s16 d0, q1 + vqmovun.s16 d1, q2 + vst1.8 {d0}, [r0], r1 + vst1.8 {d1}, [r0], r1 +.endr + bne loop_add_8x16 + pop {r4, r5} + bx lr +endfunc + +// void scale1D_128to64(pixel *dst, const pixel *src) +function x265_scale1D_128to64_neon + mov r12, #32 +.rept 2 + vld2.u8 {q8, q9}, [r1]! + vld2.u8 {q10, q11}, [r1]! + vld2.u8 {q12, q13}, [r1]! + vld2.u8 {q14, q15}, [r1], r12 + + vrhadd.u8 q0, q8, q9 + vrhadd.u8 q1, q10, q11 + vrhadd.u8 q2, q12, q13 + vrhadd.u8 q3, q14, q15 + + vst1.u8 {q0, q1}, [r0]! 
+ vst1.u8 {q2, q3}, [r0], r12 +.endr + bx lr +endfunc + +// void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride) +function x265_scale2D_64to32_neon + sub r2, #32 + mov r3, #16 +loop_scale2D: + subs r3, #1 +.rept 2 + vld2.8 {q8, q9}, [r1]! + vld2.8 {q10, q11}, [r1], r2 + vld2.8 {q12, q13}, [r1]! + vld2.8 {q14, q15}, [r1], r2 + + vaddl.u8 q0, d16, d18 + vaddl.u8 q1, d17, d19 + vaddl.u8 q2, d20, d22 + vaddl.u8 q3, d21, d23 + + vaddl.u8 q8, d24, d26 + vaddl.u8 q9, d25, d27 + vaddl.u8 q10, d28, d30 + vaddl.u8 q11, d29, d31 + + vadd.u16 q0, q8 + vadd.u16 q1, q9 + vadd.u16 q2, q10 + vadd.u16 q3, q11 + + vrshrn.u16 d16, q0, #2 + vrshrn.u16 d17, q1, #2 + vrshrn.u16 d18, q2, #2 + vrshrn.u16 d19, q3, #2 + vst1.8 {q8, q9}, [r0]! +.endr + bne loop_scale2D + bx lr +endfunc + +function x265_pixel_planecopy_cp_neon + push {r4, r5, r6, r7} + ldr r4, [sp, #4 * 4] + ldr r5, [sp, #4 * 4 + 4] + ldr r12, [sp, #4 * 4 + 8] + vdup.8 q2, r12 + sub r5, #1 + +.loop_h: + mov r6, r0 + mov r12, r2 + eor r7, r7 +.loop_w: + vld1.u8 {q0}, [r6]! + vshl.u8 q0, q0, q2 + vst1.u8 {q0}, [r12]! + + add r7, #16 + cmp r7, r4 + blt .loop_w + + add r0, r1 + add r2, r3 + + subs r5, #1 + bgt .loop_h + +// handle last row + mov r5, r4 + lsr r5, #3 + +.loopW8: + vld1.u8 d0, [r0]! + vshl.u8 d0, d0, d4 + vst1.u8 d0, [r2]! + subs r4, r4, #8 + subs r5, #1 + bgt .loopW8 + + mov r5,#8 + sub r5, r4 + sub r0, r5 + sub r2, r5 + vld1.u8 d0, [r0] + vshl.u8 d0, d0, d4 + vst1.u8 d0, [r2] + + pop {r4, r5, r6, r7} + bx lr +endfunc + +//******* satd ******* +.macro satd_4x4_neon + vld1.32 {d1[]}, [r2], r3 + vld1.32 {d0[]}, [r0,:32], r1 + vld1.32 {d3[]}, [r2], r3 + vld1.32 {d2[]}, [r0,:32], r1 + vld1.32 {d1[1]}, [r2], r3 + vld1.32 {d0[1]}, [r0,:32], r1 + vld1.32 {d3[1]}, [r2], r3 + vld1.32 {d2[1]}, [r0,:32], r1 + vsubl.u8 q0, d0, d1 + vsubl.u8 q1, d2, d3 + SUMSUB_AB q2, q3, q0, q1 + SUMSUB_ABCD d0, d2, d1, d3, d4, d5, d6, d7 + HADAMARD 1, sumsub, q2, q3, q0, q1 + HADAMARD 2, amax, q0,, q2, q3 + HORIZ_ADD d0, d0, d1 +.endm + +function x265_pixel_satd_4x4_neon + satd_4x4_neon + vmov.32 r0, d0[0] + bx lr +endfunc + +.macro LOAD_DIFF_8x4_1 q0 q1 q2 q3 + vld1.32 {d1}, [r2], r3 + vld1.32 {d0}, [r0,:64], r1 + vsubl.u8 \q0, d0, d1 + vld1.32 {d3}, [r2], r3 + vld1.32 {d2}, [r0,:64], r1 + vsubl.u8 \q1, d2, d3 + vld1.32 {d5}, [r2], r3 + vld1.32 {d4}, [r0,:64], r1 + vsubl.u8 \q2, d4, d5 + vld1.32 {d7}, [r2], r3 + vld1.32 {d6}, [r0,:64], r1 + vsubl.u8 \q3, d6, d7 +.endm + +.macro x265_satd_4x8_8x4_end_neon + vadd.s16 q0, q8, q10 + vadd.s16 q1, q9, q11 + vsub.s16 q2, q8, q10 + vsub.s16 q3, q9, q11 + + vtrn.16 q0, q1 + vadd.s16 q8, q0, q1 + vtrn.16 q2, q3 + vsub.s16 q9, q0, q1 + vadd.s16 q10, q2, q3 + vsub.s16 q11, q2, q3 + vtrn.32 q8, q10 + vabs.s16 q8, q8 + vtrn.32 q9, q11 + vabs.s16 q10, q10 + vabs.s16 q9, q9 + vabs.s16 q11, q11 + vmax.u16 q0, q8, q10 + vmax.u16 q1, q9, q11 + vadd.u16 q0, q0, q1 + HORIZ_ADD d0, d0, d1 +.endm + +.macro pixel_satd_4x8_neon + vld1.32 {d1[]}, [r2], r3 + vld1.32 {d0[]}, [r0,:32], r1 + vld1.32 {d3[]}, [r2], r3 + vld1.32 {d2[]}, [r0,:32], r1 + vld1.32 {d5[]}, [r2], r3 + vld1.32 {d4[]}, [r0,:32], r1 + vld1.32 {d7[]}, [r2], r3 + vld1.32 {d6[]}, [r0,:32], r1 + + vld1.32 {d1[1]}, [r2], r3 + vld1.32 {d0[1]}, [r0,:32], r1 + vsubl.u8 q0, d0, d1 + vld1.32 {d3[1]}, [r2], r3 + vld1.32 {d2[1]}, [r0,:32], r1 + vsubl.u8 q1, d2, d3 + vld1.32 {d5[1]}, [r2], r3 + vld1.32 {d4[1]}, [r0,:32], r1 + vsubl.u8 q2, d4, d5 + vld1.32 {d7[1]}, [r2], r3 + SUMSUB_AB q8, q9, q0, q1 + vld1.32 {d6[1]}, [r0,:32], r1 + vsubl.u8 q3, d6, d7 + SUMSUB_AB q10, q11, 
q2, q3 + x265_satd_4x8_8x4_end_neon +.endm + +function x265_pixel_satd_4x8_neon + pixel_satd_4x8_neon + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_4x16_neon + push {r4, r5} + eor r4, r4 + pixel_satd_4x8_neon + vmov.32 r5, d0[0] + add r4, r5 + pixel_satd_4x8_neon + vmov.32 r5, d0[0] + add r0, r5, r4 + pop {r4, r5} + bx lr +endfunc + +function x265_pixel_satd_4x32_neon + push {r4, r5} + eor r4, r4 +.rept 4 + pixel_satd_4x8_neon + vmov.32 r5, d0[0] + add r4, r5 +.endr + mov r0, r4 + pop {r4, r5} + bx lr +endfunc + +function x265_pixel_satd_12x16_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + mov r4, r0 + mov r5, r2 + eor r7, r7 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 + + add r0, r4, #4 + add r2, r5, #4 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 + + add r0, r4, #8 + add r2, r5, #8 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r0, r7, r6 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_12x32_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + mov r4, r0 + mov r5, r2 + eor r7, r7 +.rept 4 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 +.endr + + add r0, r4, #4 + add r2, r5, #4 +.rept 4 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 +.endr + + add r0, r4, #8 + add r2, r5, #8 +.rept 4 + pixel_satd_4x8_neon + vmov.32 r6, d0[0] + add r7, r6 +.endr + + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_8x4_neon + push {r4, r5, r6} + mov r4, r0 + mov r5, r2 + satd_4x4_neon + add r0, r4, #4 + add r2, r5, #4 + vmov.32 r6, d0[0] + satd_4x4_neon + vmov.32 r0, d0[0] + add r0, r0, r6 + pop {r4, r5, r6} + bx lr +endfunc + +function x265_pixel_satd_8x8_neon + mov ip, lr + push {r4, r5, r6, r7} + eor r4, r4 + mov r6, r0 + mov r7, r2 + pixel_satd_4x8_neon + vmov.32 r5, d0[0] + add r4, r5 + add r0, r6, #4 + add r2, r7, #4 + pixel_satd_4x8_neon + vmov.32 r5, d0[0] + add r0, r4, r5 + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_8x12_neon + push {r4, r5, r6, r7} + mov r4, r0 + mov r5, r2 + eor r7, r7 + satd_4x4_neon + vmov.32 r6, d0[0] + add r7, r6 + add r0, r4, #4 + add r2, r5, #4 + satd_4x4_neon + vmov.32 r6, d0[0] + add r7, r6 +.rept 2 + sub r0, #4 + sub r2, #4 + mov r4, r0 + mov r5, r2 + satd_4x4_neon + vmov.32 r6, d0[0] + add r7, r6 + add r0, r4, #4 + add r2, r5, #4 + satd_4x4_neon + vmov.32 r6, d0[0] + add r7, r6 +.endr + mov r0, r7 + pop {r4, r5, r6, r7} + bx lr +endfunc + +function x265_pixel_satd_8x16_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_8x8_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 + + bl x265_satd_8x8_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 + + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_8x32_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_8x8_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 +.rept 3 + bl x265_satd_8x8_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_8x64_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_8x8_neon + vadd.u16 q4, q12, q13 
+ vadd.u16 q5, q14, q15 +.rept 7 + bl x265_satd_8x8_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_satd_8x8_neon + LOAD_DIFF_8x4_1 q8, q9, q10, q11 + vld1.64 {d7}, [r2], r3 + vld1.64 {d6}, [r0,:64], r1 + vsubl.u8 q12, d6, d7 + SUMSUB_AB q0, q1, q8, q9 + + vld1.64 {d17}, [r2], r3 + vld1.64 {d16}, [r0,:64], r1 + vsubl.u8 q13, d16, d17 + SUMSUB_AB q2, q3, q10, q11 + + vld1.64 {d19}, [r2], r3 + vld1.64 {d18}, [r0,:64], r1 + vsubl.u8 q14, d18, d19 + SUMSUB_AB q8, q10, q0, q2 + + vld1.64 {d1}, [r2], r3 + vld1.64 {d0}, [r0,:64], r1 + vsubl.u8 q15, d0, d1 + SUMSUB_AB q9, q11, q1, q3 +endfunc + +// one vertical hadamard pass and two horizontal +function x265_satd_8x4v_8x8h_neon, export=0 + SUMSUB_ABCD q0, q1, q2, q3, q12, q13, q14, q15 + SUMSUB_AB q12, q14, q0, q2 + SUMSUB_AB q13, q15, q1, q3 + vtrn.16 q8, q9 + vtrn.16 q10, q11 + + SUMSUB_AB q0, q1, q8, q9 + SUMSUB_AB q2, q3, q10, q11 + vtrn.16 q12, q13 + vtrn.16 q14, q15 + + SUMSUB_AB q8, q9, q12, q13 + SUMSUB_AB q10, q11, q14, q15 + vtrn.32 q0, q2 + vtrn.32 q1, q3 + ABS2 q0, q2 + ABS2 q1, q3 + + vtrn.32 q8, q10 + vtrn.32 q9, q11 + ABS2 q8, q10 + ABS2 q9, q11 + + vmax.s16 q12, q0, q2 + vmax.s16 q13, q1, q3 + vmax.s16 q14, q8, q10 + vmax.s16 q15, q9, q11 + bx lr +endfunc + +function x265_satd_16x4_neon, export=0 + vld1.64 {d2-d3}, [r2], r3 + vld1.64 {d0-d1}, [r0,:128], r1 + vsubl.u8 q8, d0, d2 + vsubl.u8 q12, d1, d3 + + vld1.64 {d6-d7}, [r2], r3 + vld1.64 {d4-d5}, [r0,:128], r1 + vsubl.u8 q9, d4, d6 + vsubl.u8 q13, d5, d7 + + vld1.64 {d2-d3}, [r2], r3 + vld1.64 {d0-d1}, [r0,:128], r1 + vsubl.u8 q10, d0, d2 + vsubl.u8 q14, d1, d3 + + vld1.64 {d6-d7}, [r2], r3 + vld1.64 {d4-d5}, [r0,:128], r1 + vsubl.u8 q11, d4, d6 + vsubl.u8 q15, d5, d7 + + vadd.s16 q0, q8, q9 + vsub.s16 q1, q8, q9 + SUMSUB_AB q2, q3, q10, q11 + SUMSUB_ABCD q8, q10, q9, q11, q0, q2, q1, q3 + b x265_satd_8x4v_8x8h_neon +endfunc + +function x265_pixel_satd_16x4_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_16x4_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_16x8_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_16x4_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 + + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 + + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_16x12_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_16x4_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 +.rept 2 + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_16x16_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_16x4_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 +.rept 3 + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_16x24_neon + vpush {d8-d11} + mov ip, lr + bl x265_satd_16x4_neon + vadd.u16 
q4, q12, q13 + vadd.u16 q5, q14, q15 +.rept 5 + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +.macro pixel_satd_16x32_neon + bl x265_satd_16x4_neon + vadd.u16 q4, q12, q13 + vadd.u16 q5, q14, q15 +.rept 7 + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr +.endm + +function x265_pixel_satd_16x32_neon + vpush {d8-d11} + mov ip, lr + pixel_satd_16x32_neon + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_satd_16x64_neon + push {r6, r7} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + pixel_satd_16x32_neon + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r7, r6 + + veor q4, q5 + veor q5, q5 + pixel_satd_16x32_neon + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r0, r7, r6 + vpop {d8-d11} + pop {r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_24x32_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + mov r4, r0 + mov r5, r2 +.rept 3 + veor q4, q4 + veor q5, q5 +.rept 4 + bl x265_satd_8x8_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r7, r6 + add r4, #8 + add r5, #8 + mov r0, r4 + mov r2, r5 +.endr + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_24x64_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + mov r4, r0 + mov r5, r2 +.rept 3 + veor q4, q4 + veor q5, q5 +.rept 4 + bl x265_satd_8x8_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r7, r6 + add r4, #8 + add r5, #8 + mov r0, r4 + mov r2, r5 +.endr + + sub r4, #24 + sub r5, #24 + add r0, r4, r1, lsl #5 + add r2, r5, r3, lsl #5 + mov r4, r0 + mov r5, r2 +.rept 3 + veor q4, q4 + veor q5, q5 +.rept 4 + bl x265_satd_8x8_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endr + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r7, r6 + add r4, #8 + add r5, #8 + mov r0, r4 + mov r2, r5 +.endr + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +.macro pixel_satd_32x8 + mov r4, r0 + mov r5, r2 + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 + + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 + + add r0, r4, #16 + add r2, r5, #16 + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 + + bl x265_satd_16x4_neon + vadd.u16 q4, q4, q12 + vadd.u16 q5, q5, q13 + vadd.u16 q4, q4, q14 + vadd.u16 q5, q5, q15 +.endm + +function x265_pixel_satd_32x8_neon + push {r4, r5} + vpush {d8-d11} + mov ip, lr + veor q4, q4 + veor q5, q5 + pixel_satd_32x8 + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r0, d0[0] + vpop {d8-d11} + pop {r4, r5} + mov lr, ip + bx lr +endfunc + +.macro satd_32x16_neon + veor q4, q4 + veor q5, q5 + pixel_satd_32x8 + sub r0, #16 + sub r2, #16 + pixel_satd_32x8 + vadd.u16 q0, q4, q5 + 
HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] +.endm + +function x265_pixel_satd_32x16_neon + push {r4, r5, r6} + vpush {d8-d11} + mov ip, lr + satd_32x16_neon + mov r0, r6 + vpop {d8-d11} + pop {r4, r5, r6} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_32x24_neon + push {r4, r5, r6} + vpush {d8-d11} + mov ip, lr + satd_32x16_neon + veor q4, q4 + veor q5, q5 + sub r0, #16 + sub r2, #16 + pixel_satd_32x8 + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r0, d0[0] + add r0, r6 + vpop {d8-d11} + pop {r4, r5, r6} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_32x32_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + satd_32x16_neon + sub r0, #16 + sub r2, #16 + add r7, r6 + satd_32x16_neon + add r0, r7, r6 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_32x48_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + eor r7, r7 +.rept 2 + satd_32x16_neon + sub r0, #16 + sub r2, #16 + add r7, r6 +.endr + satd_32x16_neon + add r0, r7, r6 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_32x64_neon + push {r4, r5, r6, r7} + vpush {d8-d11} + mov ip, lr + eor r7, r7 +.rept 3 + satd_32x16_neon + sub r0, #16 + sub r2, #16 + add r7, r6 +.endr + satd_32x16_neon + add r0, r7, r6 + vpop {d8-d11} + pop {r4, r5, r6, r7} + mov lr, ip + bx lr +endfunc + +.macro satd_64x16_neon + mov r8, r0 + mov r9, r2 + satd_32x16_neon + add r7, r6 + add r0, r8, #32 + add r2, r9, #32 + satd_32x16_neon + add r7, r6 +.endm + +function x265_pixel_satd_64x16_neon + push {r4, r5, r6, r7, r8, r9} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + satd_64x16_neon + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7, r8, r9} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_64x32_neon + push {r4, r5, r6, r7, r8, r9} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + satd_64x16_neon + sub r0, #48 + sub r2, #48 + satd_64x16_neon + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7, r8, r9} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_64x48_neon + push {r4, r5, r6, r7, r8, r9} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + satd_64x16_neon + sub r0, #48 + sub r2, #48 + satd_64x16_neon + sub r0, #48 + sub r2, #48 + satd_64x16_neon + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7, r8, r9} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_64x64_neon + push {r4, r5, r6, r7, r8, r9} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + satd_64x16_neon + sub r0, #48 + sub r2, #48 + satd_64x16_neon + sub r0, #48 + sub r2, #48 + satd_64x16_neon + sub r0, #48 + sub r2, #48 + satd_64x16_neon + mov r0, r7 + vpop {d8-d11} + pop {r4, r5, r6, r7, r8, r9} + mov lr, ip + bx lr +endfunc + +function x265_pixel_satd_48x64_neon + push {r4, r5, r6, r7, r8, r9} + vpush {d8-d11} + mov ip, lr + eor r7, r7 + mov r8, r0 + mov r9, r2 +.rept 3 + satd_32x16_neon + sub r0, #16 + sub r2, #16 + add r7, r6 +.endr + satd_32x16_neon + add r7, r6 + + add r0, r8, #32 + add r2, r9, #32 + pixel_satd_16x32_neon + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r7, r6 + + veor q4, q5 + veor q5, q5 + pixel_satd_16x32_neon + vadd.u16 q0, q4, q5 + HORIZ_ADD d0, d0, d1 + vmov.32 r6, d0[0] + add r0, r7, r6 + + vpop {d8-d11} + pop {r4, r5, r6, r7, r8, r9} + mov lr, ip + bx lr +endfunc + +.macro LOAD_DIFF_8x4 q0 q1 q2 q3 + vld1.32 {d1}, [r2], r3 + vld1.32 {d0}, [r0,:64], r1 + vsubl.u8 \q0, d0, d1 + vld1.32 {d3}, [r2], r3 + vld1.32 {d2}, [r0,:64], r1 + vsubl.u8 \q1, d2, d3 + vld1.32 {d5}, [r2], r3 + vld1.32 {d4}, 
[r0,:64], r1 + vsubl.u8 \q2, d4, d5 + vld1.32 {d7}, [r2], r3 + vld1.32 {d6}, [r0,:64], r1 + vsubl.u8 \q3, d6, d7 +.endm + +.macro HADAMARD4_V r1, r2, r3, r4, t1, t2, t3, t4 + SUMSUB_ABCD \t1, \t2, \t3, \t4, \r1, \r2, \r3, \r4 + SUMSUB_ABCD \r1, \r3, \r2, \r4, \t1, \t3, \t2, \t4 +.endm + +.macro sa8d_satd_8x8 satd= +function x265_sa8d_\satd\()8x8_neon, export=0 + LOAD_DIFF_8x4 q8, q9, q10, q11 + vld1.64 {d7}, [r2], r3 + SUMSUB_AB q0, q1, q8, q9 + vld1.64 {d6}, [r0,:64], r1 + vsubl.u8 q12, d6, d7 + vld1.64 {d17}, [r2], r3 + SUMSUB_AB q2, q3, q10, q11 + vld1.64 {d16}, [r0,:64], r1 + vsubl.u8 q13, d16, d17 + vld1.64 {d19}, [r2], r3 + SUMSUB_AB q8, q10, q0, q2 + vld1.64 {d18}, [r0,:64], r1 + vsubl.u8 q14, d18, d19 + vld1.64 {d1}, [r2], r3 + SUMSUB_AB q9, q11, q1, q3 + vld1.64 {d0}, [r0,:64], r1 + vsubl.u8 q15, d0, d1 + + HADAMARD4_V q12, q13, q14, q15, q0, q1, q2, q3 + + SUMSUB_ABCD q0, q8, q1, q9, q8, q12, q9, q13 + SUMSUB_AB q2, q10, q10, q14 + vtrn.16 q8, q9 + SUMSUB_AB q3, q11, q11, q15 + vtrn.16 q0, q1 + SUMSUB_AB q12, q13, q8, q9 + vtrn.16 q10, q11 + SUMSUB_AB q8, q9, q0, q1 + vtrn.16 q2, q3 + SUMSUB_AB q14, q15, q10, q11 + vadd.i16 q10, q2, q3 + vtrn.32 q12, q14 + vsub.i16 q11, q2, q3 + vtrn.32 q13, q15 + SUMSUB_AB q0, q2, q12, q14 + vtrn.32 q8, q10 + SUMSUB_AB q1, q3, q13, q15 + vtrn.32 q9, q11 + SUMSUB_AB q12, q14, q8, q10 + SUMSUB_AB q13, q15, q9, q11 + + vswp d1, d24 + ABS2 q0, q12 + vswp d3, d26 + ABS2 q1, q13 + vswp d5, d28 + ABS2 q2, q14 + vswp d7, d30 + ABS2 q3, q15 + vmax.s16 q8, q0, q12 + vmax.s16 q9, q1, q13 + vmax.s16 q10, q2, q14 + vmax.s16 q11, q3, q15 + vadd.i16 q8, q8, q9 + vadd.i16 q9, q10, q11 + + bx lr +endfunc +.endm + +sa8d_satd_8x8 + +function x265_pixel_sa8d_8x8_neon + mov ip, lr + bl x265_sa8d_8x8_neon + vadd.u16 q0, q8, q9 + HORIZ_ADD d0, d0, d1 + mov lr, ip + vmov.32 r0, d0[0] + add r0, r0, #1 + lsr r0, r0, #1 + bx lr +endfunc + +function x265_pixel_sa8d_8x16_neon + push {r4, r5} + mov ip, lr + bl x265_sa8d_8x8_neon + vadd.u16 q0, q8, q9 + HORIZ_ADD d0, d0, d1 + vmov.32 r5, d0[0] + add r5, r5, #1 + lsr r5, r5, #1 + bl x265_sa8d_8x8_neon + vadd.u16 q0, q8, q9 + HORIZ_ADD d0, d0, d1 + vmov.32 r4, d0[0] + add r4, r4, #1 + lsr r4, r4, #1 + add r0, r4, r5 + mov lr, ip + pop {r4, r5} + bx lr +endfunc + +function x265_pixel_sa8d_16x16_neon + vpush {d8 - d11} + mov ip, lr + bl x265_sa8d_8x8_neon + vpaddl.u16 q4, q8 + vpaddl.u16 q5, q9 + bl x265_sa8d_8x8_neon + vpadal.u16 q4, q8 + vpadal.u16 q5, q9 + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + bl x265_sa8d_8x8_neon + vpadal.u16 q4, q8 + vpadal.u16 q5, q9 + bl x265_sa8d_8x8_neon + vpaddl.u16 q8, q8 + vpaddl.u16 q9, q9 + vadd.u32 q0, q4, q8 + vadd.u32 q1, q5, q9 + vadd.u32 q0, q0, q1 + vadd.u32 d0, d0, d1 + vpadd.u32 d0, d0, d0 + vpop {d8-d11} + mov lr, ip + vmov.32 r0, d0[0] + add r0, r0, #1 + lsr r0, r0, #1 + bx lr +endfunc + +function x265_quant_neon + push {r4-r6} + ldr r4, [sp, #3* 4] + mov r12, #1 + lsl r12, r4 + vdup.s32 d0, r12 // q0 = 2^qbits + neg r12, r4 + vdup.s32 q1, r12 // q1= -qbits + add r12, #8 + vdup.s32 q2, r12 // q2= -qbits+8 + ldr r4, [sp, #3* 4 + 4] + vdup.s32 q3, r4 // q3= add + ldr r4, [sp, #3* 4 + 8] // r4= numcoeff + + lsr r4, r4 ,#2 + veor.s32 q4, q4 // q4= accumulate numsig + eor r5, r5 + veor.s32 q12, q12 + +.loop_quant: + + vld1.s16 d16, [r0]! + vmovl.s16 q9, d16 // q9= coef[blockpos] + + vclt.s32 q8, q9, #0 // q8= sign + + vabs.s32 q9, q9 // q9= level=abs(coef[blockpos]) + vld1.s32 {q10}, [r1]! 
// q10= quantCoeff[blockpos] + vmul.i32 q9, q9, q10 // q9 = tmplevel = abs(level) * quantCoeff[blockpos]; + + vadd.s32 q10, q9, q3 // q10= tmplevel+add + vshl.s32 q10, q10, q1 // q10= level =(tmplevel+add) >> qbits + + vmls.s32 q9, q10, d0[0] // q10= tmplevel - (level << qBits) + vshl.s32 q11, q9, q2 // q11= ((tmplevel - (level << qBits)) >> qBits8) + vst1.s32 {q11}, [r2]! // store deltaU + + // numsig + vceq.s32 q11, q10, q12 + vadd.s32 q4, q11 + add r5, #4 + + veor.s32 q11, q10, q8 + vsub.s32 q11, q11, q8 + vqmovn.s32 d16, q11 + vst1.s16 d16, [r3]! + + subs r4, #1 + bne .loop_quant + + vadd.u32 d8, d9 + vpadd.u32 d8, d8 + vmov.32 r12, d8[0] + add r0, r5, r12 + + pop {r4-r6} + bx lr +endfunc + +function x265_nquant_neon + push {r4} + neg r12, r3 + vdup.s32 q0, r12 // q0= -qbits + ldr r3, [sp, #1* 4] + vdup.s32 q1, r3 // add + ldr r3, [sp, #1* 4 + 4] // numcoeff + + lsr r3, r3 ,#2 + veor.s32 q4, q4 // q4= accumulate numsig + eor r4, r4 + veor.s32 q12, q12 + +.loop_nquant: + + vld1.s16 d16, [r0]! + vmovl.s16 q9, d16 // q9= coef[blockpos] + + vclt.s32 q8, q9, #0 // q8= sign + + vabs.s32 q9, q9 // q9= level=abs(coef[blockpos]) + vld1.s32 {q10}, [r1]! // q10= quantCoeff[blockpos] + vmul.i32 q9, q9, q10 // q9 = tmplevel = abs(level) * quantCoeff[blockpos]; + + vadd.s32 q10, q9, q1 // q10= tmplevel+add + vshl.s32 q10, q10, q0 // q10= level =(tmplevel+add) >> qbits + + // numsig + vceq.s32 q11, q10, q12 + vadd.s32 q4, q11 + add r4, #4 + + veor.s32 q11, q10, q8 + vsub.s32 q11, q11, q8 + vqmovn.s32 d16, q11 + vabs.s16 d17, d16 + vst1.s16 d17, [r2]! + + subs r3, #1 + bne .loop_nquant + + vadd.u32 d8, d9 + vpadd.u32 d8, d8 + vmov.32 r12, d8[0] + add r0, r4, r12 + + pop {r4} + bx lr +endfunc +.macro sa8d_16x16 reg + bl x265_sa8d_8x8_neon + vpaddl.u16 q4, q8 + vpaddl.u16 q5, q9 + bl x265_sa8d_8x8_neon + vpadal.u16 q4, q8 + vpadal.u16 q5, q9 + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + bl x265_sa8d_8x8_neon + vpadal.u16 q4, q8 + vpadal.u16 q5, q9 + bl x265_sa8d_8x8_neon + vpaddl.u16 q8, q8 + vpaddl.u16 q9, q9 + vadd.u32 q0, q4, q8 + vadd.u32 q1, q5, q9 + vadd.u32 q0, q0, q1 + vadd.u32 d0, d0, d1 + vpadd.u32 d0, d0, d0 + vmov.32 \reg, d0[0] + add \reg, \reg, #1 + lsr \reg, \reg, #1 +.endm + +function x265_pixel_sa8d_16x32_neon + push {r4, r5} + vpush {d8 - d11} + mov ip, lr + + sa8d_16x16 r4 + + sub r0, r0, #8 + sub r2, r2, #8 + + sa8d_16x16 r5 + + add r0, r4, r5 + vpop {d8 - d11} + pop {r4, r5} + mov lr, ip + bx lr +endfunc + +function x265_pixel_sa8d_32x32_neon + push {r4 - r7} + vpush {d8 - d11} + mov ip, lr + + sa8d_16x16 r4 + + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + + sa8d_16x16 r5 + + sub r0, r0, #24 + sub r2, r2, #24 + + sa8d_16x16 r6 + + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + + sa8d_16x16 r7 + + add r4, r4, r5 + add r6, r6, r7 + add r0, r4, r6 + vpop {d8 - d11} + pop {r4 - r7} + mov lr, ip + bx lr +endfunc + +function x265_pixel_sa8d_32x64_neon + push {r4 - r10} + vpush {d8 - d11} + mov ip, lr + + mov r10, #4 + eor r9, r9 + +.loop_32: + + sa8d_16x16 r4 + + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + + sa8d_16x16 r5 + + add r4, r4, r5 + add r9, r9, r4 + + sub r0, r0, #24 + sub r2, r2, #24 + + subs r10, #1 + bgt .loop_32 + + mov r0, r9 + vpop {d8-d11} + pop {r4-r10} + mov lr, ip + bx lr +endfunc + +function x265_pixel_sa8d_64x64_neon + push {r4-r10} + vpush {d8-d11} + mov ip, lr + + mov r10, #4 + eor r9, r9 + +.loop_1: + + 
sa8d_16x16 r4 + + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + + sa8d_16x16 r5 + + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + + sa8d_16x16 r6 + + sub r0, r0, r1, lsl #4 + sub r2, r2, r3, lsl #4 + add r0, r0, #8 + add r2, r2, #8 + + sa8d_16x16 r7 + + add r4, r4, r5 + add r6, r6, r7 + add r8, r4, r6 + add r9, r9, r8 + + sub r0, r0, #56 + sub r2, r2, #56 + + subs r10, #1 + bgt .loop_1 + + mov r0, r9 + vpop {d8-d11} + pop {r4-r10} + mov lr, ip + bx lr +endfunc + +/***** dequant_scaling*****/ +// void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift) +function x265_dequant_scaling_neon + push {r4, r5, r6, r7} + ldr r4, [sp, #16] // per + ldr r5, [sp, #20] //.shift + add r5, #4 // shift + 4 + lsr r3, #3 // num / 8 + cmp r5, r4 + blt skip + + mov r12, #1 + sub r6, r5, r4 // shift - per + sub r6, #1 // shift - per - 1 + lsl r6, r12, r6 // 1 << shift - per - 1 (add) + vdup.32 q0, r6 + sub r7, r4, r5 // per - shift + vdup.32 q3, r7 + +dequant_loop1: + vld1.16 {q9}, [r0]! // quantCoef + vld1.32 {q2}, [r1]! // deQuantCoef + vld1.32 {q10}, [r1]! + vmovl.s16 q1, d18 + vmovl.s16 q9, d19 + + vmul.s32 q1, q2 // quantCoef * deQuantCoef + vmul.s32 q9, q10 + vadd.s32 q1, q0 // quantCoef * deQuantCoef + add + vadd.s32 q9, q0 + + vshl.s32 q1, q3 + vshl.s32 q9, q3 + vqmovn.s32 d16, q1 // x265_clip3 + vqmovn.s32 d17, q9 + subs r3, #1 + vst1.16 {q8}, [r2]! + bne dequant_loop1 + b 1f + +skip: + sub r6, r4, r5 // per - shift + vdup.16 q0, r6 + +dequant_loop2: + vld1.16 {q9}, [r0]! // quantCoef + vld1.32 {q2}, [r1]! // deQuantCoef + vld1.32 {q10}, [r1]! + vmovl.s16 q1, d18 + vmovl.s16 q9, d19 + + vmul.s32 q1, q2 // quantCoef * deQuantCoef + vmul.s32 q9, q10 + vqmovn.s32 d16, q1 // x265_clip3 + vqmovn.s32 d17, q9 + + vqshl.s16 q8, q0 // coefQ << per - shift + subs r3, #1 + vst1.16 {q8}, [r2]! + bne dequant_loop2 +1: + pop {r4, r5, r6, r7} + bx lr +endfunc + +// void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift) +function x265_dequant_normal_neon + ldr r12, [sp] // shift +#if HIGH_BIT_DEPTH // NEVER TEST path + cmp r3, #32768 + lsrlt r3, #(BIT_DEPTH - 8) + sublt r12, #(BIT_DEPTH - 8) +#endif + lsr r2, #4 // num / 16 + + neg r12, r12 + vdup.16 q0, r3 + vdup.32 q1, r12 + +.dqn_loop1: + vld1.16 {d4-d7}, [r0]! + + vmull.s16 q8, d4, d0 + vmull.s16 q9, d5, d0 + vmull.s16 q10, d6, d0 + vmull.s16 q11, d7, d0 + + vrshl.s32 q8, q1 + vrshl.s32 q9, q1 + vrshl.s32 q10, q1 + vrshl.s32 q11, q1 + vqmovn.s32 d16, q8 + vqmovn.s32 d17, q9 + vqmovn.s32 d18, q10 + vqmovn.s32 d19, q11 + + subs r2, #1 + vst1.16 {d16-d19}, [r1]! 
+ bgt .dqn_loop1 + bx lr +endfunc + +/********* ssim ***********/ +// void x265_ssim_4x4x2_core_neon(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4]); +function x265_ssim_4x4x2_core_neon + ldr r12, [sp] + + vld1.64 {d0}, [r0], r1 + vld1.64 {d1}, [r0], r1 + vld1.64 {d2}, [r0], r1 + vld1.64 {d3}, [r0], r1 + + vld1.64 {d4}, [r2], r3 + vld1.64 {d5}, [r2], r3 + vld1.64 {d6}, [r2], r3 + vld1.64 {d7}, [r2], r3 + + vpaddl.u8 q8, q0 + vpadal.u8 q8, q1 + vpaddl.u8 q9, q2 + vpadal.u8 q9, q3 + vadd.u16 d16, d17 + vpaddl.u16 d16, d16 + vadd.u16 d18, d19 + vpaddl.u16 d17, d18 + + vmull.u8 q10, d0, d0 + vmull.u8 q11, d1, d1 + vmull.u8 q12, d2, d2 + vmull.u8 q13, d3, d3 + vpaddl.u16 q10, q10 + vpadal.u16 q10, q11 + vpadal.u16 q10, q12 + vpadal.u16 q10, q13 + + vmull.u8 q9, d4, d4 + vmull.u8 q11, d5, d5 + vmull.u8 q12, d6, d6 + vmull.u8 q13, d7, d7 + vpadal.u16 q10, q9 + vpadal.u16 q10, q11 + vpadal.u16 q10, q12 + vpadal.u16 q10, q13 + vpadd.u32 d18, d20, d21 + + vmull.u8 q10, d0, d4 + vmull.u8 q11, d1, d5 + vmull.u8 q12, d2, d6 + vmull.u8 q13, d3, d7 + vpaddl.u16 q10, q10 + vpadal.u16 q10, q11 + vpadal.u16 q10, q12 + vpadal.u16 q10, q13 + vpadd.u32 d19, d20, d21 + + vst4.32 {d16-d19}, [r12] + bx lr +endfunc + +// int psyCost_pp(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride) +function x265_psyCost_4x4_neon + vld1.32 {d16[]}, [r0,:32], r1 // d16 = [A03 A02 A01 A00 A03 A02 A01 A00] + vld1.32 {d17[]}, [r0,:32], r1 // d17 = [A13 A12 A11 A10 A13 A12 A11 A10] + vld1.32 {d16[1]}, [r0,:32], r1 // d16 = [A23 A22 A21 A20 A03 A02 A01 A00] + vld1.32 {d17[1]}, [r0,:32], r1 // d17 = [A33 A32 A31 A30 A13 A12 A11 A10] + + vld1.32 {d18[]}, [r2,:32], r3 // d18 = [B03 B02 B01 B00 B03 B02 B01 B00] + vld1.32 {d19[]}, [r2,:32], r3 // d19 = [B13 B12 B11 B10 B13 B12 B11 B10] + vld1.32 {d18[1]}, [r2,:32], r3 // d18 = [B23 B22 B21 B20 B03 B02 B01 B00] + vld1.32 {d19[1]}, [r2,:32], r3 // d19 = [B33 B32 B31 B30 B13 B12 B11 B10] + + vaddl.u8 q2, d16, d17 // q2 = [2+3 0+1] + vsubl.u8 q3, d16, d17 // q3 = [2-3 0-1] + vaddl.u8 q12, d18, d19 + vsubl.u8 q13, d18, d19 + + SUMSUB_ABCD d0, d2, d1, d3, d4, d5, d6, d7 // q0 = [(0-1)+(2-3) (0+1)+(2+3)], q1 = [(0-1)-(2-3) (0+1)-(2+3)] + SUMSUB_ABCD d20, d22, d21, d23, d24, d25, d26, d27 + + // Hadamard-1D + vtrn.16 q0, q1 + vtrn.16 q10, q11 + SUMSUB_AB q2, q3, q0, q1 // q2 = [((0-1)-(2-3))+((0-1)+(2-3)) ((0+1)-(2+3))+((0+1)+(2+3))], q3 = [((0-1)-(2-3))-((0-1)+(2-3)) ((0+1)-(2+3))-((0+1)+(2+3))] + SUMSUB_AB q12, q13, q10, q11 + + // SAD Stage-0 + vaddl.u8 q14, d16, d17 // q14 = [S23x4 S01x4] + vaddl.u8 q15, d18, d19 + + // Hadamard-2D + vtrn.32 q2, q3 + vtrn.32 q12, q13 + vabs.s16 q2, q2 + vabs.s16 q12, q12 + vabs.s16 q3, q3 + vabs.s16 q13, q13 + + // SAD Stage-1 + vadd.u16 d28, d29 // SAD: reduce to 4 elements + vadd.u16 d30, d31 + + vmax.s16 q0, q2, q3 + vmax.s16 q10, q12, q13 + + // SAD Stage-2 + vpadd.u16 d28, d30 // SAD: reduce to 2 elements + + // SAD & SATD Final Stage + vswp d1, d20 + vadd.u16 q0, q10 + vpaddl.u16 d28, d28 // d28 = SAD_DWORD[B A] + vpadd.u16 d0, d1 + vshr.u32 d28, #2 // d28 = SAD_DWORD[B A] >> 2 + vpaddl.u16 d0, d0 // d0 = SATD_DWORD[B A] + vsub.s32 d0, d28 // d0 = SATD - SAD + vmov.32 r0, d0[0] + vmov.32 r1, d0[1] + subs r0, r1 + rsbmi r0, r0, #0 + + bx lr +endfunc +
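Several of the kernels above annotate their intent inline; the quant loop in particular maps one-to-one onto a short scalar model. The sketch below follows those inline comments (tmplevel, level, deltaU, sign restore, and the significant-coefficient count returned in r0); it is an illustrative model, not x265's actual C reference, and it ignores the saturating narrow performed by vqmovn.s32:

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t quant_ref(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU,
                              int16_t* qCoef, int qBits, int add, int numCoeff)
    {
        uint32_t numSig = 0;                                      /* nonzero levels, returned in r0 */
        for (int i = 0; i < numCoeff; i++)
        {
            int32_t sign  = coef[i] < 0 ? -1 : 0;                 /* vclt.s32: all-ones mask if negative */
            int64_t tmp   = (int64_t)abs(coef[i]) * quantCoeff[i];    /* tmplevel */
            int32_t level = (int32_t)((tmp + add) >> qBits);      /* (tmplevel + add) >> qBits */
            deltaU[i] = (int32_t)((tmp - ((int64_t)level << qBits)) >> (qBits - 8));
            qCoef[i]  = (int16_t)((level ^ sign) - sign);         /* veor/vsub: reapply sign */
            numSig   += (level != 0);
        }
        return numSig;
    }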
View file
x265_2.0.tar.gz/source/common/arm/pixel-util.h
Added
@@ -0,0 +1,92 @@
+/*****************************************************************************
+ * Copyright (C) 2016 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_PIXEL_UTIL_ARM_H
+#define X265_PIXEL_UTIL_ARM_H
+
+uint64_t x265_pixel_var_8x8_neon(const pixel* pix, intptr_t stride);
+uint64_t x265_pixel_var_16x16_neon(const pixel* pix, intptr_t stride);
+uint64_t x265_pixel_var_32x32_neon(const pixel* pix, intptr_t stride);
+uint64_t x265_pixel_var_64x64_neon(const pixel* pix, intptr_t stride);
+
+void x265_getResidual4_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual8_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual16_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual32_neon(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+
+void x265_scale1D_128to64_neon(pixel *dst, const pixel *src);
+void x265_scale2D_64to32_neon(pixel* dst, const pixel* src, intptr_t stride);
+
+int x265_pixel_satd_4x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_4x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_4x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_4x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_8x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_8x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_8x12_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_8x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_8x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_8x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_12x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_12x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x12_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x24_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_16x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_24x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_24x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_32x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_32x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_32x24_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_32x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_32x48_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_32x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_48x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_64x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_64x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_64x48_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+int x265_pixel_satd_64x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2);
+
+int x265_pixel_sa8d_8x8_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+int x265_pixel_sa8d_8x16_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+int x265_pixel_sa8d_16x16_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+int x265_pixel_sa8d_16x32_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+int x265_pixel_sa8d_32x32_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+int x265_pixel_sa8d_32x64_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+int x265_pixel_sa8d_64x64_neon(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2);
+
+uint32_t x265_quant_neon(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff);
+uint32_t x265_nquant_neon(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff);
+
+void x265_dequant_scaling_neon(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift);
+void x265_dequant_normal_neon(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
+
+void x265_ssim_4x4x2_core_neon(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4]);
+
+int PFX(psyCost_4x4_neon)(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
+
+#endif // ifndef X265_PIXEL_UTIL_ARM_H
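For context on the dequant declarations just above: per the comment in pixel-util.S, dequant_normal reduces to a widening multiply by the scale, a rounding right shift (vrshl.s32 by a negative amount), and a saturating narrow to int16 (vqmovn.s32). A scalar sketch of that contract, assuming the non-HIGH_BIT_DEPTH path:

    #include <stdint.h>

    static inline int16_t clip16(int32_t x)   /* models vqmovn.s32 */
    {
        return (int16_t)(x < -32768 ? -32768 : (x > 32767 ? 32767 : x));
    }

    static void dequant_normal_ref(const int16_t* quantCoef, int16_t* coef,
                                   int num, int scale, int shift)
    {
        for (int i = 0; i < num; i++)
        {
            /* vmull.s16 by scale, then round-to-nearest right shift */
            int32_t v = ((int32_t)quantCoef[i] * scale + (1 << (shift - 1))) >> shift;
            coef[i] = clip16(v);
        }
    }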
View file
x265_2.0.tar.gz/source/common/arm/pixel.h
Added
@@ -0,0 +1,215 @@ +/***************************************************************************** + * pixel.h: x86 pixel metrics + ***************************************************************************** + * Copyright (C) 2003-2013 x264 project + * Copyright (C) 2013-2016 x265 project + * + * Authors: Laurent Aimar <fenrir@via.ecp.fr> + * Loren Merritt <lorenm@u.washington.edu> + * Fiona Glaser <fiona@x264.com> + * Min Chen <chenm003@163.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_I386_PIXEL_ARM_H +#define X265_I386_PIXEL_ARM_H + +int x265_pixel_sad_4x4_armv6(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_4x8_armv6(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_4x16_armv6(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_8x4_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_8x8_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_8x16_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_8x32_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_16x4_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_16x8_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_16x16_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_16x12_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_16x32_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_16x64_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_32x8_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_32x16_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_32x32_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_32x64_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_32x24_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_64x16_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_64x32_neon(const pixel* 
dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_64x64_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_64x48_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_12x16_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_24x32_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +int x265_pixel_sad_48x64_neon(const pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); + +void x265_pixel_avg_pp_4x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_4x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_4x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_8x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_8x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_8x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_8x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_12x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_16x4_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_16x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_16x12_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_16x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_16x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_16x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_24x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_32x8_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_32x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_32x24_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_32x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_32x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); 
+void x265_pixel_avg_pp_48x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_64x16_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_64x32_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_64x48_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); +void x265_pixel_avg_pp_64x64_neon (pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); + +void x265_sad_x3_4x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_4x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_4x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_8x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_8x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_8x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_8x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_12x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_16x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_16x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_16x12_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_16x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_16x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_16x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_24x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_32x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_32x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_32x24_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_32x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_32x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, 
const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_48x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_64x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_64x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_64x48_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); +void x265_sad_x3_64x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); + +void x265_sad_x4_4x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_4x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_4x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_8x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_8x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_8x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_8x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_12x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_16x4_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_16x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_16x12_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_16x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_16x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_16x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_24x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_32x8_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_32x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); 
+void x265_sad_x4_32x24_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_32x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_32x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_48x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_64x16_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_64x32_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_64x48_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); +void x265_sad_x4_64x64_neon(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); + +sse_t x265_pixel_sse_pp_4x4_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_pp_8x8_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_pp_16x16_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_pp_32x32_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_pp_64x64_neon(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); + +sse_t x265_pixel_sse_ss_4x4_neon(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_ss_8x8_neon(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_ss_16x16_neon(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_ss_32x32_neon(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2); +sse_t x265_pixel_sse_ss_64x64_neon(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2); + +sse_t x265_pixel_ssd_s_4x4_neon(const int16_t* a, intptr_t dstride); +sse_t x265_pixel_ssd_s_8x8_neon(const int16_t* a, intptr_t dstride); +sse_t x265_pixel_ssd_s_16x16_neon(const int16_t* a, intptr_t dstride); +sse_t x265_pixel_ssd_s_32x32_neon(const int16_t* a, intptr_t dstride); +sse_t x265_pixel_ssd_s_64x64_neon(const int16_t* a, intptr_t dstride); + +void x265_pixel_sub_ps_4x4_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_8x8_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_16x16_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_32x32_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_64x64_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t 
sstride1); +void x265_pixel_sub_ps_4x8_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_8x16_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_16x32_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_sub_ps_32x64_neon(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); + +void x265_pixel_add_ps_4x4_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_8x8_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_16x16_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_32x32_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_64x64_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_4x8_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_8x16_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_16x32_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); +void x265_pixel_add_ps_32x64_neon(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); + +void x265_pixel_planecopy_cp_neon(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); + +void x265_addAvg_4x4_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_4x8_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_4x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x4_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x8_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_12x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_16x4_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_16x8_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_16x12_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void 
x265_addAvg_16x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_16x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_16x64_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_24x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_32x8_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_32x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_32x24_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_32x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_32x64_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_48x64_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_64x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_64x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_64x48_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_64x64_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); + +void x265_addAvg_4x2_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_4x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_6x8_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_6x16_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x2_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x6_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x12_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_8x64_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_12x32_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_16x24_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_24x64_neon(const int16_t* src0, const int16_t* 
src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +void x265_addAvg_32x48_neon(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride); +#endif // ifndef X265_I386_PIXEL_ARM_H
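
All of the SAD declarations above share one contract. A minimal scalar sketch of it follows; this code is not part of the patch, and it assumes x265's default 8-bit build (high-bit-depth builds use a 16-bit pixel type):

    #include <cstdint>
    #include <cstdlib>

    typedef uint8_t pixel;  // assumption: default 8-bit build

    // Sum of absolute differences over an lx-by-ly block; each NEON entry
    // point declared above is a fixed-size specialization of this loop,
    // which lets the assembly fully unroll and keep accumulators in registers.
    template<int lx, int ly>
    int sad(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
    {
        int sum = 0;
        for (int y = 0; y < ly; y++)
        {
            for (int x = 0; x < lx; x++)
                sum += abs(pix1[x] - pix2[x]);
            pix1 += stride_pix1;
            pix2 += stride_pix2;
        }
        return sum;
    }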
View file
x265_2.0.tar.gz/source/common/arm/sad-a.S
Added
@@ -0,0 +1,1356 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: David Conrad <lessen42@gmail.com> + * Janne Grunau <janne-x264@jannau.net> + * Dnyaneshwar G <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata + +.align 4 +sad12_mask: +.byte 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0 + +.text + +/* sad4x4(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) + * + * r0 - dst + * r1 - dstStride + * r2 - src + * r3 - srcStride */ + +.macro SAD4_ARMV6 h +function x265_pixel_sad_4x\h\()_armv6 + push {r4-r6,lr} + ldr r4, [r2], r3 + ldr r5, [r0], r1 + ldr r6, [r2], r3 + ldr lr, [r0], r1 + usad8 ip, r4, r5 +.rept (\h - 2)/2 + ldr r4, [r2], r3 + ldr r5, [r0], r1 + usada8 ip, r6, lr, ip + ldr r6, [r2], r3 + ldr lr, [r0], r1 + usada8 ip, r4, r5, ip +.endr + usada8 r0, r6, lr, ip + pop {r4-r6,pc} +endfunc +.endm + +SAD4_ARMV6 4 +SAD4_ARMV6 8 +SAD4_ARMV6 16 + +.macro SAD8_NEON h +function x265_pixel_sad_8x\h\()_neon + vld1.8 d0, [r0], r1 // row 0 + vld1.8 d1, [r2], r3 // row 1 + vabdl.u8 q1, d0, d1 + +.rept \h-1 + vld1.8 d0, [r0], r1 // row 2,4,6 + vld1.8 d1, [r2], r3 // row 3,5,7 + vabal.u8 q1, d0, d1 +.endr + + vadd.u16 d2, d2, d3 + vpadd.u16 d0, d2, d2 + vpaddl.u16 d0, d0 + vmov.u32 r0, d0[0] + bx lr +endfunc +.endm + +SAD8_NEON 4 +SAD8_NEON 8 +SAD8_NEON 16 +SAD8_NEON 32 + +.macro SAD16_NEON h +function x265_pixel_sad_16x\h\()_neon + vld1.8 {q0}, [r0], r1 // row 0 + vld1.8 {q1}, [r2], r3 + vld1.8 {q2}, [r0], r1 // row 1 + vld1.8 {q3}, [r2], r3 + + vabdl.u8 q8, d0, d2 + vabdl.u8 q9, d1, d3 + vabal.u8 q8, d4, d6 + vabal.u8 q9, d5, d7 + mov r12, #(\h-2)/2 + +.loop_16x\h: + + subs r12, #1 + vld1.8 {q0}, [r0], r1 + vld1.8 {q1}, [r2], r3 + vld1.8 {q2}, [r0], r1 + vld1.8 {q3}, [r2], r3 + + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q8, d4, d6 + vabal.u8 q9, d5, d7 + bne .loop_16x\h + + vadd.u16 q8, q8, q9 +.if \h == 64 + vaddl.u16 q0, d16, d17 + vpadd.u32 d0, d0, d1 + vpadd.u32 d0, d0 +.else + vadd.u16 d16, d16, d17 + vpadd.u16 d0, d16, d16 + vpaddl.u16 d0, d0 +.endif + vmov.u32 r0, d0[0] + bx lr +endfunc +.endm + +SAD16_NEON 4 +SAD16_NEON 8 +SAD16_NEON 16 +SAD16_NEON 12 +SAD16_NEON 32 +SAD16_NEON 64 + +.macro SAD32_NEON h +function x265_pixel_sad_32x\h\()_neon + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + mov r12, #\h/8 + +.loop_32x\h: + + subs r12, #1 +.rept 4 + vld1.8 {q0, q1}, [r0], r1 // row 0 + vld1.8 {q2, q3}, [r2], r3 // row 0 + vld1.8 {q12, q13}, [r0], r1 // row 1 + vld1.8 
{q14, q15}, [r2], r3 // row 1 + + vabal.u8 q8, d0, d4 + vabal.u8 q9, d1, d5 + vabal.u8 q10, d2, d6 + vabal.u8 q11, d3, d7 + + vabal.u8 q8, d24, d28 + vabal.u8 q9, d25, d29 + vabal.u8 q10, d26, d30 + vabal.u8 q11, d27, d31 +.endr + bne .loop_32x\h + + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 +.if \h == 64 + vaddl.u16 q0, d16, d17 + vpadd.u32 d0, d0, d1 + vpaddl.u32 d0, d0 + + vaddl.u16 q1, d20, d21 + vpadd.u32 d2, d2, d3 + vpaddl.u32 d2, d2 + + vadd.u32 d0,d0,d2 +.else + vadd.u16 d16, d16, d17 + vpadd.u16 d0, d16, d16 + vpaddl.u16 d0, d0 + + vadd.u16 d20, d20, d21 + vpadd.u16 d1, d20, d20 + vpaddl.u16 d1, d1 + + vadd.u32 d0,d0,d1 +.endif + vmov.u32 r0, d0[0] + bx lr +endfunc +.endm + +SAD32_NEON 8 +SAD32_NEON 16 +SAD32_NEON 24 +SAD32_NEON 32 +SAD32_NEON 64 + +.macro SAD64_NEON h +function x265_pixel_sad_64x\h\()_neon + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + mov r12, #32 + sub r1, r12 + sub r3, r12 + mov r12, #\h/8 + +.loop_64x\h: + + subs r12, #1 +.rept 4 + // Columns 0-32 + vld1.8 {q0, q1}, [r0]! + vld1.8 {q2, q3}, [r2]! + vabal.u8 q8, d0, d4 + vabal.u8 q9, d1, d5 + vabal.u8 q10, d2, d6 + vabal.u8 q11, d3, d7 + // Columns 32-64 + vld1.8 {q0, q1}, [r0],r1 + vld1.8 {q2, q3}, [r2],r3 + vabal.u8 q8, d0, d4 + vabal.u8 q9, d1, d5 + vabal.u8 q10, d2, d6 + vabal.u8 q11, d3, d7 + // Columns 0-32 + vld1.8 {q12, q13}, [r0]! + vld1.8 {q14, q15}, [r2]! + vabal.u8 q8, d24, d28 + vabal.u8 q9, d25, d29 + vabal.u8 q10, d26, d30 + vabal.u8 q11, d27, d31 + // Columns 32-64 + vld1.8 {q12, q13}, [r0],r1 + vld1.8 {q14, q15}, [r2],r3 + vabal.u8 q8, d24, d28 + vabal.u8 q9, d25, d29 + vabal.u8 q10, d26, d30 + vabal.u8 q11, d27, d31 +.endr + bne .loop_64x\h + + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 + + vaddl.u16 q0, d16, d17 + vpadd.u32 d0, d0, d1 + vpaddl.u32 d0, d0 + + vaddl.u16 q1, d20, d21 + vpadd.u32 d2, d2, d3 + vpaddl.u32 d2, d2 + + vadd.u32 d0,d0,d2 + + vmov.u32 r0, d0[0] + bx lr +endfunc +.endm + +SAD64_NEON 16 +SAD64_NEON 32 +SAD64_NEON 48 +SAD64_NEON 64 + +function x265_pixel_sad_24x32_neon + veor.u8 q0, q0 + veor.u8 q1, q1 + veor.u8 q2, q2 + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + mov r12, #16 + sub r1, #16 + sub r3, #16 + mov r12, #8 + +.loop_24x32: + + subs r12, #1 +.rept 4 + vld1.8 {q0}, [r0]! + vld1.8 {q1}, [r2]! + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + + vld1.8 {d0}, [r0], r1 + vld1.8 {d1}, [r2], r3 + vabal.u8 q10, d0, d1 +.endr + bne .loop_24x32 + + vadd.u16 q8, q8, q9 + vadd.u16 d16, d16, d17 + vpadd.u16 d0, d16, d16 + vpaddl.u16 d0, d0 + vadd.u16 d20, d20, d21 + vpadd.u16 d1, d20, d20 + vpaddl.u16 d1, d1 + vadd.u32 d0,d0,d1 + vmov.u32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sad_48x64_neon + veor.u8 q3, q3 + veor.u8 q11, q11 + veor.u8 q12, q12 + veor.u8 q13, q13 + veor.u8 q14, q14 + veor.u8 q15, q15 + mov r12, #32 + sub r1, #32 + sub r3, #32 + mov r12, #16 + +.loop_48x64: + + subs r12, #1 +.rept 4 + vld1.8 {q0,q1}, [r0]! + vld1.8 {q2}, [r0], r1 + vld1.8 {q8,q9}, [r2]! 
+ vld1.8 {q10}, [r2], r3 + vabal.u8 q3, d0, d16 + vabal.u8 q11, d1, d17 + vabal.u8 q12, d2, d18 + vabal.u8 q13, d3, d19 + vabal.u8 q14, d4, d20 + vabal.u8 q15, d5, d21 +.endr + bne .loop_48x64 + + vadd.u16 q3, q3, q11 + vadd.u16 d6, d6, d7 + vpaddl.u16 d0, d6 + vpadd.u32 d0, d0 + + vadd.u16 q12, q12, q13 + vadd.u16 d24, d24, d25 + vpaddl.u16 d1, d24 + vpadd.u32 d1, d1 + + vadd.u16 q14,q14,q15 + vadd.u16 d28, d28, d29 + vpaddl.u16 d2, d28 + vpadd.u32 d2, d2 + + vadd.u32 d0, d0, d1 + vadd.u32 d0, d0, d2 + vmov.u32 r0, d0[0] + bx lr +endfunc + +// SAD_X3 and SAD_X4 code start + +.macro SAD_X_START_4 x + vld1.32 {d0[]}, [r0], r12 + vld1.32 {d1[]}, [r1], r4 + vld1.32 {d2[]}, [r2], r4 + vld1.32 {d3[]}, [r3], r4 +.if \x == 4 + vld1.32 {d4[]}, [lr], r4 +.endif + vabdl.u8 q8, d0, d1 + vabdl.u8 q9, d0, d2 + vabdl.u8 q10, d0, d3 +.if \x == 4 + vabdl.u8 q11, d0, d4 +.endif +.endm + +.macro SAD_X_4 x + vld1.32 {d0[]}, [r0], r12 + vld1.32 {d1[]}, [r1], r4 + vld1.32 {d2[]}, [r2], r4 + vld1.32 {d3[]}, [r3], r4 +.if \x == 4 + vld1.32 {d4[]}, [lr], r4 +.endif + vabal.u8 q8, d0, d1 + vabal.u8 q9, d0, d2 + vabal.u8 q10, d0, d3 +.if \x == 4 + vabal.u8 q11, d0, d4 +.endif +.endm + +.macro SAD_X_4xN x, h +function x265_sad_x\x\()_4x\h\()_neon + push {r4, r5, lr} +.if \x == 3 + ldrd r4, r5, [sp, #12] +.else + ldr lr, [sp, #12] + ldrd r4, r5, [sp, #16] +.endif + mov r12, #FENC_STRIDE + + SAD_X_START_4 \x +.rept \h - 1 + SAD_X_4 \x +.endr + vpadd.u16 d0, d16, d18 + vpadd.u16 d1, d20, d22 + vpaddl.u16 q0, q0 +.if \x == 3 + vst1.32 {d0}, [r5]! + vst1.32 {d1[0]}, [r5, :32] +.else + vst1.32 {d0-d1}, [r5] +.endif + pop {r4, r5, lr} + bx lr +endfunc +.endm + +SAD_X_4xN 3 4 +SAD_X_4xN 3 8 +SAD_X_4xN 3 16 + +SAD_X_4xN 4 4 +SAD_X_4xN 4 8 +SAD_X_4xN 4 16 + +.macro SAD_X_START_8 x + vld1.8 {d0}, [r0], r12 + vld1.8 {d1}, [r1], r4 + vld1.8 {d2}, [r2], r4 + vld1.8 {d3}, [r3], r4 +.if \x == 4 + vld1.8 {d4}, [lr], r4 +.endif + vabdl.u8 q8, d0, d1 + vabdl.u8 q9, d0, d2 + vabdl.u8 q10, d0, d3 +.if \x == 4 + vabdl.u8 q11, d0, d4 +.endif +.endm + +.macro SAD_X_8 x + vld1.8 {d0}, [r0], r12 + vld1.8 {d1}, [r1], r4 + vld1.8 {d2}, [r2], r4 + vld1.8 {d3}, [r3], r4 +.if \x == 4 + vld1.8 {d4}, [lr], r4 +.endif + vabal.u8 q8, d0, d1 + vabal.u8 q9, d0, d2 + vabal.u8 q10, d0, d3 +.if \x == 4 + vabal.u8 q11, d0, d4 +.endif +.endm + +.macro SAD_X_8xN x, h +function x265_sad_x\x\()_8x\h\()_neon + push {r4, r5, lr} +.if \x == 3 + ldrd r4, r5, [sp, #12] +.else + ldr lr, [sp, #12] + ldrd r4, r5, [sp, #16] +.endif + mov r12, #FENC_STRIDE + SAD_X_START_8 \x +.rept \h - 1 + SAD_X_8 \x +.endr + vadd.u16 d16, d16, d17 + vadd.u16 d18, d18, d19 + vadd.u16 d20, d20, d21 + vadd.u16 d22, d22, d23 + + vpadd.u16 d0, d16, d18 + vpadd.u16 d1, d20, d22 + vpaddl.u16 q0, q0 +.if \x == 3 + vst1.32 {d0}, [r5]! 
+ vst1.32 {d1[0]}, [r5, :32] +.else + vst1.32 {d0-d1}, [r5] +.endif + pop {r4, r5, lr} + bx lr +endfunc +.endm + +SAD_X_8xN 3 4 +SAD_X_8xN 3 8 +SAD_X_8xN 3 16 +SAD_X_8xN 3 32 + +SAD_X_8xN 4 4 +SAD_X_8xN 4 8 +SAD_X_8xN 4 16 +SAD_X_8xN 4 32 + +.macro SAD_X_START_16 x + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vabdl.u8 q8, d0, d2 + vabdl.u8 q9, d1, d3 + vabdl.u8 q10, d0, d4 + vabdl.u8 q11, d1, d5 + vabdl.u8 q12, d0, d6 + vabdl.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vabdl.u8 q14, d0, d6 + vabdl.u8 q15, d1, d7 +.endif +.endm + +.macro SAD_X_16 x + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif +.endm + +.macro SAD_X_16xN x, h +function x265_sad_x\x\()_16x\h\()_neon + push {r4, r5, lr} +.if \x == 3 + ldrd r4, r5, [sp, #12] +.else + ldr lr, [sp, #12] + ldrd r4, r5, [sp, #16] +.endif + mov r12, #FENC_STRIDE + SAD_X_START_16 \x +.rept \h - 1 + SAD_X_16 \x +.endr + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 + vadd.u16 q12, q12, q13 +.if \x == 4 + vadd.u16 q14, q14, q15 +.endif + vadd.u16 d16, d16, d17 + vadd.u16 d20, d20, d21 + vadd.u16 d24, d24, d25 +.if \x == 4 + vadd.u16 d28, d28, d29 +.endif + +.if \h <= 32 + vpadd.u16 d0, d16, d20 + vpadd.u16 d1, d24, d28 + vpaddl.u16 q0, q0 + .if \x == 3 + vst1.32 {d0}, [r5]! + vst1.32 {d1[0]}, [r5, :32] + .else + vst1.32 {d0-d1}, [r5] + .endif +.else + vpaddl.u16 d16, d16 + vpaddl.u16 d20, d20 + vpaddl.u16 d24, d24 + .if \x == 4 + vpaddl.u16 d28, d28 + .endif + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 + .if \x == 4 + vpaddl.u32 d28, d28 + .endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! + .if \x == 3 + vst1.32 {d24[0]}, [r5] + .endif + .if \x == 4 + vst1.32 {d24[0]}, [r5]! + vst1.32 {d28[0]}, [r5] + .endif +.endif + pop {r4, r5, lr} + bx lr +endfunc +.endm + +SAD_X_16xN 3 4 +SAD_X_16xN 3 12 + +SAD_X_16xN 4 4 +SAD_X_16xN 4 12 + +.macro SAD_X_16xN_LOOP x, h +function x265_sad_x\x\()_16x\h\()_neon + push {r4-r6, lr} +.if \x == 3 + ldrd r4, r5, [sp, #16] +.else + ldr lr, [sp, #16] + ldrd r4, r5, [sp, #20] +.endif + mov r12, #FENC_STRIDE + mov r6, #\h/8 + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + veor.u8 q12, q12 + veor.u8 q13, q13 +.if \x == 4 + veor.u8 q14, q14 + veor.u8 q15, q15 +.endif + +.loop_sad_x\x\()_16x\h: +.rept 8 + SAD_X_16 \x +.endr + subs r6, #1 + bne .loop_sad_x\x\()_16x\h + + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 + vadd.u16 q12, q12, q13 +.if \x == 4 + vadd.u16 q14, q14, q15 +.endif + vadd.u16 d16, d16, d17 + vadd.u16 d20, d20, d21 + vadd.u16 d24, d24, d25 +.if \x == 4 + vadd.u16 d28, d28, d29 +.endif + +.if \h <= 32 + vpadd.u16 d0, d16, d20 + vpadd.u16 d1, d24, d28 + vpaddl.u16 q0, q0 + .if \x == 3 + vst1.32 {d0}, [r5]! + vst1.32 {d1[0]}, [r5, :32] + .else + vst1.32 {d0-d1}, [r5] + .endif +.else + vpaddl.u16 d16, d16 + vpaddl.u16 d20, d20 + vpaddl.u16 d24, d24 + .if \x == 4 + vpaddl.u16 d28, d28 + .endif + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 + .if \x == 4 + vpaddl.u32 d28, d28 + .endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! + .if \x == 3 + vst1.32 {d24[0]}, [r5] + .endif + .if \x == 4 + vst1.32 {d24[0]}, [r5]! 
+ vst1.32 {d28[0]}, [r5] + .endif +.endif + pop {r4-r6, lr} + bx lr +endfunc +.endm + +SAD_X_16xN_LOOP 3 8 +SAD_X_16xN_LOOP 3 16 +SAD_X_16xN_LOOP 3 32 +SAD_X_16xN_LOOP 3 64 + +SAD_X_16xN_LOOP 4 8 +SAD_X_16xN_LOOP 4 16 +SAD_X_16xN_LOOP 4 32 +SAD_X_16xN_LOOP 4 64 + +.macro SAD_X_32 x + vld1.8 {q0}, [r0]! + vld1.8 {q1}, [r1]! + vld1.8 {q2}, [r2]! + vld1.8 {q3}, [r3]! + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr]! + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif +.endm + +.macro SAD_X_32xN x, h +function x265_sad_x\x\()_32x\h\()_neon + push {r4-r6, lr} +.if \x == 3 + ldrd r4, r5, [sp, #16] +.else + ldr lr, [sp, #16] + ldrd r4, r5, [sp, #20] +.endif + mov r12, #FENC_STRIDE + sub r12, #16 + sub r4, #16 + mov r6, #\h/8 + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + veor.u8 q12, q12 + veor.u8 q13, q13 +.if \x == 4 + veor.u8 q14, q14 + veor.u8 q15, q15 +.endif + +loop_sad_x\x\()_32x\h: +.rept 8 + SAD_X_32 \x +.endr + subs r6, #1 + bgt loop_sad_x\x\()_32x\h + +.if \h <= 32 + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 + vadd.u16 q12, q12, q13 + .if \x == 4 + vadd.u16 q14, q14, q15 + .endif + vadd.u16 d16, d16, d17 + vadd.u16 d20, d20, d21 + vadd.u16 d24, d24, d25 + .if \x == 4 + vadd.u16 d28, d28, d29 + .endif +.else + vpaddl.u16 q8, q8 + vpaddl.u16 q9, q9 + vpaddl.u16 q10, q10 + vpaddl.u16 q11, q11 + vpaddl.u16 q12, q12 + vpaddl.u16 q13, q13 + .if \x == 4 + vpaddl.u16 q14, q14 + vpaddl.u16 q15, q15 + .endif + vadd.u32 q8, q8, q9 + vadd.u32 q10, q10, q11 + vadd.u32 q12, q12, q13 + .if \x == 4 + vadd.u32 q14, q14, q15 + .endif + vadd.u32 d16, d16, d17 + vadd.u32 d20, d20, d21 + vadd.u32 d24, d24, d25 + .if \x == 4 + vadd.u32 d28, d28, d29 + .endif +.endif + +.if \h <= 16 + vpadd.u16 d0, d16, d20 + vpadd.u16 d1, d24, d28 + vpaddl.u16 q0, q0 + .if \x == 3 + vst1.32 {d0}, [r5]! + vst1.32 {d1[0]}, [r5, :32] + .else + vst1.32 {d0-d1}, [r5] + .endif +.elseif \h <= 32 + vpaddl.u16 d16, d16 + vpaddl.u16 d20, d20 + vpaddl.u16 d24, d24 + .if \x == 4 + vpaddl.u16 d28, d28 + .endif + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 + .if \x == 4 + vpaddl.u32 d28, d28 + .endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! + .if \x == 3 + vst1.32 {d24[0]}, [r5] + .endif + .if \x == 4 + vst1.32 {d24[0]}, [r5]! + vst1.32 {d28[0]}, [r5] + .endif +.elseif \h <= 64 + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 + .if \x == 4 + vpaddl.u32 d28, d28 + .endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! + .if \x == 3 + vst1.32 {d24[0]}, [r5] + .endif + .if \x == 4 + vst1.32 {d24[0]}, [r5]! + vst1.32 {d28[0]}, [r5] + .endif +.endif + pop {r4-r6, lr} + bx lr +endfunc +.endm + +SAD_X_32xN 3 8 +SAD_X_32xN 3 16 +SAD_X_32xN 3 24 +SAD_X_32xN 3 32 +SAD_X_32xN 3 64 + +SAD_X_32xN 4 8 +SAD_X_32xN 4 16 +SAD_X_32xN 4 24 +SAD_X_32xN 4 32 +SAD_X_32xN 4 64 + +.macro SAD_X_64 x +.rept 3 + vld1.8 {q0}, [r0]! + vld1.8 {q1}, [r1]! + vld1.8 {q2}, [r2]! + vld1.8 {q3}, [r3]! 
+ vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr]! + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif +.endr + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif +.endm + +.macro SAD_X_64xN x, h +function x265_sad_x\x\()_64x\h\()_neon + push {r4-r6, lr} +.if \x == 3 + ldrd r4, r5, [sp, #16] +.else + ldr lr, [sp, #16] + ldrd r4, r5, [sp, #20] +.endif + mov r12, #FENC_STRIDE + sub r12, #48 + sub r4, #48 + mov r6, #\h/8 + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + veor.u8 q12, q12 + veor.u8 q13, q13 +.if \x == 4 + veor.u8 q14, q14 + veor.u8 q15, q15 +.endif +.loop_sad_x\x\()_64x\h: +.rept 8 + SAD_X_64 \x +.endr + subs r6, #1 + bne .loop_sad_x\x\()_64x\h + +.if \h <= 16 + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 + vadd.u16 q12, q12, q13 + .if \x == 4 + vadd.u16 q14, q14, q15 + .endif + vadd.u16 d16, d16, d17 + vadd.u16 d20, d20, d21 + vadd.u16 d24, d24, d25 + .if \x == 4 + vadd.u16 d28, d28, d29 + .endif +.else + vpaddl.u16 q8, q8 + vpaddl.u16 q9, q9 + vpaddl.u16 q10, q10 + vpaddl.u16 q11, q11 + vpaddl.u16 q12, q12 + vpaddl.u16 q13, q13 + .if \x == 4 + vpaddl.u16 q14, q14 + vpaddl.u16 q15, q15 + .endif + vadd.u32 q8, q8, q9 + vadd.u32 q10, q10, q11 + vadd.u32 q12, q12, q13 + .if \x == 4 + vadd.u32 q14, q14, q15 + .endif + vadd.u32 d16, d16, d17 + vadd.u32 d20, d20, d21 + vadd.u32 d24, d24, d25 + .if \x == 4 + vadd.u32 d28, d28, d29 + .endif +.endif + +.if \h <= 16 + vpaddl.u16 d16, d16 + vpaddl.u16 d20, d20 + vpaddl.u16 d24, d24 + .if \x == 4 + vpaddl.u16 d28, d28 + .endif +.endif + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 +.if \x == 4 + vpaddl.u32 d28, d28 +.endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! +.if \x == 3 + vst1.32 {d24[0]}, [r5] +.endif +.if \x == 4 + vst1.32 {d24[0]}, [r5]! + vst1.32 {d28[0]}, [r5] +.endif + pop {r4-r6, lr} + bx lr +endfunc +.endm + +SAD_X_64xN 3 16 +SAD_X_64xN 3 32 +SAD_X_64xN 3 48 +SAD_X_64xN 3 64 + +SAD_X_64xN 4 16 +SAD_X_64xN 4 32 +SAD_X_64xN 4 48 +SAD_X_64xN 4 64 + +.macro SAD_X_48 x +.rept 2 + vld1.8 {q0}, [r0]! + vld1.8 {q1}, [r1]! + vld1.8 {q2}, [r2]! + vld1.8 {q3}, [r3]! + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr]! 
+ vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif +.endr + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif +.endm + +.macro SAD_X_48x64 x +function x265_sad_x\x\()_48x64_neon + push {r4-r6, lr} +.if \x == 3 + ldrd r4, r5, [sp, #16] +.else + ldr lr, [sp, #16] + ldrd r4, r5, [sp, #20] +.endif + mov r12, #FENC_STRIDE + sub r12, #32 + sub r4, #32 + mov r6, #8 + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + veor.u8 q12, q12 + veor.u8 q13, q13 +.if \x == 4 + veor.u8 q14, q14 + veor.u8 q15, q15 +.endif + +.loop_sad_x\x\()_48x64: +.rept 8 + SAD_X_48 \x +.endr + subs r6, #1 + bne .loop_sad_x\x\()_48x64 + + vpaddl.u16 q8, q8 + vpaddl.u16 q9, q9 + vpaddl.u16 q10, q10 + vpaddl.u16 q11, q11 + vpaddl.u16 q12, q12 + vpaddl.u16 q13, q13 +.if \x == 4 + vpaddl.u16 q14, q14 + vpaddl.u16 q15, q15 +.endif + vadd.u32 q8, q8, q9 + vadd.u32 q10, q10, q11 + vadd.u32 q12, q12, q13 +.if \x == 4 + vadd.u32 q14, q14, q15 +.endif + vadd.u32 d16, d16, d17 + vadd.u32 d20, d20, d21 + vadd.u32 d24, d24, d25 +.if \x == 4 + vadd.u32 d28, d28, d29 +.endif + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 + vpaddl.u32 d28, d28 +.if \x == 4 + vpaddl.u32 d28, d28 +.endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! +.if \x == 3 + vst1.32 {d24[0]}, [r5] +.endif +.if \x == 4 + vst1.32 {d24[0]}, [r5]! + vst1.32 {d28[0]}, [r5] +.endif + pop {r4-r6, lr} + bx lr +endfunc +.endm + +SAD_X_48x64 3 +SAD_X_48x64 4 + +.macro SAD_X_24 x + vld1.8 {q0}, [r0]! + vld1.8 {q1}, [r1]! + vld1.8 {q2}, [r2]! + vld1.8 {q3}, [r3]! + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q10, d0, d4 + vabal.u8 q11, d1, d5 + vabal.u8 q12, d0, d6 + vabal.u8 q13, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr]! + vabal.u8 q14, d0, d6 + vabal.u8 q15, d1, d7 +.endif + vld1.8 {d0}, [r0], r12 + vld1.8 {d1}, [r1], r4 + vld1.8 {d2}, [r2], r4 + vld1.8 {d3}, [r3], r4 +.if \x == 4 + vld1.8 {d8}, [lr], r4 +.endif + vabal.u8 q8, d0, d1 + vabal.u8 q10, d0, d2 + vabal.u8 q12, d0, d3 +.if \x == 4 + vabal.u8 q14, d0, d8 +.endif +.endm + +.macro SAD_X_24x32 x +function x265_sad_x\x\()_24x32_neon + push {r4-r6, lr} +.if \x == 3 + ldrd r4, r5, [sp, #16] +.else + ldr lr, [sp, #16] + ldrd r4, r5, [sp, #20] +.endif + mov r12, #FENC_STRIDE + sub r12, #16 + sub r4, #16 + mov r6, #4 + veor.u8 q8, q8 + veor.u8 q9, q9 + veor.u8 q10, q10 + veor.u8 q11, q11 + veor.u8 q12, q12 + veor.u8 q13, q13 +.if \x == 4 + veor.u8 q14, q14 + veor.u8 q15, q15 +.endif + +.loop_sad_x\x\()_24x32: +.rept 8 + SAD_X_24 \x +.endr + subs r6, #1 + bne .loop_sad_x\x\()_24x32 + + vadd.u16 q8, q8, q9 + vadd.u16 q10, q10, q11 + vadd.u16 q12, q12, q13 +.if \x == 4 + vadd.u16 q14, q14, q15 +.endif + vadd.u16 d16, d16, d17 + vadd.u16 d20, d20, d21 + vadd.u16 d24, d24, d25 +.if \x == 4 + vadd.u16 d28, d28, d29 +.endif + vpaddl.u16 d16, d16 + vpaddl.u16 d20, d20 + vpaddl.u16 d24, d24 +.if \x == 4 + vpaddl.u16 d28, d28 +.endif + vpaddl.u32 d16, d16 + vpaddl.u32 d20, d20 + vpaddl.u32 d24, d24 +.if \x == 4 + vpaddl.u32 d28, d28 +.endif +.if \x == 4 + vpaddl.u32 d28, d28 +.endif + vst1.32 {d16[0]}, [r5]! + vst1.32 {d20[0]}, [r5]! +.if \x == 3 + vst1.32 {d24[0]}, [r5] +.endif +.if \x == 4 + vst1.32 {d24[0]}, [r5]! 
+ vst1.32 {d28[0]}, [r5] +.endif + pop {r4-r6, lr} + bx lr +endfunc +.endm + +SAD_X_24x32 3 +SAD_X_24x32 4 + +// SAD_X3 and SAD_X4 code end + +.macro SAD_X_START_12 x + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vand.u8 q0, q15 + vand.u8 q1, q15 + vand.u8 q2, q15 + vand.u8 q3, q15 + vabdl.u8 q5, d0, d2 + vabdl.u8 q8, d1, d3 + vabdl.u8 q9, d0, d4 + vabdl.u8 q10, d1, d5 + vabdl.u8 q11, d0, d6 + vabdl.u8 q12, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vand.u8 q3, q15 + vabdl.u8 q13, d0, d6 + vabdl.u8 q14, d1, d7 +.endif +.endm + + +.macro SAD_X_12 x + vld1.8 {q0}, [r0], r12 + vld1.8 {q1}, [r1], r4 + vld1.8 {q2}, [r2], r4 + vld1.8 {q3}, [r3], r4 + vand.u8 q0, q15 + vand.u8 q1, q15 + vand.u8 q2, q15 + vand.u8 q3, q15 + vabal.u8 q5, d0, d2 + vabal.u8 q8, d1, d3 + vabal.u8 q9, d0, d4 + vabal.u8 q10, d1, d5 + vabal.u8 q11, d0, d6 + vabal.u8 q12, d1, d7 +.if \x == 4 + vld1.8 {q3}, [lr], r4 + vand.u8 q3, q15 + vabal.u8 q13, d0, d6 + vabal.u8 q14, d1, d7 +.endif +.endm + +.macro SAD_X_12x16 x +function x265_sad_x\x\()_12x16_neon + push {r4-r5, lr} + vpush {q5} +.if \x == 3 + ldrd r4, r5, [sp, #28] +.else + ldr lr, [sp, #28] + ldrd r4, r5, [sp, #32] +.endif + movrel r12, sad12_mask + vld1.8 {q15}, [r12] + mov r12, #FENC_STRIDE + + SAD_X_START_12 \x +.rept 15 + SAD_X_12 \x +.endr + vadd.u16 q5, q5, q8 + vadd.u16 q9, q9, q10 + vadd.u16 q11, q11, q12 +.if \x == 4 + vadd.u16 q13, q13, q14 +.endif + vadd.u16 d10, d10, d11 + vadd.u16 d18, d18, d19 + vadd.u16 d22, d22, d23 +.if \x == 4 + vadd.u16 d26, d26, d27 +.endif + vpadd.u16 d0, d10, d18 + vpadd.u16 d1, d22, d26 + vpaddl.u16 q0, q0 +.if \x == 3 + vst1.32 {d0}, [r5]! + vst1.32 {d1[0]}, [r5, :32] +.else + vst1.32 {d0-d1}, [r5] +.endif + vpop {q5} + pop {r4-r5, lr} + bx lr +endfunc +.endm + +SAD_X_12x16 3 +SAD_X_12x16 4 + +function x265_pixel_sad_12x16_neon + veor.u8 q8, q8 + veor.u8 q9, q9 + movrel r12, sad12_mask + vld1.8 {q15}, [r12] +.rept 8 + vld1.8 {q0}, [r0], r1 + vld1.8 {q1}, [r2], r3 + vand.u8 q0, q15 + vand.u8 q1, q15 + vld1.8 {q2}, [r0], r1 + vld1.8 {q3}, [r2], r3 + vand.u8 q2, q15 + vand.u8 q3, q15 + vabal.u8 q8, d0, d2 + vabal.u8 q9, d1, d3 + vabal.u8 q8, d4, d6 + vabal.u8 q9, d5, d7 +.endr + vadd.u16 q8, q8, q9 + vadd.u16 d16, d16, d17 + vpadd.u16 d0, d16, d16 + vpaddl.u16 d0, d0 + vmov.u32 r0, d0[0] + bx lr +endfunc +
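
Two details of the assembly above are worth noting. The SAD_X3/SAD_X4 macros amortize the loads of the encode block across three or four candidate references, and r12 is loaded with FENC_STRIDE because the encode block is cached at a fixed stride (64 in x265). The 12-wide kernels load 16 bytes per row and AND with sad12_mask (twelve 0xFF bytes followed by four zeros) to discard the extra column. A scalar sketch of the x4 contract, reusing the pixel typedef from the sketch above and not part of the patch (sad_x3 is identical minus fref3):

    // Scalar model of the sad_x4 primitives; 64 stands for FENC_STRIDE.
    template<int lx, int ly>
    void sad_x4(const pixel* fenc, const pixel* fref0, const pixel* fref1,
                const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res)
    {
        res[0] = res[1] = res[2] = res[3] = 0;
        for (int y = 0; y < ly; y++)
        {
            for (int x = 0; x < lx; x++)
            {
                res[0] += abs(fenc[x] - fref0[x]);
                res[1] += abs(fenc[x] - fref1[x]);
                res[2] += abs(fenc[x] - fref2[x]);
                res[3] += abs(fenc[x] - fref3[x]);
            }
            fenc  += 64;  // encode block is packed at the fixed FENC_STRIDE
            fref0 += frefstride;
            fref1 += frefstride;
            fref2 += frefstride;
            fref3 += frefstride;
        }
    }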
View file
x265_2.0.tar.gz/source/common/arm/ssd-a.S
Added
@@ -0,0 +1,469 @@ +/***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Dnyaneshwar G <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "asm.S" + +.section .rodata + +.align 4 + + +.text + + +function x265_pixel_sse_pp_4x4_neon + vld1.32 {d16[]}, [r0], r1 + vld1.32 {d17[]}, [r2], r3 + vsubl.u8 q2, d16, d17 + vld1.32 {d16[]}, [r0], r1 + vmull.s16 q0, d4, d4 + vld1.32 {d17[]}, [r2], r3 + + vsubl.u8 q2, d16, d17 + vld1.32 {d16[]}, [r0], r1 + vmlal.s16 q0, d4, d4 + vld1.32 {d17[]}, [r2], r3 + + vsubl.u8 q2, d16, d17 + vld1.32 {d16[]}, [r0], r1 + vmlal.s16 q0, d4, d4 + vld1.32 {d17[]}, [r2], r3 + + vsubl.u8 q2, d16, d17 + vmlal.s16 q0, d4, d4 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_pp_8x8_neon + vld1.64 {d16}, [r0], r1 + vld1.64 {d17}, [r2], r3 + vsubl.u8 q2, d16, d17 + vld1.64 {d16}, [r0], r1 + vmull.s16 q0, d4, d4 + vmlal.s16 q0, d5, d5 + vld1.64 {d17}, [r2], r3 + +.rept 6 + vsubl.u8 q2, d16, d17 + vld1.64 {d16}, [r0], r1 + vmlal.s16 q0, d4, d4 + vmlal.s16 q0, d5, d5 + vld1.64 {d17}, [r2], r3 +.endr + vsubl.u8 q2, d16, d17 + vmlal.s16 q0, d4, d4 + vmlal.s16 q0, d5, d5 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_pp_16x16_neon + vld1.64 {d16-d17}, [r0], r1 + vld1.64 {d18-d19}, [r2], r3 + vsubl.u8 q2, d16, d18 + vsubl.u8 q3, d17, d19 + vld1.64 {d16-d17}, [r0], r1 + vmull.s16 q0, d4, d4 + vmlal.s16 q0, d5, d5 + vld1.64 {d18-d19}, [r2], r3 + vmlal.s16 q0, d6, d6 + vmlal.s16 q0, d7, d7 + +.rept 14 + vsubl.u8 q2, d16, d18 + vsubl.u8 q3, d17, d19 + vld1.64 {d16-d17}, [r0], r1 + vmlal.s16 q0, d4, d4 + vmlal.s16 q0, d5, d5 + vld1.64 {d18-d19}, [r2], r3 + vmlal.s16 q0, d6, d6 + vmlal.s16 q0, d7, d7 +.endr + vsubl.u8 q2, d16, d18 + vsubl.u8 q3, d17, d19 + vmlal.s16 q0, d4, d4 + vmlal.s16 q0, d5, d5 + vmlal.s16 q0, d6, d6 + vmlal.s16 q0, d7, d7 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_pp_32x32_neon + mov r12, #8 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_sse_pp_32: + subs r12, #1 +.rept 4 + vld1.64 {q8-q9}, [r0], r1 + vld1.64 {q10-q11}, [r2], r3 + vsubl.u8 q2, d16, d20 + vsubl.u8 q3, d17, d21 + vsubl.u8 q12, d18, d22 + vsubl.u8 q13, d19, d23 + vmlal.s16 q0, d4, d4 + vmlal.s16 q1, d5, d5 + vmlal.s16 q0, d6, d6 + vmlal.s16 q1, d7, d7 + vmlal.s16 q0, d24, d24 + vmlal.s16 q1, d25, d25 + vmlal.s16 q0, d26, d26 + vmlal.s16 q1, d27, d27 +.endr + bne .loop_sse_pp_32 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + 
vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_pp_64x64_neon + sub r1, #32 + sub r3, #32 + mov r12, #16 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_sse_pp_64: + subs r12, #1 +.rept 4 + vld1.64 {q8-q9}, [r0]! + vld1.64 {q10-q11}, [r2]! + vsubl.u8 q2, d16, d20 + vsubl.u8 q3, d17, d21 + vsubl.u8 q12, d18, d22 + vsubl.u8 q13, d19, d23 + vmlal.s16 q0, d4, d4 + vmlal.s16 q1, d5, d5 + vmlal.s16 q0, d6, d6 + vmlal.s16 q1, d7, d7 + vmlal.s16 q0, d24, d24 + vmlal.s16 q1, d25, d25 + vmlal.s16 q0, d26, d26 + vmlal.s16 q1, d27, d27 + + vld1.64 {q8-q9}, [r0], r1 + vld1.64 {q10-q11}, [r2], r3 + vsubl.u8 q2, d16, d20 + vsubl.u8 q3, d17, d21 + vsubl.u8 q12, d18, d22 + vsubl.u8 q13, d19, d23 + vmlal.s16 q0, d4, d4 + vmlal.s16 q1, d5, d5 + vmlal.s16 q0, d6, d6 + vmlal.s16 q1, d7, d7 + vmlal.s16 q0, d24, d24 + vmlal.s16 q1, d25, d25 + vmlal.s16 q0, d26, d26 + vmlal.s16 q1, d27, d27 +.endr + bne .loop_sse_pp_64 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_ss_4x4_neon + add r1, r1 + add r3, r3 + + vld1.s16 {d16}, [r0], r1 + vld1.s16 {d18}, [r2], r3 + vsub.s16 q2, q8, q9 + vld1.s16 {d16}, [r0], r1 + vmull.s16 q0, d4, d4 + vld1.s16 {d18}, [r2], r3 + + vsub.s16 q2, q8, q9 + vld1.s16 {d16}, [r0], r1 + vmlal.s16 q0, d4, d4 + vld1.s16 {d18}, [r2], r3 + + vsub.s16 q2, q8, q9 + vld1.s16 {d16}, [r0], r1 + vmlal.s16 q0, d4, d4 + vld1.s16 {d18}, [r2], r3 + + vsub.s16 q2, q8, q9 + vmlal.s16 q0, d4, d4 + + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_ss_8x8_neon + add r1, r1 + add r3, r3 + + vld1.s16 {q8}, [r0], r1 + vld1.s16 {q9}, [r2], r3 + vsub.s16 q8, q9 + vmull.s16 q0, d16, d16 + vmull.s16 q1, d17, d17 + +.rept 7 + vld1.s16 {q8}, [r0], r1 + vld1.s16 {q9}, [r2], r3 + vsub.s16 q8, q9 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 +.endr + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_ss_16x16_neon + add r1, r1 + add r3, r3 + + mov r12, #4 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_sse_ss_16: + subs r12, #1 +.rept 4 + vld1.s16 {q8-q9}, [r0], r1 + vld1.s16 {q10-q11}, [r2], r3 + vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 +.endr + bne .loop_sse_ss_16 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_ss_32x32_neon + add r1, r1 + add r3, r3 + sub r1, #32 + sub r3, #32 + mov r12, #8 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_sse_ss_32: + subs r12, #1 +.rept 4 + vld1.s16 {q8-q9}, [r0]! + vld1.s16 {q10-q11}, [r2]! + vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 + + vld1.s16 {q8-q9}, [r0], r1 + vld1.s16 {q10-q11}, [r2], r3 + vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 +.endr + bne .loop_sse_ss_32 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_sse_ss_64x64_neon + add r1, r1 + add r3, r3 + sub r1, #96 + sub r3, #96 + mov r12, #32 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_sse_ss_64: + subs r12, #1 +.rept 2 + vld1.s16 {q8-q9}, [r0]! + vld1.s16 {q10-q11}, [r2]! 
+ vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 + + vld1.s16 {q8-q9}, [r0]! + vld1.s16 {q10-q11}, [r2]! + vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 + + vld1.s16 {q8-q9}, [r0]! + vld1.s16 {q10-q11}, [r2]! + vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 + + vld1.s16 {q8-q9}, [r0], r1 + vld1.s16 {q10-q11}, [r2], r3 + vsub.s16 q8, q10 + vsub.s16 q9, q11 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 +.endr + bne .loop_sse_ss_64 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_ssd_s_4x4_neon + add r1, r1 + vld1.s16 {d4}, [r0], r1 + vld1.s16 {d5}, [r0], r1 + vld1.s16 {d6}, [r0], r1 + vld1.s16 {d7}, [r0] + vmull.s16 q0, d4, d4 + vmull.s16 q1, d5, d5 + vmlal.s16 q0, d6, d6 + vmlal.s16 q1, d7, d7 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_ssd_s_8x8_neon + add r1, r1 + vld1.s16 {q8}, [r0], r1 + vld1.s16 {q9}, [r0], r1 + vmull.s16 q0, d16, d16 + vmull.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 +.rept 3 + vld1.s16 {q8}, [r0], r1 + vld1.s16 {q9}, [r0], r1 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 +.endr + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_ssd_s_16x16_neon + add r1, r1 + mov r12, #4 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_ssd_s_16: + subs r12, #1 +.rept 2 + vld1.s16 {q8-q9}, [r0], r1 + vld1.s16 {q10-q11}, [r0], r1 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 + vmlal.s16 q0, d20, d20 + vmlal.s16 q1, d21, d21 + vmlal.s16 q0, d22, d22 + vmlal.s16 q1, d23, d23 +.endr + bne .loop_ssd_s_16 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc + +function x265_pixel_ssd_s_32x32_neon + add r1, r1 + sub r1, #32 + mov r12, #8 + veor.u8 q0, q0 + veor.u8 q1, q1 + +.loop_ssd_s_32: + subs r12, #1 +.rept 4 + vld1.s16 {q8-q9}, [r0]! + vld1.s16 {q10-q11}, [r0], r1 + vmlal.s16 q0, d16, d16 + vmlal.s16 q1, d17, d17 + vmlal.s16 q0, d18, d18 + vmlal.s16 q1, d19, d19 + vmlal.s16 q0, d20, d20 + vmlal.s16 q1, d21, d21 + vmlal.s16 q0, d22, d22 + vmlal.s16 q1, d23, d23 +.endr + bne .loop_ssd_s_32 + vadd.s32 q0, q1 + vadd.s32 d0, d0, d1 + vpadd.s32 d0, d0, d0 + vmov.32 r0, d0[0] + bx lr +endfunc
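
The ssd kernels above implement three related contracts: sse_pp sums squared differences between two pixel blocks, sse_ss does the same over int16_t residual blocks (hence the "add r1, r1" prologues, which convert element strides to byte strides), and ssd_s sums the squares of a single residual block. A scalar sketch of sse_pp, not part of the patch; sse_t is assumed here to be uint32_t as in 8-bit builds:

    typedef uint32_t sse_t;  // assumption: matches the 8-bit-build typedef

    template<int lx, int ly>
    sse_t sse_pp(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
    {
        sse_t sum = 0;
        for (int y = 0; y < ly; y++)
        {
            for (int x = 0; x < lx; x++)
            {
                int d = pix1[x] - pix2[x];  // per-sample difference
                sum += d * d;               // accumulate squared error
            }
            pix1 += stride_pix1;
            pix2 += stride_pix2;
        }
        return sum;
    }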
View file
x265_1.9.tar.gz/source/common/common.cpp -> x265_2.0.tar.gz/source/common/common.cpp
Changed
@@ -29,6 +29,8 @@
 #if _WIN32
 #include <sys/types.h>
 #include <sys/timeb.h>
+#include <io.h>
+#include <fcntl.h>
 #else
 #include <sys/time.h>
 #endif
@@ -139,6 +141,94 @@
     fputs(buffer, stderr);
 }
 
+#if _WIN32
+/* For Unicode filenames in Windows we convert UTF-8 strings to UTF-16 and we use _w functions.
+ * For other OS we do not make any changes. */
+void general_log_file(const x265_param* param, const char* caller, int level, const char* fmt, ...)
+{
+    if (param && level > param->logLevel)
+        return;
+    const int bufferSize = 4096;
+    char buffer[bufferSize];
+    int p = 0;
+    const char* log_level;
+    switch (level)
+    {
+    case X265_LOG_ERROR:
+        log_level = "error";
+        break;
+    case X265_LOG_WARNING:
+        log_level = "warning";
+        break;
+    case X265_LOG_INFO:
+        log_level = "info";
+        break;
+    case X265_LOG_DEBUG:
+        log_level = "debug";
+        break;
+    case X265_LOG_FULL:
+        log_level = "full";
+        break;
+    default:
+        log_level = "unknown";
+        break;
+    }
+
+    if (caller)
+        p += sprintf(buffer, "%-4s [%s]: ", caller, log_level);
+    va_list arg;
+    va_start(arg, fmt);
+    vsnprintf(buffer + p, bufferSize - p, fmt, arg);
+    va_end(arg);
+
+    HANDLE console = GetStdHandle(STD_ERROR_HANDLE);
+    DWORD mode;
+    if (GetConsoleMode(console, &mode))
+    {
+        wchar_t buf_utf16[bufferSize];
+        int length_utf16 = MultiByteToWideChar(CP_UTF8, 0, buffer, -1, buf_utf16, sizeof(buf_utf16)/sizeof(wchar_t)) - 1;
+        if (length_utf16 > 0)
+            WriteConsoleW(console, buf_utf16, length_utf16, &mode, NULL);
+    }
+    else
+        fputs(buffer, stderr);
+}
+
+FILE* x265_fopen(const char* fileName, const char* mode)
+{
+    wchar_t buf_utf16[MAX_PATH * 2], mode_utf16[16];
+
+    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, fileName, -1, buf_utf16, sizeof(buf_utf16)/sizeof(wchar_t)) &&
+        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, mode_utf16, sizeof(mode_utf16)/sizeof(wchar_t)))
+    {
+        return _wfopen(buf_utf16, mode_utf16);
+    }
+    return NULL;
+}
+
+int x265_unlink(const char* fileName)
+{
+    wchar_t buf_utf16[MAX_PATH * 2];
+
+    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, fileName, -1, buf_utf16, sizeof(buf_utf16)/sizeof(wchar_t)))
+        return _wunlink(buf_utf16);
+
+    return -1;
+}
+
+int x265_rename(const char* oldName, const char* newName)
+{
+    wchar_t old_utf16[MAX_PATH * 2], new_utf16[MAX_PATH * 2];
+
+    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, oldName, -1, old_utf16, sizeof(old_utf16)/sizeof(wchar_t)) &&
+        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, newName, -1, new_utf16, sizeof(new_utf16)/sizeof(wchar_t)))
+    {
+        return _wrename(old_utf16, new_utf16);
+    }
+    return -1;
+}
+#endif
+
 double x265_ssim2dB(double ssim)
 {
     double inv_ssim = 1 - ssim;
@@ -177,10 +267,10 @@
     size_t fSize;
     char *buf = NULL;
 
-    FILE *fh = fopen(filename, "rb");
+    FILE *fh = x265_fopen(filename, "rb");
     if (!fh)
     {
-        x265_log(NULL, X265_LOG_ERROR, "unable to open file %s\n", filename);
+        x265_log_file(NULL, X265_LOG_ERROR, "unable to open file %s\n", filename);
         return NULL;
     }
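
These wrappers let call sites keep passing UTF-8 strings on every platform. A hypothetical caller, sketched only for illustration (replaceStatsFile, tmpName and fileName are invented names, not from this patch):

    #include <cstdio>
    #include "common.h"  // x265_fopen / x265_unlink / x265_rename / x265_log_file

    /* Hypothetical helper: update a UTF-8-named stats file by writing a
     * temp file and renaming it into place. */
    static bool replaceStatsFile(const char* fileName, const char* tmpName, const char* text)
    {
        FILE* f = x265_fopen(tmpName, "wb");
        if (!f)
            return false;
        fputs(text, f);
        fclose(f);
        x265_unlink(fileName);               /* _wrename() fails if the target exists */
        if (x265_rename(tmpName, fileName))  /* non-zero return means failure */
        {
            x265_log_file(NULL, X265_LOG_ERROR, "failed to rename %s\n", tmpName);
            return false;
        }
        return true;
    }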
View file
x265_1.9.tar.gz/source/common/common.h -> x265_2.0.tar.gz/source/common/common.h
Changed
@@ -322,6 +322,8 @@
 #define MAX_NUM_TR_COEFFS MAX_TR_SIZE * MAX_TR_SIZE // Maximum number of transform coefficients, for a 32x32 transform
 #define MAX_NUM_TR_CATEGORIES 16 // 32, 16, 8, 4 transform categories each for luma and chroma
 
+#define PIXEL_MAX ((1 << X265_DEPTH) - 1)
+
 namespace X265_NS {
 
 enum { SAO_NUM_OFFSET = 4 };
@@ -402,7 +404,19 @@
 /* located in common.cpp */
 int64_t  x265_mdate(void);
 #define  x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
+#define  x265_log_file(param, ...) general_log_file(param, "x265", __VA_ARGS__)
 void     general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...);
+#if _WIN32
+void     general_log_file(const x265_param* param, const char* caller, int level, const char* fmt, ...);
+FILE*    x265_fopen(const char* fileName, const char* mode);
+int      x265_unlink(const char* fileName);
+int      x265_rename(const char* oldName, const char* newName);
+#else
+#define  general_log_file(param, caller, level, fmt, ...) general_log(param, caller, level, fmt, __VA_ARGS__)
+#define  x265_fopen(fileName, mode) fopen(fileName, mode)
+#define  x265_unlink(fileName) unlink(fileName)
+#define  x265_rename(oldName, newName) rename(oldName, newName)
+#endif
 int      x265_exp2fix8(double x);
 double   x265_ssim2dB(double ssim);
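Note: on non-Windows builds these macros make the wrappers zero-cost aliases, so both lines below compile to the same call:

    FILE* a = x265_fopen("x265_2pass.log", "wb");  // expands to fopen(...) on POSIX
    FILE* b = fopen("x265_2pass.log", "wb");

One observation of ours, not from upstream: the fallback general_log_file(param, caller, level, fmt, ...) expands to general_log(param, caller, level, fmt, __VA_ARGS__), so a call with zero variadic arguments would leave a trailing comma after fmt; the call sites introduced by this patch always pass at least one argument, so the expansion is well-formed in practice.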
View file
x265_1.9.tar.gz/source/common/constants.cpp -> x265_2.0.tar.gz/source/common/constants.cpp
Changed
@@ -555,18 +555,6 @@ 0x38, }; -/* Contains how much to increment shared depth buffer for different ctu sizes to get next best depth - * here, depth 0 = 64x64, depth 1 = 32x32, depth 2 = 16x16 and depth 3 = 8x8 - * if ctu = 64, depth buffer size is 256 combination of depth values 0, 1, 2, 3 - * if ctu = 32, depth buffer size is 64 combination of depth values 1, 2, 3 - * if ctu = 16, depth buffer size is 16 combination of depth values 2, 3 */ -const uint32_t g_depthInc[3][4] = -{ - { 16, 4, 0, 0}, - { 64, 16, 4, 1}, - {256, 64, 16, 4} -}; - /* g_depthScanIdx [y][x] */ const uint32_t g_depthScanIdx[8][8] = { @@ -580,4 +568,236 @@ { 42, 43, 46, 47, 58, 59, 62, 63, } }; +/* Rec.2020 YUV to RGB Non-constant luminance */ +const double g_YUVtoRGB_BT2020[3][3] = +{ + { 1.00, 0.00, 1.47460, }, + { 1.00, -0.16455, -0.57135, }, + { 1.00, 1.88140, 0.00, } +}; + +const double g_ST2084_PQTable[MAX_HDR_LEGAL_RANGE - MIN_HDR_LEGAL_RANGE + 1] = +{ + 0, + 5.25912035416561E-05, 0.000170826479250824, 0.000342874260206259, 0.000565730978088069, + 0.000838361593599196, 0.0011605708550711, 0.00153261170332205, 0.00195500928122658, + 0.00242846920816411, 0.00295382484798614, 0.00353200479131171, 0.00416401171798929, + 0.00485090808272845, 0.00559380610060962, 0.00639386055422149, 0.00725226351560689, + 0.0081702404049783, 0.00914904700558975, 0.010189967177051, 0.0112943110883226, + 0.0124634138437419, 0.0136986344106386, 0.0150013547814312, 0.0163729793201926, + 0.0178149342559234, 0.0193286672936668, 0.0209156473211494, 0.022577364193536, + 0.0243153285825585, 0.0261310718791221, 0.0280261461406398, 0.0300021240760516, + 0.0320605990628007, 0.0342031851910785, 0.036431517331512, 0.0387472512230819, + 0.0411520635786705, 0.0436476522060052, 0.046235736142162, 0.0489180558000865, + 0.0516963731258075, 0.0545724717652363, 0.0575481572396137, 0.0606252571287911, + 0.0638056212616694, 0.0670911219131892, 0.0704836540073949, 0.0739851353261047, + 0.0775975067228409, 0.0813227323416811, 0.0851627998407477, 0.0891197206201265, + 0.0931955300539647, 0.0973922877266004, 0.101712077672541, 0.106157008620188, + 0.110729214239187, 0.115430853391267, 0.120264110384523, 0.125231195231086, + 0.130334343908053, 0.135575818621706, 0.140957908074883, 0.146482927737596, + 0.152153220120717, 0.157971155052834, 0.163939129960184, 0.170059570149691, + 0.176334929095073, 0.182767688726043, 0.189360359720598, 0.196115481800328, + 0.203035624028883, 0.210123385113499, 0.21738139370961, 0.224812308728624, + 0.232418819648774, 0.240203646829142, 0.248169541826838, 0.256319287717358, + 0.264655699418179, 0.273181624015456, 0.281899941094164, 0.29081356307129, + 0.299925435532481, 0.309238537571936, 0.318755882135647, 0.32848051636804, + 0.338415521962, 0.34856401551231, 0.358929148872555, 0.369514109515577, + 0.380322120897342, 0.391356442824469, 0.402620371825233, 0.414117241524302, + 0.425850423021013, 0.437823325271459, 0.450039395474131, 0.4625021194595, + 0.475215022083238, 0.488181667623337, 0.501405660181076, 0.514890644085913, + 0.528640304304275, 0.542658366852319, 0.556948599212766, 0.571514810755682, + 0.58636085316357, 0.601490620860234, 0.616908051444177, 0.632617126126042, + 0.648621870170268, 0.664926353341107, 0.681534690353104, 0.6984510413256, + 0.715679612242097, 0.733224655413817, 0.751090469947712, 0.769281402219399, + 0.78780184635024, 0.806656244689427, 0.82584908830055, 0.84538491745295, + 0.865268322117971, 0.885503942469945, 0.906096469391926, 0.927050644986733, + 0.948371263092526, 0.970063169803824, 
0.99213126399724, 1.01458049786256, + 1.03741587743901, 1.06064246315667, 1.08426537038311, 1.10828976997558, + 1.13272088883845, 1.1575640104859, 1.18282447561067, 1.20850768265765, + 1.23461908840365, 1.26116420854251, 1.28814861827608, 1.31557795291099, + 1.34345790846097, 1.37179424225547, 1.40059277355414, 1.42985938416685, + 1.45960001908056, 1.48982068709166, 1.52052746144494, 1.55172648047831, + 1.58342394827458, 1.61562613531883, 1.6483393791628, 1.68157008509547, + 1.71532472682031, 1.74960984713914, 1.78443205864284, 1.81979804440872, + 1.85571455870433, 1.8921884276992, 1.92922655018235, 1.9668358982877, + 2.0050235182263, 2.04379653102551, 2.0831621332761, 2.12312759788576, + 2.16370027484092, 2.20488759197549, 2.2466970557472, 2.28913625202187, + 2.33221284686502, 2.37593458734142, 2.42030930232274, 2.46534490330251, + 2.51104938521982, 2.55743082729067, 2.60449739384781, 2.65225733518805, + 2.70071898842928, 2.74989077837451, 2.79978121838576, 2.85039891126499, + 2.90175255014517, 2.95385091938954, 3.00670289549934, 3.06031744803115, + 3.11470364052283, 3.16987063142876, 3.22582767506471, 3.2825841225609, + 3.3401494228253, 3.39853312351689, 3.45774487202715, 3.51779441647257, + 3.57869160669604, 3.64044639527875, 3.7030688385618, 3.76656909767725, + 3.83095743959148, 3.89624423815599, 3.96243997517042, 4.02955524145598, + 4.09760073793895, 4.16658727674518, 4.2365257823051, 4.30742729247016, + 4.37930295964014, 4.45216405190141, 4.52602195417663, 4.60088816938553, + 4.67677431961831, 4.75369214731843, 4.83165351647993, 4.91067041385396, + 4.99075495016979, 5.07191936136577, 5.15417600983301, 5.23753738567282, + 5.32201610796449, 5.40762492604782, 5.49437672081637, 5.58228450602463, + 5.67136142960816, 5.76162077501684, 5.85307596256082, 5.94574055077076, + 6.03962823777015, 6.13475286266291, 6.2311284069342, 6.32876899586396, + 6.42768889995753, 6.5279025363866, 6.62942447044656, 6.73226941703026, + 6.83645224211186, 6.94198796425035, 7.04889175610325, 7.15717894596024, + 7.2668650192892, 7.37796562029657, 7.49049655350635, 7.60447378535363, + 7.71991344579293, 7.83683182992318, 7.95524539963073, 8.07517078524564, + 8.19662478721649, 8.31962437780235, 8.44418670277909, 8.57032908316786, + 8.69806901697162, 8.82742418094208, 8.95841243235119, 9.09105181078918, + 9.22536053997842, 9.36135702960081, 9.4990598771529, 9.63848786980913, + 9.77965998631185, 9.92259539887546, 10.0673134751131, 10.2138337799773, + 10.3621760777285, 10.5123603339148, 10.6644067173761, 10.8183356022682, + 10.9741675701064, 11.1319234118292, 11.2916241298841, 11.4532909403319, + 11.6169452749761, 11.782608783511, 11.9503033356888, 12.120051023515, + 12.2918741634627, 12.4657952987048, 12.6418372013776, 12.8200228748588, + 13.0003755560757, 13.1829187178276, 13.367676071144, 13.5546715676512, + 13.7439294019804, 13.9354740141834, 14.1293300921851, 14.3255225742508, + 14.5240766514895, 14.7250177703705, 14.9283716352778, 15.1341642110757, + 15.3424217257167, 15.5531706728631, 15.7664378145379, 15.9822501838117, + 16.2006350874992, 16.4216201089027, 16.6452331105667, 16.8715022370722, + 17.1004559178516, 17.3321228700381, 17.5665321013393, 17.8037129129401, + 18.0436949024415, 18.2865079668192, 18.5321823054235, 18.7807484229967, + 19.0322371327346, 19.2866795593684, 19.5441071422852, 19.8045516386728, + 20.068045126707, 20.3346200087623, 20.6043090146575, 20.8771452049349, + 21.1531619741772, 21.4323930543496, 21.7148725181833, 22.0006347825899, + 22.2897146121093, 22.5821471224015, 22.8779677837589, 
23.1772124246723, + 23.4799172354157, 23.7861187716811, 24.0958539582449, 24.4091600926726, + 24.7260748490581, 25.0466362818137, 25.3708828294739, 25.6988533185695, + 26.0305869675189, 26.3661233905639, 26.7055026017538, 27.0487650189598, + 27.3959514679386, 27.7471031864343, 28.1022618283194, 28.4614694677879, + 28.8247686035749, 29.1922021632471, 29.5638135074984, 29.9396464345297, + 30.3197451844465, 30.7041544437129, 31.0929193496474, 31.4860854949729, + 31.8836989324014, 32.2858061792735, 32.6924542222466, 33.1036905220286, + 33.5195630181606, 33.9401201338504, 34.3654107808513, 34.7954843644001, + 35.2303907882032, 35.6701804594619, 36.1149042939698, 36.5646137212482, + 37.0193606897411, 37.4791976720634, 37.944177670299, 38.4143542213633, + 38.8897814024065, 39.3705138362898, 39.8566066971106, 40.3481157157767, + 40.8450971856484, 41.3476079682522, 41.8557054990105, 42.369447793091, + 42.8888934512647, 43.4141016658423, 43.9451322266965, 44.4820455273072, + 45.0249025708978, 45.57376497661, 46.128694985791, 46.6897554682848, + 47.257009928828, 47.8305225135037, 48.4103580162663, 48.9965818855272, + 49.589260230802, 50.1884598294566, 50.794248133489, 51.4066932764077, + 52.0258640801652, 52.6518300621766, 53.2846614424041, 53.9244291505136, + 54.5712048331156, 55.2250608610794, 55.8860703369173, 56.5543071022513, + 57.2298457453516, 57.9127616087739, 58.6031307970611, 59.3010301845114, + 60.0065374230609, 60.7197309502355, 61.4406899971675, 62.1694945967356, + 62.9062255917496, 63.6509646432403, 64.4037942388625, 65.1647977013236, + 65.9340591969731, 66.7116637444152, 67.4976972232724, 68.2922463830112, + 69.0953988518382, 69.9072431457598, 70.7278686776501, 71.5573657664994, + 72.3958256466906, 73.2433404774142, 74.1000033521872, 74.9659083084248, + 75.8411503371909, 76.7258253929696, 77.6200304036002, 78.5238632802992, + 79.4374229277768, 80.3608092544678, 81.2941231828966, 82.2374666600933, + 83.1909426682048, 84.154655235138, 85.1287094453491, 86.1132114507694, + 87.108268481825, 88.1139888585565, 89.1304820019001, 90.1578584450571, + 91.1962298449948, 92.2457089940652, 93.3064098317639, 94.3784474565997, + 95.4619381380949, 96.5569993289116, 97.6637496771184, 98.7823090385655, + 99.9127984894415, 101.055340338899, 102.210058141845, 103.377076711919, + 104.556522134513, 105.748521780005, 106.953204317117, 108.170699726403, + 109.401139313892, 110.644655724874, 111.901382957862, 113.171456378648, + 114.455012734562, 115.752190168864, 117.063128235285, 118.387967912751, + 119.726851620228, 121.079923231788, 122.447328091724, 123.829213029981, + 125.225726377642, 126.637017982633, 128.063239225529, 129.504543035659, + 130.961083907258, 132.43301791588, 133.920502734926, 135.423697652396, + 136.942763587828, 138.477863109372, 140.029160451099, 141.596821530472, + 143.181013966024, 144.781907095212, 146.399671992475, 148.034481487503, + 149.686510183665, 151.355934476676, 153.042932573466, 154.747684511235, + 156.470372176717, 158.211179325695, 159.970291602654, 161.747896560765, + 163.544183681914, 165.359344397174, 167.193572107279, 169.047062203492, + 170.920012088617, 172.812621198221, 174.725091022243, 176.657625126586, + 178.610429175187, 180.583710952171, 182.577680384379, 184.59254956399, + 186.628532771569, 188.685846499193, 190.764709473972, 192.865342681753, + 194.987969391112, 197.13281517763, 199.300107948348, 201.490077966701, + 203.702957877374, 205.938982731875, 208.198390014006, 210.481419665809, + 212.788314113849, 215.119318295558, 217.474679686168, 
219.854648325694, + 222.259476846381, 224.689420500319, 227.144737187562, 229.625687484264, + 232.132534671514, 234.665544764103, 237.224986539876, 239.811131569336, + 242.424254245529, 245.064631814346, 247.73254440507, 250.428275061399, + 253.152109772633, 255.904337505438, 258.685250235678, 261.49514298094, + 264.334313833161, 267.203063991664, 270.101697796781, 273.03052276345, + 275.989849615675, 278.979992320954, 282.001268125309, 285.053997588697, + 288.138504620796, 291.255116517118, 294.404163995707, 297.585981234071, + 300.800905906628, 304.049279222569, 307.331445964095, 310.647754525259, + 313.998556950887, 317.384208976364, 320.805070067649, 324.26150346164, + 327.753876207298, 331.28255920701, 334.84792725845, 338.450359096983, + 342.090237438443, 345.767949022632, 349.483884657022, 353.238439261111, + 357.032011911288, 360.865005886229, 364.73782871259, 368.650892211681, + 372.604612546163, 376.59941026756, 380.635710364328, 384.713942310386, + 388.83454011424, 392.997942368521, 397.20459230049, 401.454937822634, + 405.749431584178, 410.088531023082, 414.47269841859, 418.902400944533, + 423.378110722949, 427.900304878816, 432.469465594816, 437.086080167171, + 441.750641062068, 446.463645972511, 451.225597876033, 456.037005092914, + 460.89838134554, 465.81024581748, 470.773123214509, 475.787543825096, + 480.854043582649, 485.973164127686, 491.14545287122, 496.371463058725, + 501.651753834779, 506.986890308486, 512.377443619739, 517.823991006384, + 523.32711587159, 528.887407852831, 534.505462890955, 540.181883300517, + 545.917277840779, 551.712261787277, 557.567457004939, 563.48349202123, + 569.461002100643, 575.500629320033, 581.603022644652, 587.76883800521, + 593.998738375827, 600.29339385279, 606.653481734616, 613.07968660232, + 619.572700401503, 626.133222524762, 632.761959895347, 639.459627051767, + 646.226946233466, 653.064647467273, 659.973468655012, 666.954155662449, + 674.007462408703, 681.134150957274, 688.334991607664, 695.610762988527, + 702.962252151562, 710.390254666907, 717.895574719168, 725.479025205175, + 733.141427832198, 740.883613218127, 748.706420992262, 756.610699897378, + 764.597307893424, 772.667112261926, 780.820989711908, 789.059826487117, + 797.384518474445, 805.79597131351, 814.295100508111, 822.882831538009, + 831.560099973222, 840.327851588798, 849.187042481472, 858.138639187298, + 867.183618801265, 876.322969097945, 885.557688653527, 894.88878696958, + 904.317284598324, 913.844213269149, 923.470616016881, 933.197547311661, + 943.02607318998, 952.957271387842, 962.99223147528, 973.13205499233, + 983.377855587028, 993.730759155025, 1004.19190398011, 1014.7624408779, + 1025.44353334027, 1036.23635768138, 1047.14210318612, 1058.16197226031, + 1069.29718058216, 1080.54895725615, 1091.91854496832, 1103.40720014439, + 1115.01619310819, 1126.74680824381, 1138.60034415848, 1150.57811384819, + 1162.68144486462, 1174.91167948465, 1187.27017488269, 1199.75830330268, + 1212.37745223534, 1225.12902459516, 1238.01443890053, 1251.03512945689, + 1264.19254654015, 1277.48815658428, 1290.92344237023, 1304.49990321753, + 1318.21905517769, 1332.0824312314, 1346.09158148618, 1360.24807337821, + 1374.55349187613, 1389.00943968636, 1403.61753746281, 1418.37942401772, + 1433.29675653564, 1448.37121079053, 1463.60448136459, 1478.99828187054, + 1494.55434517686, 1510.27442363459, 1526.16028930875, 1542.21373421151, + 1558.43657053802, 1574.8306309066, 1591.39776860023, 1608.13985781215, + 1625.05879389502, 1642.15649361107, 1659.43489538767, 1676.89595957601, + 
1694.54166871017, 1712.37402777397, 1730.39506446684, 1748.60682947636, + 1767.01139675239, 1785.61086378491, 1804.40735188573, 1823.40300647457, + 1842.59999736598, 1862.00051906422, 1881.60679105712, 1901.42105811765, + 1921.44559060702, 1941.68268478254, 1962.13466310849, 1982.80387457295, + 2003.69269500608, 2024.80352740423, 2046.13880225813, 2067.70097788409, + 2089.4925407609, 2111.51600586931, 2133.77391703832, 2156.2688472933, + 2179.00339921048, 2201.98020527506, 2225.20192824396, 2248.67126151315, + 2272.39092949114, 2296.36368797505, 2320.59232453288, 2345.07965889086, + 2369.82854332463, 2394.84186305701, 2420.1225366596, 2445.67351646045, + 2471.4977889564, 2497.5983752314, 2523.97833137945, 2550.64074893434, + 2577.58875530317, 2604.8255142071, 2632.35422612708, 2660.17812875505, + 2688.30049745283, 2716.72464571406, 2745.45392563483, 2774.49172838938, + 2803.84148471127, 2833.50666538283, 2863.49078172885, 2893.79738611828, + 2924.43007247227, 2955.39247677789, 2986.68827760926, 3018.32119665627, + 3050.29499925996, 3082.61349495315, 3115.28053801072, 3148.30002800544, + 3181.67591037289, 3215.41217698172, 3249.51286671181, 3283.98206604386, + 3318.8239096497, 3354.04258099714, 3389.64231295962, 3425.62738843341, + 3462.00214096588, 3498.770955389, 3535.93826846362, 3573.50856952949, + 3611.48640116911, 3649.87635987397, 3688.68309672536, 3727.91131807909, + 3767.56578626554, 3807.6513202933, 3848.17279656462, 3889.13514960257, + 3930.54337278366, 3972.40251908377, 4014.71770183098, 4057.49409547529, + 4100.73693635754, 4144.45152349895, 4188.64321939905, 4233.31745083673, + 4278.47970969433, 4324.13555378427, 4370.2906076885, 4416.9505636112, + 4464.12118224336, 4511.80829363585, 4560.01779808583, 4608.75566703869, + 4658.02794399743, 4707.84074544526, 4758.20026178446, 4809.11275828399, + 4860.58457604072, 4912.6221329584, 4965.23192473005, 5018.42052584652, + 5072.19459060902, 5126.56085415876, 5181.52613352201, 5237.09732866887, + 5293.28142358609, 5350.08548736398, 5407.51667529896, 5465.58223001341, + 5524.28948258769, 5583.64585370912, 5643.65885483892, 5704.33608939131, + 5765.68525393099, 5827.71413938938, 5890.43063229428, 5953.84271601949, + 6017.95847204743, 6082.78608125617, 6148.33382521752, 6214.610087517, + 6281.62335509419, 6349.38221959681, 6417.89537875378, 6487.17163777577, + 6557.21991076552, 6628.04922215295, 6699.66870814791, 6772.08761821761, + 6845.31531658155, 6919.36128372573, 6994.23511794429, 7069.94653689413, + 7146.5053791833, 7223.92160596987, 7302.20530258909, 7381.36668020537, + 7461.41607748598, 7542.36396229371, 7624.22093341411, 7706.99772229679, + 7790.70519482415, 7875.35435311374, 7960.95633733285, 8047.52242755054, + 8135.06404560776, 8223.5927570193, 8313.12027290238, 8403.65845193137, + 8495.21930231871, 8587.81498382941, 8681.45780982398, 8776.16024932246, + 8871.93492910726, 8968.79463585546, 9066.75231829962, 9165.82108941207, + 9266.0142286397, 9367.34518415456, 9469.8275751412, 9573.47519411942, + 9678.30200930089, 9784.32216698275, 9891.54999396144, 10000 +}; + }
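Note: the added table runs from 0 at the bottom of the 10-bit legal range (code 64) to 10000 at the top (code 940; see MIN_HDR_LEGAL_RANGE/MAX_HDR_LEGAL_RANGE in constants.h below), which matches the SMPTE ST 2084 (PQ) EOTF output in cd/m². A sketch of the closed form the table appears to precompute; the normalization of the code value is our inference from the endpoints, and the function name is illustrative:

    #include <cmath>
    #include <algorithm>

    // SMPTE ST 2084 EOTF: normalized code value n in [0,1] -> luminance in
    // cd/m^2 (0..10000). Spot check: n = 1/876 yields ~5.259e-05, matching
    // the table's second entry above.
    static double st2084_eotf(double n)
    {
        const double m1 = 2610.0 / 16384.0;
        const double m2 = 2523.0 / 4096.0 * 128.0;
        const double c1 = 3424.0 / 4096.0;
        const double c2 = 2413.0 / 4096.0 * 32.0;
        const double c3 = 2392.0 / 4096.0 * 32.0;

        n = std::min(1.0, std::max(0.0, n));
        double p = std::pow(n, 1.0 / m2);
        return 10000.0 * std::pow(std::max(p - c1, 0.0) / (c2 - c3 * p), 1.0 / m1);
    }

    // Presumed relation: g_ST2084_PQTable[code - 64] ==
    // st2084_eotf((code - 64) / 876.0) for code in [64, 940].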
View file
x265_1.9.tar.gz/source/common/constants.h -> x265_2.0.tar.gz/source/common/constants.h
Changed
@@ -96,9 +96,15 @@
 // Intra tables
 extern const uint8_t g_intraFilterFlags[NUM_INTRA_MODE];
 
-extern const uint32_t g_depthInc[3][4];
 extern const uint32_t g_depthScanIdx[8][8];
 
+extern const double g_YUVtoRGB_BT2020[3][3];
+
+#define MIN_HDR_LEGAL_RANGE 64
+#define MAX_HDR_LEGAL_RANGE 940
+#define CBCR_OFFSET 512
+extern const double g_ST2084_PQTable[MAX_HDR_LEGAL_RANGE - MIN_HDR_LEGAL_RANGE + 1];
+
 }
 
 #endif
View file
x265_1.9.tar.gz/source/common/contexts.h -> x265_2.0.tar.gz/source/common/contexts.h
Changed
@@ -117,196 +117,8 @@ #define sbacGetEntropyBits(S, V) (g_entropyBits[(S) ^ (V)]) #define sbacGetEntropyBitsTrm(V) (g_entropyBits[126 ^ (V)]) -#define MAX_NUM_CHANNEL_TYPE 2 - static const uint32_t ctxCbf[3][5] = { { 1, 0, 0, 0, 0 }, { 2, 3, 4, 5, 6 }, { 2, 3, 4, 5, 6 } }; -static const uint32_t significanceMapContextSetStart[MAX_NUM_CHANNEL_TYPE][3] = { { 0, 9, 21 }, { 0, 9, 12 } }; -static const uint32_t significanceMapContextSetSize[MAX_NUM_CHANNEL_TYPE][3] = { { 9, 12, 6 }, { 9, 3, 3 } }; -static const uint32_t nonDiagonalScan8x8ContextOffset[MAX_NUM_CHANNEL_TYPE] = { 6, 0 }; -static const uint32_t notFirstGroupNeighbourhoodContextOffset[MAX_NUM_CHANNEL_TYPE] = { 3, 0 }; - -// initial probability for cu_transquant_bypass flag -static const uint8_t INIT_CU_TRANSQUANT_BYPASS_FLAG[3][NUM_TQUANT_BYPASS_FLAG_CTX] = -{ - { 154 }, - { 154 }, - { 154 }, -}; - -// initial probability for split flag -static const uint8_t INIT_SPLIT_FLAG[3][NUM_SPLIT_FLAG_CTX] = -{ - { 107, 139, 126, }, - { 107, 139, 126, }, - { 139, 141, 157, }, -}; - -static const uint8_t INIT_SKIP_FLAG[3][NUM_SKIP_FLAG_CTX] = -{ - { 197, 185, 201, }, - { 197, 185, 201, }, - { CNU, CNU, CNU, }, -}; - -static const uint8_t INIT_MERGE_FLAG_EXT[3][NUM_MERGE_FLAG_EXT_CTX] = -{ - { 154, }, - { 110, }, - { CNU, }, -}; - -static const uint8_t INIT_MERGE_IDX_EXT[3][NUM_MERGE_IDX_EXT_CTX] = -{ - { 137, }, - { 122, }, - { CNU, }, -}; - -static const uint8_t INIT_PART_SIZE[3][NUM_PART_SIZE_CTX] = -{ - { 154, 139, 154, 154 }, - { 154, 139, 154, 154 }, - { 184, CNU, CNU, CNU }, -}; - -static const uint8_t INIT_PRED_MODE[3][NUM_PRED_MODE_CTX] = -{ - { 134, }, - { 149, }, - { CNU, }, -}; - -static const uint8_t INIT_INTRA_PRED_MODE[3][NUM_ADI_CTX] = -{ - { 183, }, - { 154, }, - { 184, }, -}; - -static const uint8_t INIT_CHROMA_PRED_MODE[3][NUM_CHROMA_PRED_CTX] = -{ - { 152, 139, }, - { 152, 139, }, - { 63, 139, }, -}; - -static const uint8_t INIT_INTER_DIR[3][NUM_INTER_DIR_CTX] = -{ - { 95, 79, 63, 31, 31, }, - { 95, 79, 63, 31, 31, }, - { CNU, CNU, CNU, CNU, CNU, }, -}; - -static const uint8_t INIT_MVD[3][NUM_MV_RES_CTX] = -{ - { 169, 198, }, - { 140, 198, }, - { CNU, CNU, }, -}; - -static const uint8_t INIT_REF_PIC[3][NUM_REF_NO_CTX] = -{ - { 153, 153 }, - { 153, 153 }, - { CNU, CNU }, -}; - -static const uint8_t INIT_DQP[3][NUM_DELTA_QP_CTX] = -{ - { 154, 154, 154, }, - { 154, 154, 154, }, - { 154, 154, 154, }, -}; - -static const uint8_t INIT_QT_CBF[3][NUM_QT_CBF_CTX] = -{ - { 153, 111, 149, 92, 167, 154, 154 }, - { 153, 111, 149, 107, 167, 154, 154 }, - { 111, 141, 94, 138, 182, 154, 154 }, -}; - -static const uint8_t INIT_QT_ROOT_CBF[3][NUM_QT_ROOT_CBF_CTX] = -{ - { 79, }, - { 79, }, - { CNU, }, -}; - -static const uint8_t INIT_LAST[3][NUM_CTX_LAST_FLAG_XY] = -{ - { 125, 110, 124, 110, 95, 94, 125, 111, 111, 79, 125, 126, 111, 111, 79, - 108, 123, 93 }, - { 125, 110, 94, 110, 95, 79, 125, 111, 110, 78, 110, 111, 111, 95, 94, - 108, 123, 108 }, - { 110, 110, 124, 125, 140, 153, 125, 127, 140, 109, 111, 143, 127, 111, 79, - 108, 123, 63 }, -}; - -static const uint8_t INIT_SIG_CG_FLAG[3][2 * NUM_SIG_CG_FLAG_CTX] = -{ - { 121, 140, - 61, 154, }, - { 121, 140, - 61, 154, }, - { 91, 171, - 134, 141, }, -}; - -static const uint8_t INIT_SIG_FLAG[3][NUM_SIG_FLAG_CTX] = -{ - { 170, 154, 139, 153, 139, 123, 123, 63, 124, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 170, 153, 138, 138, 122, 121, 122, 121, 167, 151, 183, 140, 151, 183, 140, }, - { 155, 154, 139, 153, 139, 123, 123, 63, 153, 166, 183, 140, 
136, 153, 154, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 170, 153, 123, 123, 107, 121, 107, 121, 167, 151, 183, 140, 151, 183, 140, }, - { 111, 111, 125, 110, 110, 94, 124, 108, 124, 107, 125, 141, 179, 153, 125, 107, 125, 141, 179, 153, 125, 107, 125, 141, 179, 153, 125, 140, 139, 182, 182, 152, 136, 152, 136, 153, 136, 139, 111, 136, 139, 111, }, -}; - -static const uint8_t INIT_ONE_FLAG[3][NUM_ONE_FLAG_CTX] = -{ - { 154, 196, 167, 167, 154, 152, 167, 182, 182, 134, 149, 136, 153, 121, 136, 122, 169, 208, 166, 167, 154, 152, 167, 182, }, - { 154, 196, 196, 167, 154, 152, 167, 182, 182, 134, 149, 136, 153, 121, 136, 137, 169, 194, 166, 167, 154, 167, 137, 182, }, - { 140, 92, 137, 138, 140, 152, 138, 139, 153, 74, 149, 92, 139, 107, 122, 152, 140, 179, 166, 182, 140, 227, 122, 197, }, -}; - -static const uint8_t INIT_ABS_FLAG[3][NUM_ABS_FLAG_CTX] = -{ - { 107, 167, 91, 107, 107, 167, }, - { 107, 167, 91, 122, 107, 167, }, - { 138, 153, 136, 167, 152, 152, }, -}; - -static const uint8_t INIT_MVP_IDX[3][NUM_MVP_IDX_CTX] = -{ - { 168 }, - { 168 }, - { CNU }, -}; - -static const uint8_t INIT_SAO_MERGE_FLAG[3][NUM_SAO_MERGE_FLAG_CTX] = -{ - { 153, }, - { 153, }, - { 153, }, -}; - -static const uint8_t INIT_SAO_TYPE_IDX[3][NUM_SAO_TYPE_IDX_CTX] = -{ - { 160, }, - { 185, }, - { 200, }, -}; - -static const uint8_t INIT_TRANS_SUBDIV_FLAG[3][NUM_TRANS_SUBDIV_FLAG_CTX] = -{ - { 224, 167, 122, }, - { 124, 138, 94, }, - { 153, 138, 138, }, -}; -static const uint8_t INIT_TRANSFORMSKIP_FLAG[3][2 * NUM_TRANSFORMSKIP_FLAG_CTX] = -{ - { 139, 139 }, - { 139, 139 }, - { 139, 139 }, -}; } #endif // ifndef X265_CONTEXTS_H
View file
x265_1.9.tar.gz/source/common/cpu.cpp -> x265_2.0.tar.gz/source/common/cpu.cpp
Changed
@@ -274,9 +274,9 @@
     if (!cache && max_basic_cap >= 2)
     {
         // Cache and TLB Information
-        static const char cache32_ids[] = { 0x0a, 0x0c, 0x41, 0x42, 0x43, 0x44, 0x45, 0x82, 0x83, 0x84, 0x85, 0 };
-        static const char cache64_ids[] = { 0x22, 0x23, 0x25, 0x29, 0x2c, 0x46, 0x47, 0x49, 0x60, 0x66, 0x67,
-                                            0x68, 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7c, 0x7f, 0x86, 0x87, 0 };
+        static const char cache32_ids[] = { '\x0a','\x0c','\x41','\x42','\x43','\x44','\x45','\x82','\x83','\x84','\x85','\0' };
+        static const char cache64_ids[] = { '\x22','\x23','\x25','\x29','\x2c','\x46','\x47','\x49','\x60','\x66','\x67',
+                                            '\x68','\x78','\x79','\x7a','\x7b','\x7c','\x7c','\x7f','\x86','\x87','\0' };
         uint32_t buf[4];
         int max, i = 0;
         do
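Note: this hunk is a build fix, not a cleanup. Values such as 0x82 (130) do not fit in a signed 8-bit char, and since C++11 a narrowing conversion inside a braced initializer is ill-formed, so newer GCC/Clang reject the old table; character literals sidestep the diagnostic because '\x82' already has type char. A minimal reproduction (ours, not from the x265 tree):

    // With char signed (as on x86), the first line draws a narrowing
    // error or warning under -std=c++11; the second does not.
    static const char bad[]  = { 0x82, 0 };
    static const char good[] = { '\x82', '\0' };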
View file
x265_1.9.tar.gz/source/common/cudata.cpp -> x265_2.0.tar.gz/source/common/cudata.cpp
Changed
@@ -480,7 +480,7 @@
 }
 
 /* The reverse of copyToPic, called only by encodeResidue */
-void CUData::copyFromPic(const CUData& ctu, const CUGeom& cuGeom, int csp)
+void CUData::copyFromPic(const CUData& ctu, const CUGeom& cuGeom, int csp, bool copyQp)
 {
     m_encData = ctu.m_encData;
     m_slice = ctu.m_slice;
@@ -491,7 +491,8 @@
     m_numPartitions = cuGeom.numPartitions;
 
     /* copy out all prediction info for this part */
-    m_partCopy((uint8_t*)m_qp, (uint8_t*)ctu.m_qp + m_absIdxInCTU);
+    if (copyQp) m_partCopy((uint8_t*)m_qp, (uint8_t*)ctu.m_qp + m_absIdxInCTU);
+
     m_partCopy(m_log2CUSize, ctu.m_log2CUSize + m_absIdxInCTU);
     m_partCopy(m_lumaIntraDir, ctu.m_lumaIntraDir + m_absIdxInCTU);
     m_partCopy(m_tqBypass, ctu.m_tqBypass + m_absIdxInCTU);
@@ -526,7 +527,7 @@
 }
 
 /* Only called by encodeResidue, these fields can be modified during inter/intra coding */
-void CUData::updatePic(uint32_t depth) const
+void CUData::updatePic(uint32_t depth, int picCsp) const
 {
     CUData& ctu = *m_encData->getPicCTU(m_cuAddr);
 
@@ -540,7 +541,7 @@
     uint32_t tmpY2 = m_absIdxInCTU << (LOG2_UNIT_SIZE * 2);
     memcpy(ctu.m_trCoeff[0] + tmpY2, m_trCoeff[0], sizeof(coeff_t)* tmpY);
 
-    if (ctu.m_chromaFormat != X265_CSP_I400)
+    if (ctu.m_chromaFormat != X265_CSP_I400 && picCsp != X265_CSP_I400)
     {
         m_partCopy(ctu.m_transformSkip[1] + m_absIdxInCTU, m_transformSkip[1]);
         m_partCopy(ctu.m_transformSkip[2] + m_absIdxInCTU, m_transformSkip[2]);
@@ -2088,6 +2089,7 @@
         cu->absPartIdx = g_depthScanIdx[yOffset][xOffset] * 4;
         cu->numPartitions = (NUM_4x4_PARTITIONS >> ((g_maxLog2CUSize - cu->log2CUSize) * 2));
         cu->depth = g_log2Size[maxCUSize] - log2CUSize;
+        cu->geomRecurId = cuIdx;
 
         cu->flags = 0;
         CU_SET_FLAG(cu->flags, CUGeom::PRESENT, presentFlag);
View file
x265_1.9.tar.gz/source/common/cudata.h -> x265_2.0.tar.gz/source/common/cudata.h
Changed
@@ -87,6 +87,7 @@
     uint32_t numPartitions; // Number of 4x4 blocks in the CU
     uint32_t flags;         // CU flags.
     uint32_t depth;         // depth of this CU relative from CTU
+    uint32_t geomRecurId;   // Unique geom id from 0 to MAX_GEOMS - 1 for every depth
 };
 
 struct MVField
@@ -222,8 +223,8 @@
     void     copyToPic(uint32_t depth) const;
 
     /* RD-0 methods called only from encodeResidue */
-    void     copyFromPic(const CUData& ctu, const CUGeom& cuGeom, int csp);
-    void     updatePic(uint32_t depth) const;
+    void     copyFromPic(const CUData& ctu, const CUGeom& cuGeom, int csp, bool copyQp = true);
+    void     updatePic(uint32_t depth, int picCsp) const;
 
     void     setPartSizeSubParts(PartSize size) { m_partSet(m_partSize, (uint8_t)size); }
     void     setPredModeSubParts(PredMode mode) { m_partSet(m_predMode, (uint8_t)mode); }
@@ -246,7 +247,7 @@
     void     setPURefIdx(int list, int8_t refIdx, int absPartIdx, int puIdx);
 
     uint8_t  getCbf(uint32_t absPartIdx, TextType ttype, uint32_t tuDepth) const { return (m_cbf[ttype][absPartIdx] >> tuDepth) & 0x1; }
-    uint8_t  getQtRootCbf(uint32_t absPartIdx) const { if (m_chromaFormat == X265_CSP_I400) return m_cbf[0][absPartIdx] || false; else { return m_cbf[0][absPartIdx] || m_cbf[1][absPartIdx] || m_cbf[2][absPartIdx];} }
+    bool     getQtRootCbf(uint32_t absPartIdx) const { return (m_cbf[0][absPartIdx] || ((m_chromaFormat != X265_CSP_I400) && (m_cbf[1][absPartIdx] || m_cbf[2][absPartIdx]))); }
     int8_t   getRefQP(uint32_t currAbsIdxInCTU) const;
     uint32_t getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField (*candMvField)[2], uint8_t* candDir) const;
     void     clipMv(MV& outMV) const;
View file
x265_1.9.tar.gz/source/common/deblock.cpp -> x265_2.0.tar.gz/source/common/deblock.cpp
Changed
@@ -319,27 +319,6 @@
     }
 }
 
-/* Deblocking of one line/column for the chrominance component
- * \param src     pointer to picture data
- * \param offset  offset value for picture data
- * \param tc      tc value
- * \param maskP   indicator to disable filtering on partP
- * \param maskQ   indicator to disable filtering on partQ */
-static inline void pelFilterChroma(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ)
-{
-    for (int32_t i = 0; i < UNIT_SIZE; i++, src += srcStep)
-    {
-        int16_t m4 = (int16_t)src[0];
-        int16_t m3 = (int16_t)src[-offset];
-        int16_t m5 = (int16_t)src[offset];
-        int16_t m2 = (int16_t)src[-offset * 2];
-
-        int32_t delta = x265_clip3(-tc, tc, ((((m4 - m3) * 4) + m2 - m5 + 4) >> 3));
-        src[-offset] = x265_clip(m3 + (delta & maskP));
-        src[0] = x265_clip(m4 - (delta & maskQ));
-    }
-}
-
 void Deblock::edgeFilterLuma(const CUData* cuQ, uint32_t absPartIdx, uint32_t depth, int32_t dir, int32_t edge, const uint8_t blockStrength[])
 {
     PicYuv* reconPic = cuQ->m_encData->m_reconPic;
@@ -517,7 +496,7 @@
                 int32_t tc = s_tcTable[indexTC] << bitdepthShift;
                 pixel* srcC = srcChroma[chromaIdx];
 
-                pelFilterChroma(srcC + unitOffset, srcStep, offset, tc, maskP, maskQ);
+                primitives.pelFilterChroma[dir](srcC + unitOffset, srcStep, offset, tc, maskP, maskQ);
             }
         }
     }
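Note: replacing the private static function with primitives.pelFilterChroma[dir] follows the usual x265 pattern: the C reference (moved into loopfilter.cpp below) is installed in the primitives table at setup, and CPU-specific setup may later overwrite the EDGE_VER (0) or EDGE_HOR (1) slot with assembly. A reduced model of that dispatch with simplified types; only pelFilterChroma and the 0/1 direction indexing come from the patch, the rest is illustrative:

    #include <cstddef>

    typedef void (*pelFilterChroma_t)(short* src, ptrdiff_t srcStep,
                                      ptrdiff_t offset, int tc,
                                      int maskP, int maskQ);

    static void pelFilterChroma_c(short*, ptrdiff_t, ptrdiff_t, int, int, int)
    { /* C reference body, as in loopfilter.cpp below */ }

    struct Primitives { pelFilterChroma_t pelFilterChroma[2]; };

    static void setupPrimitives(Primitives& p)
    {
        p.pelFilterChroma[0] = pelFilterChroma_c; // EDGE_VER
        p.pelFilterChroma[1] = pelFilterChroma_c; // EDGE_HOR: same C body,
                                                  // asm versions differ per dir
    }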
View file
x265_1.9.tar.gz/source/common/frame.cpp -> x265_2.0.tar.gz/source/common/frame.cpp
Changed
@@ -42,12 +42,14 @@
     m_prev = NULL;
     m_param = NULL;
     memset(&m_lowres, 0, sizeof(m_lowres));
+    m_rcData = NULL;
 }
 
 bool Frame::create(x265_param *param, float* quantOffsets)
 {
     m_fencPic = new PicYuv;
     m_param = param;
+    CHECKED_MALLOC_ZERO(m_rcData, RcStats, 1);
 
     if (m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
         m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode))
@@ -64,14 +66,17 @@
         return true;
     }
     return false;
+fail:
+    return false;
 }
 
 bool Frame::allocEncodeData(x265_param *param, const SPS& sps)
 {
     m_encData = new FrameData;
     m_reconPic = new PicYuv;
+    m_param = param;
     m_encData->m_reconPic = m_reconPic;
-    bool ok = m_encData->create(*param, sps) && m_reconPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp);
+    bool ok = m_encData->create(*param, sps, m_fencPic->m_picCsp) && m_reconPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp);
     if (ok)
     {
         /* initialize right border of m_reconpicYuv as SAO may read beyond the
@@ -139,4 +144,5 @@
     }
 
     m_lowres.destroy();
+    X265_FREE(m_rcData);
 }
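Note: the new fail: label is required, not stylistic: CHECKED_MALLOC_ZERO jumps there when the allocation fails. A hedged model of the macro pair (the real definitions live in common.h and additionally log the failed size; this shows only the control flow):

    // Model of x265's checked-allocation helpers: the enclosing function
    // must provide a local "fail:" label, which is exactly what this hunk
    // adds to Frame::create.
    #define CHECKED_MALLOC(var, type, count) \
        { var = (type*)malloc(sizeof(type) * (count)); if (!var) goto fail; }
    #define CHECKED_MALLOC_ZERO(var, type, count) \
        { var = (type*)calloc(count, sizeof(type)); if (!var) goto fail; }

The zeroing variant is presumably chosen so rate control can read RcStats fields before they are first written.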
View file
x265_1.9.tar.gz/source/common/frame.h -> x265_2.0.tar.gz/source/common/frame.h
Changed
@@ -37,6 +37,27 @@
 
 #define IS_REFERENCED(frame) (frame->m_lowres.sliceType != X265_TYPE_B)
 
+/* Ratecontrol statistics */
+struct RcStats
+{
+    double   qpaRc;
+    double   qpAq;
+    double   qRceq;
+    double   qpNoVbv;
+    double   newQScale;
+    double   iCuCount;
+    double   pCuCount;
+    double   skipCuCount;
+    double   qScale;
+    int      mvBits;
+    int      miscBits;
+    int      coeffBits;
+    int      poc;
+    int      encodeOrder;
+    int      sliceType;
+    int      keptAsRef;
+};
+
 class Frame
 {
 public:
@@ -49,6 +70,7 @@
     /* Data associated with x265_picture */
     PicYuv*            m_fencPic;
     int                m_poc;
+    int                m_encodeOrder;
     int64_t            m_pts;          // user provided presentation time stamp
     int64_t            m_reorderedPts;
     int64_t            m_dts;
@@ -71,6 +93,7 @@
     Frame*             m_prev;
     x265_param*        m_param;        // Points to the latest param set for the frame.
     x265_analysis_data m_analysisData;
+    RcStats*           m_rcData;
 
     Frame();
 
     bool create(x265_param *param, float* quantOffsets);
View file
x265_1.9.tar.gz/source/common/framedata.cpp -> x265_2.0.tar.gz/source/common/framedata.cpp
Changed
@@ -31,17 +31,18 @@
     memset(this, 0, sizeof(*this));
 }
 
-bool FrameData::create(const x265_param& param, const SPS& sps)
+bool FrameData::create(const x265_param& param, const SPS& sps, int csp)
 {
     m_param = &param;
     m_slice = new Slice;
     m_picCTU = new CUData[sps.numCUsInFrame];
+    m_picCsp = csp;
 
     m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame);
     for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++)
         m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param.internalCsp, ctuAddr);
 
-    CHECKED_MALLOC(m_cuStat, RCStatCU, sps.numCUsInFrame);
+    CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame);
     CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight);
     reinit(sps);
     return true;
View file
x265_1.9.tar.gz/source/common/framedata.h -> x265_2.0.tar.gz/source/common/framedata.h
Changed
@@ -146,10 +146,11 @@
     double         m_avgQpRc;    /* avg QP as decided by rate-control */
     double         m_avgQpAq;    /* avg QP as decided by AQ in addition to rate-control */
     double         m_rateFactor; /* calculated based on the Frame QP */
+    int            m_picCsp;
 
     FrameData();
 
-    bool create(const x265_param& param, const SPS& sps);
+    bool create(const x265_param& param, const SPS& sps, int csp);
     void reinit(const SPS& sps);
     void destroy();
     inline CUData* getPicCTU(uint32_t ctuAddr) { return &m_picCTU[ctuAddr]; }
@@ -168,10 +169,12 @@
 struct analysis_inter_data
 {
     MV*      mv;
+    WeightParam* wt;
     int32_t* ref;
     uint8_t* depth;
     uint8_t* modes;
-    uint32_t* bestMergeCand;
+    uint8_t* partSize;
+    uint8_t* mergeFlag;
 };
 }
 #endif // ifndef X265_FRAMEDATA_H
View file
x265_1.9.tar.gz/source/common/ipfilter.cpp -> x265_2.0.tar.gz/source/common/ipfilter.cpp
Changed
@@ -365,10 +365,10 @@
 template<int N, int width, int height>
 void interp_hv_pp_c(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY)
 {
-    short immedVals[(64 + 8) * (64 + 8)];
+    ALIGN_VAR_32(int16_t, immed[width * (height + N - 1)]);
 
-    interp_horiz_ps_c<N, width, height>(src, srcStride, immedVals, width, idxX, 1);
-    filterVertical_sp_c<N>(immedVals + 3 * width, width, dst, dstStride, width, height, idxY);
+    interp_horiz_ps_c<N, width, height>(src, srcStride, immed, width, idxX, 1);
+    filterVertical_sp_c<N>(immed + (N / 2 - 1) * width, width, dst, dstStride, width, height, idxY);
 }
 }
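Note: the new sizing is exact rather than worst-case. A horizontal-then-vertical N-tap filter needs width x (height + N - 1) intermediate samples, and the vertical pass begins (N/2 - 1) rows into the buffer; the old hard-coded "+ 3 * width" offset silently assumed the 8-tap luma filter. Worked numbers for both filter lengths:

    // Luma, N = 8, on a 16x16 block:
    constexpr int N = 8, width = 16, height = 16;
    constexpr int rows   = height + N - 1;       // 23 intermediate rows
    constexpr int bufLen = width * rows;         // 368 int16_t values
    constexpr int vOff   = (N / 2 - 1) * width;  // start at row 3 -> offset 48
    static_assert(bufLen == 368 && vOff == 48, "luma sizing");
    // Chroma, N = 4: (N / 2 - 1) = 1, where the old fixed offset of 3 rows
    // would have started the vertical pass in the wrong place.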
View file
x265_1.9.tar.gz/source/common/loopfilter.cpp -> x265_2.0.tar.gz/source/common/loopfilter.cpp
Changed
@@ -27,7 +27,6 @@
 #include "primitives.h"
 
 #define PIXEL_MIN 0
-#define PIXEL_MAX ((1 << X265_DEPTH) - 1)
 
 namespace {
 
@@ -158,6 +157,27 @@
         src[offset * 2] = (pixel)(x265_clip3(-tcQ, tcQ, ((m3 + m4 + m5 + 3 * m6 + 2 * m7 + 4) >> 3) - m6) + m6);
     }
 }
+
+/* Deblocking of one line/column for the chrominance component
+* \param src     pointer to picture data
+* \param offset  offset value for picture data
+* \param tc      tc value
+* \param maskP   indicator to disable filtering on partP
+* \param maskQ   indicator to disable filtering on partQ */
+static void pelFilterChroma_c(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ)
+{
+    for (int32_t i = 0; i < UNIT_SIZE; i++, src += srcStep)
+    {
+        int16_t m4 = (int16_t)src[0];
+        int16_t m3 = (int16_t)src[-offset];
+        int16_t m5 = (int16_t)src[offset];
+        int16_t m2 = (int16_t)src[-offset * 2];
+
+        int32_t delta = x265_clip3(-tc, tc, ((((m4 - m3) * 4) + m2 - m5 + 4) >> 3));
+        src[-offset] = x265_clip(m3 + (delta & maskP));
+        src[0] = x265_clip(m4 - (delta & maskQ));
+    }
+}
 }
 
 namespace X265_NS {
@@ -176,5 +196,7 @@
     // C code is same for EDGE_VER and EDGE_HOR only asm code is different
     p.pelFilterLumaStrong[0] = pelFilterLumaStrong_c;
     p.pelFilterLumaStrong[1] = pelFilterLumaStrong_c;
+    p.pelFilterChroma[0] = pelFilterChroma_c;
+    p.pelFilterChroma[1] = pelFilterChroma_c;
 }
 }
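Note: the relocated routine is the HEVC weak chroma deblocking filter: delta = clip3(-tc, tc, (4*(q0 - p0) + p1 - q1 + 4) >> 3), applied to the one pixel on each side of the edge unless masked off. In the code, m3/m4 are p0/q0 and m2/m5 are p1/q1. A worked example with assumed sample values:

    int p1 = 60, p0 = 62, q0 = 70, q1 = 71, tc = 4;
    int delta = ((q0 - p0) * 4 + p1 - q1 + 4) >> 3;  // (32 - 11 + 4) >> 3 = 3
    // clip3(-4, 4, 3) = 3, so p0' = 65 and q0' = 67: the step across the
    // edge shrinks from 8 to 2 while the ramp p1..q1 stays monotone.

maskP/maskQ are all-ones or all-zero words, so "delta & mask" disables filtering on one side without a branch.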
View file
x265_1.9.tar.gz/source/common/param.cpp -> x265_2.0.tar.gz/source/common/param.cpp
Changed
@@ -121,9 +121,9 @@ /* Source specifications */ param->internalBitDepth = X265_DEPTH; param->internalCsp = X265_CSP_I420; - - param->levelIdc = 0; - param->bHighTier = 0; + param->levelIdc = 0; //Auto-detect level + param->uhdBluray = 0; + param->bHighTier = 1; //Allow high tier by default param->interlaceMode = 0; param->bAnnexB = 1; param->bRepeatHeaders = 0; @@ -164,6 +164,7 @@ param->bEnableWeightedPred = 1; param->bEnableWeightedBiPred = 0; param->bEnableEarlySkip = 0; + param->bEnableRecursionSkip = 1; param->bEnableAMP = 0; param->bEnableRectInter = 0; param->rdLevel = 3; @@ -193,6 +194,7 @@ param->bLossless = 0; param->bCULossless = 0; param->bEnableTemporalSubLayers = 0; + param->bEnableRdRefine = 0; /* Rate control options */ param->rc.vbvMaxBitrate = 0; @@ -219,8 +221,9 @@ param->rc.qblur = 0.5; param->rc.zoneCount = 0; param->rc.zones = NULL; - param->rc.bEnableSlowFirstPass = 0; + param->rc.bEnableSlowFirstPass = 1; param->rc.bStrictCbr = 0; + param->rc.bEnableGrain = 0; /* Video Usability Information (VUI) */ param->vui.aspectRatioIdc = 0; @@ -245,7 +248,7 @@ param->maxCLL = 0; param->maxFALL = 0; param->minLuma = 0; - param->maxLuma = (1 << X265_DEPTH) - 1; + param->maxLuma = PIXEL_MAX; } int x265_param_default_preset(x265_param* param, const char* preset, const char* tune) @@ -408,9 +411,9 @@ param->maxNumMergeCand = 5; param->searchMethod = X265_STAR_SEARCH; param->bEnableTransformSkip = 1; + param->bEnableRecursionSkip = 0; param->maxNumReferences = 5; param->limitReferences = 0; - param->rc.bEnableSlowFirstPass = 1; param->bIntraInBFrames = 1; param->lookaheadSlices = 0; // disabled for best quality // TODO: optimized esa @@ -453,16 +456,16 @@ } else if (!strcmp(tune, "grain")) { - param->deblockingFilterBetaOffset = -2; - param->deblockingFilterTCOffset = -2; - param->bIntraInBFrames = 0; - param->rdoqLevel = 2; - param->psyRdoq = 10.0; - param->psyRd = 0.5; param->rc.ipFactor = 1.1; - param->rc.pbFactor = 1.1; - param->rc.aqStrength = 0.3; - param->rc.qCompress = 0.8; + param->rc.pbFactor = 1.0; + param->rc.cuTree = 0; + param->rc.aqMode = 0; + param->rc.qpStep = 1; + param->rc.bEnableGrain = 1; + param->bEnableRecursionSkip = 0; + param->psyRd = 4.0; + param->psyRdoq = 10.0; + param->bEnableSAO = 0; } else return -1; @@ -616,6 +619,7 @@ OPT("max-merge") p->maxNumMergeCand = (uint32_t)atoi(value); OPT("temporal-mvp") p->bEnableTemporalMvp = atobool(value); OPT("early-skip") p->bEnableEarlySkip = atobool(value); + OPT("rskip") p->bEnableRecursionSkip = atobool(value); OPT("rdpenalty") p->rdPenalty = atoi(value); OPT("tskip") p->bEnableTransformSkip = atobool(value); OPT("no-tskip-fast") p->bEnableTSkipFast = atobool(value); @@ -702,6 +706,7 @@ else p->psyRdoq = 0.0; } + OPT("rd-refine") p->bEnableRdRefine = atobool(value); OPT("signhide") p->bEnableSignHiding = atobool(value); OPT("b-intra") p->bIntraInBFrames = atobool(value); OPT("lft") p->bEnableLoopFilter = atobool(value); /* DEPRECATED */ @@ -757,6 +762,7 @@ p->rc.qp = atoi(value); p->rc.rateControlMode = X265_RC_CQP; } + OPT("rc-grain") p->rc.bEnableGrain = atobool(value); OPT("zones") { p->rc.zoneCount = 1; @@ -877,6 +883,7 @@ OPT("max-cll") bError |= sscanf(value, "%hu,%hu", &p->maxCLL, &p->maxFALL) != 2; OPT("min-luma") p->minLuma = (uint16_t)atoi(value); OPT("max-luma") p->maxLuma = (uint16_t)atoi(value); + OPT("uhd-bd") p->uhdBluray = atobool(value); else return X265_PARAM_BAD_NAME; #undef OPT @@ -1023,7 +1030,8 @@ { #define CHECK(expr, msg) check_failed |= _confirm(param, expr, msg) int check_failed = 0; /* 
abort if there is a fatal configuration problem */ - + CHECK(param->uhdBluray == 1 && (X265_DEPTH != 10 || param->internalCsp != 1 || param->interlaceMode != 0), + "uhd-bd: bit depth, chroma subsample, source picture type must be 10, 4:2:0, progressive"); CHECK(param->maxCUSize != 64 && param->maxCUSize != 32 && param->maxCUSize != 16, "max cu size must be 16, 32, or 64"); if (check_failed == 1) @@ -1096,7 +1104,7 @@ CHECK(param->rc.rateControlMode > X265_RC_CRF || param->rc.rateControlMode < X265_RC_ABR, "Rate control mode is out of range"); - CHECK(param->rdLevel < 0 || param->rdLevel > 6, + CHECK(param->rdLevel < 1 || param->rdLevel > 6, "RD Level is out of range"); CHECK(param->rdoqLevel < 0 || param->rdoqLevel > 2, "RDOQ Level is out of range"); @@ -1194,12 +1202,12 @@ CHECK(0 > param->noiseReductionIntra || param->noiseReductionIntra > 2000, "Valid noise reduction range 0 - 2000"); if (param->noiseReductionInter) CHECK(0 > param->noiseReductionInter || param->noiseReductionInter > 2000, "Valid noise reduction range 0 - 2000"); - CHECK(param->rc.rateControlMode == X265_RC_CRF && param->rc.bStatRead && param->rc.vbvMaxBitrate == 0, - "Constant rate-factor is incompatible with 2pass"); CHECK(param->rc.rateControlMode == X265_RC_CQP && param->rc.bStatRead, "Constant QP is incompatible with 2pass"); CHECK(param->rc.bStrictCbr && (param->rc.bitrate <= 0 || param->rc.vbvBufferSize <=0), "Strict-cbr cannot be applied without specifying target bitrate or vbv bufsize"); + CHECK(param->analysisMode && (param->analysisMode < X265_ANALYSIS_OFF || param->analysisMode > X265_ANALYSIS_LOAD), + "Invalid analysis mode. Analysis mode 0: OFF 1: SAVE : 2 LOAD"); return check_failed; } @@ -1225,18 +1233,21 @@ uint32_t maxLog2CUSize = (uint32_t)g_log2Size[param->maxCUSize]; uint32_t minLog2CUSize = (uint32_t)g_log2Size[param->minCUSize]; - if (ATOMIC_INC(&g_ctuSizeConfigured) > 1) + Lock gLock; + ScopedLock sLock(gLock); + + if (++g_ctuSizeConfigured > 1) { if (g_maxCUSize != param->maxCUSize) { - x265_log(param, X265_LOG_ERROR, "maxCUSize must be the same for all encoders in a single process"); - return -1; + x265_log(param, X265_LOG_WARNING, "maxCUSize must be the same for all encoders in a single process"); } if (g_maxCUDepth != maxLog2CUSize - minLog2CUSize) { - x265_log(param, X265_LOG_ERROR, "maxCUDepth must be the same for all encoders in a single process"); - return -1; + x265_log(param, X265_LOG_WARNING, "maxCUDepth must be the same for all encoders in a single process"); } + param->maxCUSize = g_maxCUSize; + return x265_check_params(param); /* Check again, since param may have changed */ } else { @@ -1302,8 +1313,9 @@ x265_log(param, X265_LOG_INFO, "Lookahead / bframes / badapt : %d / %d / %d\n", param->lookaheadDepth, param->bframes, param->bFrameAdaptive); x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb : %d / %d / %d\n", param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred); - x265_log(param, X265_LOG_INFO, "References / ref-limit cu / depth : %d / %d / %d\n", - param->maxNumReferences, !!(param->limitReferences & X265_REF_LIMIT_CU), !!(param->limitReferences & X265_REF_LIMIT_DEPTH)); + x265_log(param, X265_LOG_INFO, "References / ref-limit cu / depth : %d / %s / %s\n", + param->maxNumReferences, (param->limitReferences & X265_REF_LIMIT_CU) ? "on" : "off", + (param->limitReferences & X265_REF_LIMIT_DEPTH) ? 
"on" : "off"); if (param->rc.aqMode) x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree : %d / %0.1f / %d / %d\n", param->rc.aqMode, @@ -1336,7 +1348,9 @@ TOOLVAL(param->psyRd, "psy-rd=%.2lf"); TOOLVAL(param->rdoqLevel, "rdoq=%d"); TOOLVAL(param->psyRdoq, "psy-rdoq=%.2lf"); + TOOLOPT(param->bEnableRdRefine, "rd-refine"); TOOLOPT(param->bEnableEarlySkip, "early-skip"); + TOOLOPT(param->bEnableRecursionSkip, "rskip"); TOOLVAL(param->noiseReductionIntra, "nr-intra=%d"); TOOLVAL(param->noiseReductionInter, "nr-inter=%d"); TOOLOPT(param->bEnableTSkipFast, "tskip-fast"); @@ -1367,43 +1381,6 @@ fflush(stderr); } -void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam) -{ - if (!param || !reconfiguredParam) - return; - - x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n"); - - char buf[80] = { 0 }; - char tmp[40]; -#define TOOLCMP(COND1, COND2, STR, VAL) if (COND1 != COND2) { sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); } - TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, "ref=%d", reconfiguredParam->maxNumReferences); - TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "max-tu-size=%d", reconfiguredParam->maxTUSize); - TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "merange=%d", reconfiguredParam->searchRange); - TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme= %d", reconfiguredParam->subpelRefine); - TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d", reconfiguredParam->rdLevel); - TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf", reconfiguredParam->psyRd); - TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d", reconfiguredParam->rdoqLevel); - TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf", reconfiguredParam->psyRdoq); - TOOLCMP(param->noiseReductionIntra, reconfiguredParam->noiseReductionIntra, "nr-intra=%d", reconfiguredParam->noiseReductionIntra); - TOOLCMP(param->noiseReductionInter, reconfiguredParam->noiseReductionInter, "nr-inter=%d", reconfiguredParam->noiseReductionInter); - TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast, "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast); - TOOLCMP(param->bEnableSignHiding, reconfiguredParam->bEnableSignHiding, "signhide=%d", reconfiguredParam->bEnableSignHiding); - TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra, "fast-intra=%d", reconfiguredParam->bEnableFastIntra); - if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset != reconfiguredParam->deblockingFilterBetaOffset - || param->deblockingFilterTCOffset != reconfiguredParam->deblockingFilterTCOffset)) - { - sprintf(tmp, "deblock(tC=%d:B=%d)", param->deblockingFilterTCOffset, param->deblockingFilterBetaOffset); - appendtool(param, buf, sizeof(buf), tmp); - } - else - TOOLCMP(param->bEnableLoopFilter, reconfiguredParam->bEnableLoopFilter, "deblock=%d", reconfiguredParam->bEnableLoopFilter); - - TOOLCMP(param->bEnableTemporalMvp, reconfiguredParam->bEnableTemporalMvp, "tmvp=%d", reconfiguredParam->bEnableTemporalMvp); - TOOLCMP(param->bEnableEarlySkip, reconfiguredParam->bEnableEarlySkip, "early-skip=%d", reconfiguredParam->bEnableEarlySkip); - x265_log(param, X265_LOG_INFO, "tools:%s\n", buf); -} - char *x265_param2string(x265_param* p) { char *buf, *s; @@ -1413,7 +1390,7 @@ return NULL; #define BOOL(param, cliopt) \ - s += sprintf(s, " %s", (param) ? cliopt : "no-"cliopt); + s += sprintf(s, " %s", (param) ? 
cliopt : "no-" cliopt); s += sprintf(s, "%dx%d", p->sourceWidth,p->sourceHeight); s += sprintf(s, " fps=%u/%u", p->fpsNum, p->fpsDenom); @@ -1432,6 +1409,7 @@ s += sprintf(s, " max-merge=%d", p->maxNumMergeCand); BOOL(p->bEnableTemporalMvp, "temporal-mvp"); BOOL(p->bEnableEarlySkip, "early-skip"); + BOOL(p->bEnableRecursionSkip, "rskip"); s += sprintf(s, " rdpenalty=%d", p->rdPenalty); BOOL(p->bEnableTransformSkip, "tskip"); BOOL(p->bEnableTSkipFast, "tskip-fast"); @@ -1465,9 +1443,10 @@ s += sprintf(s, " psy-rd=%.2f", p->psyRd); s += sprintf(s, " rdoq-level=%d", p->rdoqLevel); s += sprintf(s, " psy-rdoq=%.2f", p->psyRdoq); + BOOL(p->bEnableRdRefine, "rd-refine"); BOOL(p->bEnableSignHiding, "signhide"); BOOL(p->bEnableLoopFilter, "deblock"); - if (p->bEnableLoopFilter && (p->deblockingFilterBetaOffset || p->deblockingFilterTCOffset)) + if (p->bEnableLoopFilter) s += sprintf(s, "=%d:%d", p->deblockingFilterTCOffset, p->deblockingFilterBetaOffset); BOOL(p->bEnableSAO, "sao"); BOOL(p->bSaoNonDeblocked, "sao-non-deblock");
View file
x265_1.9.tar.gz/source/common/param.h -> x265_2.0.tar.gz/source/common/param.h
Changed
@@ -30,7 +30,6 @@
 int x265_check_params(x265_param *param);
 int x265_set_globals(x265_param *param);
 void x265_print_params(x265_param *param);
-void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam);
 void x265_param_apply_fastfirstpass(x265_param *p);
 char* x265_param2string(x265_param *param);
 int x265_atoi(const char *str, bool& bError);
View file
x265_1.9.tar.gz/source/common/picyuv.cpp -> x265_2.0.tar.gz/source/common/picyuv.cpp
Changed
@@ -46,6 +46,10 @@ m_maxLumaLevel = 0; m_avgLumaLevel = 0; + m_stride = 0; + m_strideC = 0; + m_hChromaShift = 0; + m_vChromaShift = 0; } bool PicYuv::create(uint32_t picWidth, uint32_t picHeight, uint32_t picCsp) @@ -176,6 +180,7 @@ * warnings from valgrind about using uninitialized pixels */ padx++; pady++; + m_picCsp = pic.colorSpace; X265_CHECK(pic.bitDepth >= 8, "pic.bitDepth check failure"); @@ -190,7 +195,7 @@ primitives.planecopy_cp(yChar, pic.stride[0] / sizeof(*yChar), yPixel, m_stride, width, height, shift); - if (pic.colorSpace != X265_CSP_I400) + if (param.internalCsp != X265_CSP_I400) { pixel *uPixel = m_picOrg[1]; pixel *vPixel = m_picOrg[2]; @@ -216,7 +221,7 @@ yChar += pic.stride[0] / sizeof(*yChar); } - if (pic.colorSpace != X265_CSP_I400) + if (param.internalCsp != X265_CSP_I400) { pixel *uPixel = m_picOrg[1]; pixel *vPixel = m_picOrg[2]; @@ -258,7 +263,7 @@ primitives.planecopy_sp_shl(yShort, pic.stride[0] / sizeof(*yShort), yPixel, m_stride, width, height, shift, mask); } - if (pic.colorSpace != X265_CSP_I400) + if (param.internalCsp != X265_CSP_I400) { pixel *uPixel = m_picOrg[1]; pixel *vPixel = m_picOrg[2]; @@ -279,12 +284,25 @@ } } - /* extend the right edge if width was not multiple of the minimum CU size */ - uint64_t sumLuma; pixel *Y = m_picOrg[0]; - m_maxLumaLevel = primitives.planeClipAndMax(Y, m_stride, width, height, &sumLuma, (pixel)param.minLuma, (pixel)param.maxLuma); - m_avgLumaLevel = (double)(sumLuma) / (m_picHeight * m_picWidth); + pixel *U = m_picOrg[1]; + pixel *V = m_picOrg[2]; +#if HIGH_BIT_DEPTH + bool calcHDRParams = !!param.minLuma || (param.maxLuma != PIXEL_MAX); + /* Apply min/max luma bounds for HDR pixel manipulations */ + if (calcHDRParams) + { + X265_CHECK(pic.bitDepth == 10, "HDR stats can be applied/calculated only for 10bpp content"); + uint64_t sumLuma; + m_maxLumaLevel = primitives.planeClipAndMax(Y, m_stride, width, height, &sumLuma, (pixel)param.minLuma, (pixel)param.maxLuma); + m_avgLumaLevel = (double) sumLuma / (m_picHeight * m_picWidth); + } +#else + (void) param; +#endif + + /* extend the right edge if width was not multiple of the minimum CU size */ for (int r = 0; r < height; r++) { for (int x = 0; x < padx; x++) @@ -297,11 +315,8 @@ for (int i = 1; i <= pady; i++) memcpy(Y + i * m_stride, Y, (width + padx) * sizeof(pixel)); - if (pic.colorSpace != X265_CSP_I400) + if (param.internalCsp != X265_CSP_I400) { - pixel *U = m_picOrg[1]; - pixel *V = m_picOrg[2]; - for (int r = 0; r < height >> m_vChromaShift; r++) { for (int x = 0; x < padx >> m_hChromaShift; x++)
View file
x265_1.9.tar.gz/source/common/picyuv.h -> x265_2.0.tar.gz/source/common/picyuv.h
Changed
@@ -60,7 +60,7 @@
     uint32_t m_chromaMarginX;
     uint32_t m_chromaMarginY;
 
-    uint16_t m_maxLumaLevel;
+    pixel    m_maxLumaLevel;
     double   m_avgLumaLevel;
 
     PicYuv();
View file
x265_1.9.tar.gz/source/common/pixel.cpp -> x265_2.0.tar.gz/source/common/pixel.cpp
Changed
@@ -607,7 +607,6 @@
  * s1*s1, s2*s2, and s1*s2 also obtain this value for edge cases: ((2^10-1)*16*4)^2 = 4286582784.
  * Maximum value for 9-bit is: ss*64 = (2^9-1)^2*16*4*64 = 1069551616, which will not overflow. */
 
-#define PIXEL_MAX ((1 << X265_DEPTH) - 1)
 #if HIGH_BIT_DEPTH
     X265_CHECK((X265_DEPTH == 10) || (X265_DEPTH == 12), "ssim invalid depth\n");
 #define type float
@@ -873,7 +872,25 @@
     }
 }
 
-static pixel planeClipAndMax_c(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum, const pixel minPix, const pixel maxPix)
+/* Conversion between double and Q8.8 fixed point (big-endian) for storage */
+static void cuTreeFix8Pack(uint16_t *dst, double *src, int count)
+{
+    for (int i = 0; i < count; i++)
+        dst[i] = (uint16_t)(src[i] * 256.0);
+}
+
+static void cuTreeFix8Unpack(double *dst, uint16_t *src, int count)
+{
+    for (int i = 0; i < count; i++)
+    {
+        int16_t qpFix8 = src[i];
+        dst[i] = (double)(qpFix8) / 256.0;
+    }
+}
+
+#if HIGH_BIT_DEPTH
+static pixel planeClipAndMax_c(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum,
+                               const pixel minPix, const pixel maxPix)
 {
     pixel maxLumaLevel = 0;
     uint64_t sumLuma = 0;
@@ -882,21 +899,18 @@
     {
         for (int c = 0; c < width; c++)
         {
-            /* Clip luma of source picture to max and min values before extending edges of picYuv */
+            /* Clip luma of source picture to max and min*/
             src[c] = x265_clip3((pixel)minPix, (pixel)maxPix, src[c]);
-
-            /* Determine maximum and average luma level in a picture */
             maxLumaLevel = X265_MAX(src[c], maxLumaLevel);
             sumLuma += src[c];
         }
-
         src += stride;
     }
-
     *outsum = sumLuma;
     return maxLumaLevel;
 }
+#endif
 } // end anonymous namespace
 
 namespace X265_NS {
@@ -1181,7 +1195,11 @@
     p.planecopy_cp = planecopy_cp_c;
     p.planecopy_sp = planecopy_sp_c;
     p.planecopy_sp_shl = planecopy_sp_shl_c;
+#if HIGH_BIT_DEPTH
     p.planeClipAndMax = planeClipAndMax_c;
+#endif
     p.propagateCost = estimateCUPropagateCost;
+    p.fix8Unpack = cuTreeFix8Unpack;
+    p.fix8Pack = cuTreeFix8Pack;
 }
 }
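Note: the Q8.8 helpers let two-pass/cutree stats be stored in half the space: pack scales by 256 and truncates to 16 bits, unpack reinterprets the word as signed before dividing, so values round-trip within 1/256 as long as they stay inside [-128, 128). A standalone round-trip check using the same arithmetic (names mirror the statics above, but this snippet is ours):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        double qp = -2.71828;                  // e.g. a cutree QP offset
        int16_t fix8 = (int16_t)(qp * 256.0);  // Q8.8: -695, bit pattern 0xfd49
        double back = fix8 / 256.0;            // -2.71484375
        printf("%.5f -> 0x%04x -> %.8f\n", qp, (uint16_t)fix8, back);
        return 0;                              // |error| < 1/256
    }

(The "big-endian" remark in the upstream comment presumably refers to the on-disk stats layout handled elsewhere; these C helpers themselves are endian-neutral.)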
View file
x265_1.9.tar.gz/source/common/predict.cpp -> x265_2.0.tar.gz/source/common/predict.cpp
Changed
@@ -57,12 +57,10 @@ Predict::Predict() { - m_immedVals = NULL; } Predict::~Predict() { - X265_FREE(m_immedVals); m_predShortYuv[0].destroy(); m_predShortYuv[1].destroy(); } @@ -72,12 +70,8 @@ m_csp = csp; m_hChromaShift = CHROMA_H_SHIFT(csp); m_vChromaShift = CHROMA_V_SHIFT(csp); - CHECKED_MALLOC(m_immedVals, int16_t, 64 * (64 + NTAPS_LUMA - 1)); return m_predShortYuv[0].create(MAX_CU_SIZE, csp) && m_predShortYuv[1].create(MAX_CU_SIZE, csp); - -fail: - return false; } void Predict::motionCompensation(const CUData& cu, const PredictionUnit& pu, Yuv& predYuv, bool bLuma, bool bChroma) @@ -258,8 +252,8 @@ int partEnum = partitionFromSizes(pu.width, pu.height); const pixel* src = refPic.getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + srcOffset; - int xFrac = mv.x & 0x3; - int yFrac = mv.y & 0x3; + int xFrac = mv.x & 3; + int yFrac = mv.y & 3; if (!(yFrac | xFrac)) primitives.pu[partEnum].copy_pp(dst, dstStride, src, srcStride); @@ -280,14 +274,14 @@ intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride; const pixel* src = refPic.getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + srcOffset; - int xFrac = mv.x & 0x3; - int yFrac = mv.y & 0x3; - int partEnum = partitionFromSizes(pu.width, pu.height); X265_CHECK((pu.width % 4) + (pu.height % 4) == 0, "width or height not divisible by 4\n"); X265_CHECK(dstStride == MAX_CU_SIZE, "stride expected to be max cu size\n"); + int xFrac = mv.x & 3; + int yFrac = mv.y & 3; + if (!(yFrac | xFrac)) primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride); else if (!yFrac) @@ -296,11 +290,12 @@ primitives.pu[partEnum].luma_vps(src, srcStride, dst, dstStride, yFrac); else { - int tmpStride = pu.width; - int filterSize = NTAPS_LUMA; - int halfFilterSize = (filterSize >> 1); - primitives.pu[partEnum].luma_hps(src, srcStride, m_immedVals, tmpStride, xFrac, 1); - primitives.pu[partEnum].luma_vss(m_immedVals + (halfFilterSize - 1) * tmpStride, tmpStride, dst, dstStride, yFrac); + ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_LUMA - 1)]); + int immedStride = pu.width; + int halfFilterSize = NTAPS_LUMA >> 1; + + primitives.pu[partEnum].luma_hps(src, srcStride, immed, immedStride, xFrac, 1); + primitives.pu[partEnum].luma_vss(immed + (halfFilterSize - 1) * immedStride, immedStride, dst, dstStride, yFrac); } } @@ -309,10 +304,10 @@ intptr_t dstStride = dstYuv.m_csize; intptr_t refStride = refPic.m_strideC; - int shiftHor = (2 + m_hChromaShift); - int shiftVer = (2 + m_vChromaShift); + int mvx = mv.x << (1 - m_hChromaShift); + int mvy = mv.y << (1 - m_vChromaShift); - intptr_t refOffset = (mv.x >> shiftHor) + (mv.y >> shiftVer) * refStride; + intptr_t refOffset = (mvx >> 3) + (mvy >> 3) * refStride; const pixel* refCb = refPic.getCbAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + refOffset; const pixel* refCr = refPic.getCrAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + refOffset; @@ -320,11 +315,11 @@ pixel* dstCb = dstYuv.getCbAddr(pu.puAbsPartIdx); pixel* dstCr = dstYuv.getCrAddr(pu.puAbsPartIdx); - int xFrac = mv.x & ((1 << shiftHor) - 1); - int yFrac = mv.y & ((1 << shiftVer) - 1); - int partEnum = partitionFromSizes(pu.width, pu.height); - + + int xFrac = mvx & 7; + int yFrac = mvy & 7; + if (!(yFrac | xFrac)) { primitives.chroma[m_csp].pu[partEnum].copy_pp(dstCb, dstStride, refCb, refStride); @@ -332,37 +327,36 @@ } else if (!yFrac) { - primitives.chroma[m_csp].pu[partEnum].filter_hpp(refCb, refStride, dstCb, dstStride, xFrac << (1 - m_hChromaShift)); - 
primitives.chroma[m_csp].pu[partEnum].filter_hpp(refCr, refStride, dstCr, dstStride, xFrac << (1 - m_hChromaShift)); + primitives.chroma[m_csp].pu[partEnum].filter_hpp(refCb, refStride, dstCb, dstStride, xFrac); + primitives.chroma[m_csp].pu[partEnum].filter_hpp(refCr, refStride, dstCr, dstStride, xFrac); } else if (!xFrac) { - primitives.chroma[m_csp].pu[partEnum].filter_vpp(refCb, refStride, dstCb, dstStride, yFrac << (1 - m_vChromaShift)); - primitives.chroma[m_csp].pu[partEnum].filter_vpp(refCr, refStride, dstCr, dstStride, yFrac << (1 - m_vChromaShift)); + primitives.chroma[m_csp].pu[partEnum].filter_vpp(refCb, refStride, dstCb, dstStride, yFrac); + primitives.chroma[m_csp].pu[partEnum].filter_vpp(refCr, refStride, dstCr, dstStride, yFrac); } else { - int extStride = pu.width >> m_hChromaShift; - int filterSize = NTAPS_CHROMA; - int halfFilterSize = (filterSize >> 1); - - primitives.chroma[m_csp].pu[partEnum].filter_hps(refCb, refStride, m_immedVals, extStride, xFrac << (1 - m_hChromaShift), 1); - primitives.chroma[m_csp].pu[partEnum].filter_vsp(m_immedVals + (halfFilterSize - 1) * extStride, extStride, dstCb, dstStride, yFrac << (1 - m_vChromaShift)); - - primitives.chroma[m_csp].pu[partEnum].filter_hps(refCr, refStride, m_immedVals, extStride, xFrac << (1 - m_hChromaShift), 1); - primitives.chroma[m_csp].pu[partEnum].filter_vsp(m_immedVals + (halfFilterSize - 1) * extStride, extStride, dstCr, dstStride, yFrac << (1 - m_vChromaShift)); + ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_CHROMA - 1)]); + int immedStride = pu.width >> m_hChromaShift; + int halfFilterSize = NTAPS_CHROMA >> 1; + + primitives.chroma[m_csp].pu[partEnum].filter_hps(refCb, refStride, immed, immedStride, xFrac, 1); + primitives.chroma[m_csp].pu[partEnum].filter_vsp(immed + (halfFilterSize - 1) * immedStride, immedStride, dstCb, dstStride, yFrac); + primitives.chroma[m_csp].pu[partEnum].filter_hps(refCr, refStride, immed, immedStride, xFrac, 1); + primitives.chroma[m_csp].pu[partEnum].filter_vsp(immed + (halfFilterSize - 1) * immedStride, immedStride, dstCr, dstStride, yFrac); } } void Predict::predInterChromaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const { - intptr_t refStride = refPic.m_strideC; intptr_t dstStride = dstSYuv.m_csize; + intptr_t refStride = refPic.m_strideC; - int shiftHor = (2 + m_hChromaShift); - int shiftVer = (2 + m_vChromaShift); + int mvx = mv.x << (1 - m_hChromaShift); + int mvy = mv.y << (1 - m_vChromaShift); - intptr_t refOffset = (mv.x >> shiftHor) + (mv.y >> shiftVer) * refStride; + intptr_t refOffset = (mvx >> 3) + (mvy >> 3) * refStride; const pixel* refCb = refPic.getCbAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + refOffset; const pixel* refCr = refPic.getCrAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) + refOffset; @@ -370,15 +364,15 @@ int16_t* dstCb = dstSYuv.getCbAddr(pu.puAbsPartIdx); int16_t* dstCr = dstSYuv.getCrAddr(pu.puAbsPartIdx); - int xFrac = mv.x & ((1 << shiftHor) - 1); - int yFrac = mv.y & ((1 << shiftVer) - 1); - int partEnum = partitionFromSizes(pu.width, pu.height); uint32_t cxWidth = pu.width >> m_hChromaShift; X265_CHECK(((cxWidth | (pu.height >> m_vChromaShift)) % 2) == 0, "chroma block size expected to be multiple of 2\n"); + int xFrac = mvx & 7; + int yFrac = mvy & 7; + if (!(yFrac | xFrac)) { primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride); @@ -386,23 +380,24 @@ } else if (!yFrac) { - primitives.chroma[m_csp].pu[partEnum].filter_hps(refCb, refStride, 
dstCb, dstStride, xFrac << (1 - m_hChromaShift), 0); - primitives.chroma[m_csp].pu[partEnum].filter_hps(refCr, refStride, dstCr, dstStride, xFrac << (1 - m_hChromaShift), 0); + primitives.chroma[m_csp].pu[partEnum].filter_hps(refCb, refStride, dstCb, dstStride, xFrac, 0); + primitives.chroma[m_csp].pu[partEnum].filter_hps(refCr, refStride, dstCr, dstStride, xFrac, 0); } else if (!xFrac) { - primitives.chroma[m_csp].pu[partEnum].filter_vps(refCb, refStride, dstCb, dstStride, yFrac << (1 - m_vChromaShift)); - primitives.chroma[m_csp].pu[partEnum].filter_vps(refCr, refStride, dstCr, dstStride, yFrac << (1 - m_vChromaShift)); + primitives.chroma[m_csp].pu[partEnum].filter_vps(refCb, refStride, dstCb, dstStride, yFrac); + primitives.chroma[m_csp].pu[partEnum].filter_vps(refCr, refStride, dstCr, dstStride, yFrac); } else { - int extStride = cxWidth; - int filterSize = NTAPS_CHROMA; - int halfFilterSize = (filterSize >> 1); - primitives.chroma[m_csp].pu[partEnum].filter_hps(refCb, refStride, m_immedVals, extStride, xFrac << (1 - m_hChromaShift), 1); - primitives.chroma[m_csp].pu[partEnum].filter_vss(m_immedVals + (halfFilterSize - 1) * extStride, extStride, dstCb, dstStride, yFrac << (1 - m_vChromaShift)); - primitives.chroma[m_csp].pu[partEnum].filter_hps(refCr, refStride, m_immedVals, extStride, xFrac << (1 - m_hChromaShift), 1); - primitives.chroma[m_csp].pu[partEnum].filter_vss(m_immedVals + (halfFilterSize - 1) * extStride, extStride, dstCr, dstStride, yFrac << (1 - m_vChromaShift)); + ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_CHROMA - 1)]); + int immedStride = cxWidth; + int halfFilterSize = NTAPS_CHROMA >> 1; + + primitives.chroma[m_csp].pu[partEnum].filter_hps(refCb, refStride, immed, immedStride, xFrac, 1); + primitives.chroma[m_csp].pu[partEnum].filter_vss(immed + (halfFilterSize - 1) * immedStride, immedStride, dstCb, dstStride, yFrac); + primitives.chroma[m_csp].pu[partEnum].filter_hps(refCr, refStride, immed, immedStride, xFrac, 1); + primitives.chroma[m_csp].pu[partEnum].filter_vss(immed + (halfFilterSize - 1) * immedStride, immedStride, dstCr, dstStride, yFrac); } }
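This hunk makes two related changes: the shared, heap-allocated m_immedVals scratch buffer becomes a 32-byte-aligned stack array sized for exactly the intermediate rows a separable filter pass needs (removing the CHECKED_MALLOC/X265_FREE pair and making the paths re-entrant), and chroma motion vectors are pre-scaled once so the integer offset and sub-pel phase fall out of fixed shifts instead of csp-dependent ones. A scalar sketch of the MV split, assuming x265's quarter-pel luma MV units (names are illustrative):

    #include <cstdint>

    // Split one quarter-pel luma MV component for a chroma plane whose
    // subsampling shift is 1 (e.g. 4:2:0 horizontal) or 0 (4:4:4).
    // A subsampled plane is half-size, so the quarter-pel value is
    // already in eighth-pel chroma units; a full-size plane needs one
    // extra doubling. After that, '& 7' is the 3-bit filter phase and
    // '>> 3' the whole-pel offset, independent of the color space.
    static inline void chromaMvSplit(int mvQpel, int chromaShift,
                                     int* frac, intptr_t* offset)
    {
        int mvc = mvQpel << (1 - chromaShift);
        *frac = mvc & 7;
        *offset = mvc >> 3;
    }

For luma the same split is simply mv.x & 3 (phase) and mv.x >> 2 (offset), which is why the 0x3 masks above could be rewritten as plain 3.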
View file
x265_1.9.tar.gz/source/common/predict.h -> x265_2.0.tar.gz/source/common/predict.h
Changed
@@ -73,7 +73,6 @@
     };
 
     ShortYuv m_predShortYuv[2]; /* temporary storage for weighted prediction */
-    int16_t* m_immedVals;
 
     // Unfiltered/filtered neighbours of the current partition.
     pixel intraNeighbourBuf[2][258];
View file
x265_1.9.tar.gz/source/common/primitives.cpp -> x265_2.0.tar.gz/source/common/primitives.cpp
Changed
@@ -238,7 +238,9 @@
         primitives.cu[i].intra_pred_allangs = NULL;
 
 #if ENABLE_ASSEMBLY
+#if X265_ARCH_X86
         setupInstrinsicPrimitives(primitives, param->cpuid);
+#endif
         setupAssemblyPrimitives(primitives, param->cpuid);
 #endif
 
@@ -249,7 +251,7 @@
     }
 }
 
-#if ENABLE_ASSEMBLY
+#if ENABLE_ASSEMBLY && X265_ARCH_X86
 /* these functions are implemented in assembly. When assembly is not being
  * compiled, they are unnecessary and can be NOPs */
 #else
@@ -258,7 +260,10 @@
 void PFX(cpu_emms)(void) {}
 void PFX(cpu_cpuid)(uint32_t, uint32_t *eax, uint32_t *, uint32_t *, uint32_t *) { *eax = 0; }
 void PFX(cpu_xgetbv)(uint32_t, uint32_t *, uint32_t *) {}
+
+#if X265_ARCH_ARM == 0
 void PFX(cpu_neon_test)(void) {}
 int PFX(cpu_fast_neon_mrc_test)(void) { return 0; }
+#endif // X265_ARCH_ARM
 }
 #endif
View file
x265_1.9.tar.gz/source/common/primitives.h -> x265_2.0.tar.gz/source/common/primitives.h
Changed
@@ -189,6 +189,9 @@
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
 
+typedef void (*cutree_fix8_unpack)(double *dst, uint16_t *src, int count);
+typedef void (*cutree_fix8_pack)(uint16_t *dst, double *src, int count);
+
 typedef int (*scanPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
 typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
@@ -197,6 +200,7 @@
 typedef uint32_t (*costC1C2Flag_t)(uint16_t *absCoeff, intptr_t numC1Flag, uint8_t *baseCtxMod, intptr_t ctxOffset);
 
 typedef void (*pelFilterLumaStrong_t)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ);
+typedef void (*pelFilterChroma_t)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ);
 
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
@@ -313,6 +317,8 @@
     downscale_t frameInitLowres;
     cutree_propagate_cost propagateCost;
+    cutree_fix8_unpack fix8Unpack;
+    cutree_fix8_pack fix8Pack;
 
     extendCURowBorder_t extendRowBorder;
     planecopy_cp_t planecopy_cp;
@@ -332,6 +338,7 @@
     costC1C2Flag_t costC1C2Flag;
 
     pelFilterLumaStrong_t pelFilterLumaStrong[2]; // EDGE_VER = 0, EDGE_HOR = 1
+    pelFilterChroma_t pelFilterChroma[2]; // EDGE_VER = 0, EDGE_HOR = 1
 
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
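The new cutree_fix8_pack/unpack primitive pair converts cu-tree propagation data between double and a 16-bit fixed-point representation (the name suggests 8 fractional bits), which shrinks multi-pass stats and vectorizes well; SSSE3 and AVX2 kernels are registered in asm-primitives.cpp below. An illustrative C++ equivalent, assuming the Q8.8 interpretation with a signed value range (not the exact reference code):

    #include <cstdint>

    // Pack doubles to assumed Q8.8: scale by 256, keep the low 16 bits.
    static void fix8Pack(uint16_t* dst, const double* src, int count)
    {
        for (int i = 0; i < count; i++)
            dst[i] = (uint16_t)(int16_t)(src[i] * 256.0);
    }

    // Unpack Q8.8 back to double, reinterpreting as signed.
    static void fix8Unpack(double* dst, const uint16_t* src, int count)
    {
        for (int i = 0; i < count; i++)
            dst[i] = (int16_t)src[i] / 256.0;
    }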
View file
x265_1.9.tar.gz/source/common/quant.cpp -> x265_2.0.tar.gz/source/common/quant.cpp
Changed
@@ -188,10 +188,9 @@
     m_nr = NULL;
 }
 
-bool Quant::init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy)
+bool Quant::init(double psyScale, const ScalingList& scalingList, Entropy& entropy)
 {
     m_entropyCoder = &entropy;
-    m_rdoqLevel = rdoqLevel;
     m_psyRdoqScale = (int32_t)(psyScale * 256.0);
     X265_CHECK((psyScale * 256.0) < (double)MAX_INT, "psyScale value too large\n");
     m_scalingList = &scalingList;
@@ -223,6 +222,7 @@
 {
     m_nr = m_frameNr ? &m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL;
     m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET);
+    m_rdoqLevel = ctu.m_encData->m_param->rdoqLevel;
     if (ctu.m_chromaFormat != X265_CSP_I400)
     {
         setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat);
View file
x265_1.9.tar.gz/source/common/quant.h -> x265_2.0.tar.gz/source/common/quant.h
Changed
@@ -100,7 +100,7 @@
     ~Quant();
 
     /* one-time setup */
-    bool init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy);
+    bool init(double psyScale, const ScalingList& scalingList, Entropy& entropy);
     bool allocNoiseReduction(const x265_param& param);
 
     /* CU setup */
View file
x265_1.9.tar.gz/source/common/scalinglist.cpp -> x265_2.0.tar.gz/source/common/scalinglist.cpp
Changed
@@ -57,7 +57,11 @@
     },
     {
         "INTRA32X32_LUMA",
+        "",
+        "",
         "INTER32X32_LUMA",
+        "",
+        "",
     },
 };
 const char MatrixType_DC[4][12][22] =
@@ -76,7 +80,11 @@
     },
     {
         "INTRA32X32_LUMA_DC",
+        "",
+        "",
        "INTER32X32_LUMA_DC",
+        "",
+        "",
     },
 };
@@ -246,15 +254,15 @@
     char line[1024];
     int32_t *src = NULL;
 
+    fseek(fp, 0, 0);
     for (int sizeIdc = 0; sizeIdc < NUM_SIZES; sizeIdc++)
     {
         int size = X265_MIN(MAX_MATRIX_COEF_NUM, s_numCoefPerSize[sizeIdc]);
 
-        for (int listIdc = 0; listIdc < NUM_LISTS; listIdc++)
+        for (int listIdc = 0; listIdc < NUM_LISTS; listIdc += (sizeIdc == 3) ? 3 : 1)
         {
             src = m_scalingListCoef[sizeIdc][listIdc];
 
-            fseek(fp, 0, 0);
             do
             {
                 char *ret = fgets(line, 1024, fp);
@@ -282,7 +290,6 @@
 
         if (sizeIdc > BLOCK_8x8)
         {
-            fseek(fp, 0, 0);
             do
             {
                 char *ret = fgets(line, 1024, fp);
@@ -310,7 +317,7 @@
     fclose(fp);
 
     m_bEnabled = true;
-    m_bDataPresent = !checkDefaultScalingList();
+    m_bDataPresent = true;
     return false;
 }
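Two parser fixes are bundled here: the custom scaling-list file is now rewound once up front and scanned strictly forward (the per-list fseek calls are gone, so matrices must appear in file order), and the 32x32 size class steps listIdc by 3 because HEVC defines only intra- and inter-luma lists there; the "" entries padded into MatrixType/MatrixType_DC keep the name lookup aligned with indices 0 and 3. A successfully parsed file is also now always signalled as data-present instead of being compared against the defaults. The loop shape, isolated:

    const int NUM_SIZES = 4; // 4x4, 8x8, 16x16, 32x32
    const int NUM_LISTS = 6; // {intra, inter} x {Y, Cb, Cr}

    for (int sizeIdc = 0; sizeIdc < NUM_SIZES; sizeIdc++)
    {
        // 32x32 (sizeIdc == 3) only has lists 0 (intra luma) and
        // 3 (inter luma); stride over the nonexistent chroma slots.
        int step = (sizeIdc == 3) ? 3 : 1;
        for (int listIdc = 0; listIdc < NUM_LISTS; listIdc += step)
        {
            // fgets()-scan forward to MatrixType[sizeIdc][listIdc]...
        }
    }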
View file
x265_1.9.tar.gz/source/common/shortyuv.cpp -> x265_2.0.tar.gz/source/common/shortyuv.cpp
Changed
@@ -78,11 +78,11 @@
     memset(m_buf[2], 0, (m_csize * m_csize) * sizeof(int16_t));
 }
 
-void ShortYuv::subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size)
+void ShortYuv::subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size, int picCsp)
 {
     const int sizeIdx = log2Size - 2;
     primitives.cu[sizeIdx].sub_ps(m_buf[0], m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size);
-    if (m_csp != X265_CSP_I400)
+    if (m_csp != X265_CSP_I400 && picCsp != X265_CSP_I400)
     {
         primitives.chroma[m_csp].cu[sizeIdx].sub_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize);
         primitives.chroma[m_csp].cu[sizeIdx].sub_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
View file
x265_1.9.tar.gz/source/common/shortyuv.h -> x265_2.0.tar.gz/source/common/shortyuv.h
Changed
@@ -64,7 +64,7 @@
     const int16_t* getCrAddr(uint32_t absPartIdx) const { return m_buf[2] + getChromaAddrOffset(absPartIdx); }
     const int16_t* getChromaAddr(uint32_t chromaId, uint32_t partUnitIdx) const { return m_buf[chromaId] + getChromaAddrOffset(partUnitIdx); }
 
-    void subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size);
+    void subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size, int picCsp);
 
     void copyPartToPartLuma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const;
     void copyPartToPartChroma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2SizeL) const;
View file
x265_1.9.tar.gz/source/common/threadpool.cpp -> x265_2.0.tar.gz/source/common/threadpool.cpp
Changed
@@ -28,6 +28,10 @@ #include <new> +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 +#include <winnt.h> +#endif + #if X86_64 #ifdef __GNUC__ @@ -64,6 +68,21 @@ # define strcasecmp _stricmp #endif +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 +const uint64_t m1 = 0x5555555555555555; //binary: 0101... +const uint64_t m2 = 0x3333333333333333; //binary: 00110011.. +const uint64_t m3 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ... +const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3... + +static int popCount(uint64_t x) +{ + x -= (x >> 1) & m1; + x = (x & m2) + ((x >> 2) & m2); + x = (x + (x >> 4)) & m3; + return (x * h01) >> 56; +} +#endif + namespace X265_NS { // x265 private namespace @@ -238,7 +257,6 @@ memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool)); int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM); - int cpuCount = getCpuCount(); bool bNumaSupport = false; #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 @@ -248,26 +266,54 @@ #endif - for (int i = 0; i < cpuCount; i++) +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 + PGROUP_AFFINITY groupAffinityPointer = new GROUP_AFFINITY; + for (int i = 0; i < numNumaNodes; i++) { -#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 - UCHAR node; - if (GetNumaProcessorNode((UCHAR)i, &node)) - cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++; - else + GetNumaNodeProcessorMaskEx((UCHAR)i, groupAffinityPointer); + cpusPerNode[i] = popCount(groupAffinityPointer->Mask); + } + delete groupAffinityPointer; #elif HAVE_LIBNUMA - if (bNumaSupport >= 0) - cpusPerNode[X265_MIN(numa_node_of_cpu(i), MAX_NODE_NUM)]++; - else -#endif - cpusPerNode[0]++; + if (bNumaSupport) + { + struct bitmask* bitMask = numa_allocate_cpumask(); + for (int i = 0; i < numNumaNodes; i++) + { + int ret = numa_node_to_cpus(i, bitMask); + if (!ret) + cpusPerNode[i] = numa_bitmask_weight(bitMask); + else + x265_log(p, X265_LOG_ERROR, "Failed to genrate CPU mask\n"); + } + numa_free_cpumask(bitMask); } +#else // NUMA not supported + cpusPerNode[0] = getCpuCount(); +#endif if (bNumaSupport && p->logLevel >= X265_LOG_DEBUG) - for (int i = 0; i < numNumaNodes; i++) - x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d logical cores\n", i, cpusPerNode[i]); - - /* limit threads based on param->numaPools */ + for (int i = 0; i < numNumaNodes; i++) + x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d logical cores\n", i, cpusPerNode[i]); + /* limit threads based on param->numaPools + * For windows because threads can't be allocated to live across sockets + * changing the default behavior to be per-socket pools -- FIXME */ +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 + if (!p->numaPools) + { + char poolString[50] = ""; + for (int i = 0; i < numNumaNodes; i++) + { + char nextCount[10] = ""; + if (i) + sprintf(nextCount, ",%d", cpusPerNode[i]); + else + sprintf(nextCount, "%d", cpusPerNode[i]); + strcat(poolString, nextCount); + } + x265_param_parse(p, "pools", poolString); + } +#endif if (p->numaPools && *p->numaPools) { const char *nodeStr = p->numaPools; @@ -280,7 +326,7 @@ } else if (*nodeStr == '-') threadsPerPool[i] = 0; - else if (*nodeStr == '*' || !strcasecmp(nodeStr, "NULL")) + else if (*nodeStr == '*' || !strcasecmp(nodeStr, "NULL")) { for (int j = i; j < numNumaNodes; j++) { @@ -297,8 +343,16 @@ else { int count = atoi(nodeStr); - threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]); - nodeMaskPerPool[i] = ((uint64_t)1 << i); + if 
(i > 0 || strchr(nodeStr, ',')) // it is comma -> old logic + { + threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]); + nodeMaskPerPool[i] = ((uint64_t)1 << i); + } + else // new logic: exactly 'count' threads on all NUMAs + { + threadsPerPool[numNumaNodes] = X265_MIN(count, numNumaNodes * MAX_POOL_THREADS); + nodeMaskPerPool[numNumaNodes] = ((uint64_t)-1 >> (64 - numNumaNodes)); + } } /* consume current node string, comma, and white-space */ @@ -389,16 +443,15 @@ X265_CHECK(numThreads <= MAX_POOL_THREADS, "a single thread pool cannot have more than MAX_POOL_THREADS threads\n"); #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 - m_winCpuMask = 0x0; - GROUP_AFFINITY groupAffinity; + memset(&m_groupAffinity, 0, sizeof(GROUP_AFFINITY)); for (int i = 0; i < getNumaNodeCount(); i++) { int numaNode = ((nodeMask >> i) & 0x1U) ? i : -1; if (numaNode != -1) - if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity)) - m_winCpuMask |= groupAffinity.Mask; + if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &m_groupAffinity)) + break; } - m_numaMask = &m_winCpuMask; + m_numaMask = &m_groupAffinity.Mask; #elif HAVE_LIBNUMA if (numa_available() >= 0) { @@ -480,11 +533,16 @@ setThreadNodeAffinity(m_numaMask); } -/* static */ void ThreadPool::setThreadNodeAffinity(void *numaMask) { #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 - if (SetThreadAffinityMask(GetCurrentThread(), *((DWORD_PTR*)numaMask))) + UNREFERENCED_PARAMETER(numaMask); + GROUP_AFFINITY groupAffinity; + memset(&groupAffinity, 0, sizeof(GROUP_AFFINITY)); + groupAffinity.Group = m_groupAffinity.Group; + groupAffinity.Mask = m_groupAffinity.Mask; + const PGROUP_AFFINITY affinityPointer = &groupAffinity; + if (SetThreadGroupAffinity(GetCurrentThread(), affinityPointer, NULL)) return; else x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n"); @@ -524,10 +582,25 @@ /* static */ int ThreadPool::getCpuCount() { -#if _WIN32 +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 + enum { MAX_NODE_NUM = 127 }; + int cpus = 0; + int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM); + GROUP_AFFINITY groupAffinity; + for (int i = 0; i < numNumaNodes; i++) + { + GetNumaNodeProcessorMaskEx((UCHAR)i, &groupAffinity); + cpus += popCount(groupAffinity.Mask); + } + return cpus; +#elif _WIN32 SYSTEM_INFO sysinfo; GetSystemInfo(&sysinfo); return sysinfo.dwNumberOfProcessors; +#elif __unix__ && X265_ARCH_ARM + /* Return the number of processors configured by OS. Because, most embedded linux distributions + * uses only one processor as the scheduler doesn't have enough work to utilize all processors */ + return sysconf(_SC_NPROCESSORS_CONF); #elif __unix__ return sysconf(_SC_NPROCESSORS_ONLN); #elif MACOS
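The Windows side of the pool setup now asks each NUMA node for its processor mask with GetNumaNodeProcessorMaskEx and counts the set bits, instead of probing logical CPUs one at a time with GetNumaProcessorNode, and when --pools is unset it synthesizes a per-node pool string because a Windows thread cannot span processor groups. The bit counting is the classic SWAR population count from the hunk above; standalone, with each folding step annotated:

    #include <cstdint>

    // Branch-free Hamming weight of a 64-bit affinity mask.
    static int popCount(uint64_t x)
    {
        x -= (x >> 1) & 0x5555555555555555ULL;           // 2-bit pair sums
        x = (x & 0x3333333333333333ULL)
          + ((x >> 2) & 0x3333333333333333ULL);          // 4-bit sums
        x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;      // per-byte sums
        return (int)((x * 0x0101010101010101ULL) >> 56); // total in top byte
    }

popCount(groupAffinity.Mask) then yields one node's logical-CPU count. The numaPools parsing also gains a mode: a bare count with no comma now requests exactly that many threads spread across all nodes, rather than being read as a first-node figure.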
View file
x265_1.9.tar.gz/source/common/threadpool.h -> x265_2.0.tar.gz/source/common/threadpool.h
Changed
@@ -85,7 +85,7 @@
     int m_numWorkers;
     void* m_numaMask; // node mask in linux, cpu mask in windows
 #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
-    DWORD_PTR m_winCpuMask;
+    GROUP_AFFINITY m_groupAffinity;
 #endif
 
     bool m_isActive;
@@ -99,6 +99,7 @@
     bool start();
     void stopWorkers();
     void setCurrentThreadAffinity();
+    void setThreadNodeAffinity(void *numaMask);
     int tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap);
     int tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
 
@@ -106,7 +107,6 @@
     static int getCpuCount();
     static int getNumaNodeCount();
-    static void setThreadNodeAffinity(void *numaMask);
 };
 
 /* Any worker thread may enlist the help of idle worker threads from the same
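Replacing the DWORD_PTR mask with a full GROUP_AFFINITY matters on large machines: a bare 64-bit mask can only describe the logical CPUs of one Windows processor group, while GROUP_AFFINITY records the group index alongside the mask (setThreadNodeAffinity also loses its static qualifier because it now reads the pool's cached m_groupAffinity). A minimal sketch of the Win32 usage shown in the threadpool.cpp hunk above:

    #include <windows.h> // Windows 7+ processor-group APIs
    #include <cstring>

    // Pin the calling thread to one NUMA node's group and mask.
    static void pinToNode(USHORT node)
    {
        GROUP_AFFINITY ga;
        memset(&ga, 0, sizeof(ga));
        if (GetNumaNodeProcessorMaskEx(node, &ga))
            SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL);
    }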
View file
x265_1.9.tar.gz/source/common/x86/asm-primitives.cpp -> x265_2.0.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -861,12 +861,12 @@ template<int size> void interp_8tap_hv_pp_cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY) { - ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_LUMA)]); - const int filterSize = NTAPS_LUMA; - const int halfFilterSize = filterSize >> 1; + ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_LUMA - 1)]); + const int halfFilterSize = NTAPS_LUMA >> 1; + const int immedStride = MAX_CU_SIZE; - primitives.pu[size].luma_hps(src, srcStride, immed, MAX_CU_SIZE, idxX, 1); - primitives.pu[size].luma_vsp(immed + (halfFilterSize - 1) * MAX_CU_SIZE, MAX_CU_SIZE, dst, dstStride, idxY); + primitives.pu[size].luma_hps(src, srcStride, immed, immedStride, idxX, 1); + primitives.pu[size].luma_vsp(immed + (halfFilterSize - 1) * immedStride, immedStride, dst, dstStride, idxY); } #if HIGH_BIT_DEPTH @@ -1098,9 +1098,16 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = PFX(filterPixelToShort_8x2_ssse3); p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = PFX(filterPixelToShort_8x6_ssse3); p.findPosFirstLast = PFX(findPosFirstLast_ssse3); + p.fix8Unpack = PFX(cutree_fix8_unpack_ssse3); + p.fix8Pack = PFX(cutree_fix8_pack_ssse3); } if (cpuMask & X265_CPU_SSE4) { + p.pelFilterLumaStrong[0] = PFX(pelFilterLumaStrong_V_sse4); + p.pelFilterLumaStrong[1] = PFX(pelFilterLumaStrong_H_sse4); + p.pelFilterChroma[0] = PFX(pelFilterChroma_V_sse4); + p.pelFilterChroma[1] = PFX(pelFilterChroma_H_sse4); + p.saoCuOrgE0 = PFX(saoCuOrgE0_sse4); p.saoCuOrgE1 = PFX(saoCuOrgE1_sse4); p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_sse4); @@ -1166,6 +1173,12 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = PFX(filterPixelToShort_2x16_sse4); p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4); p.costCoeffRemain = PFX(costCoeffRemain_sse4); +#if X86_64 + p.saoCuStatsE0 = PFX(saoCuStatsE0_sse4); + p.saoCuStatsE1 = PFX(saoCuStatsE1_sse4); + p.saoCuStatsE2 = PFX(saoCuStatsE2_sse4); + p.saoCuStatsE3 = PFX(saoCuStatsE3_sse4); +#endif } if (cpuMask & X265_CPU_AVX) { @@ -2141,11 +2154,23 @@ p.frameInitLowres = PFX(frame_init_lowres_core_avx2); p.propagateCost = PFX(mbtree_propagate_cost_avx2); + p.fix8Unpack = PFX(cutree_fix8_unpack_avx2); + p.fix8Pack = PFX(cutree_fix8_pack_avx2); + + /* TODO: This kernel needs to be modified to work with HIGH_BIT_DEPTH only + p.planeClipAndMax = PFX(planeClipAndMax_avx2); */ // TODO: depends on hps and vsp ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); // calling luma_hvpp for all sizes p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; // ALL_LUMA_PU_T has declared all sizes except 4x4, hence calling luma_hvpp[4x4] +#if X265_DEPTH == 10 + p.pu[LUMA_8x8].satd = PFX(pixel_satd_8x8_avx2); + p.cu[LUMA_8x8].sa8d = PFX(pixel_sa8d_8x8_avx2); + p.cu[LUMA_16x16].sa8d = PFX(pixel_sa8d_16x16_avx2); + p.cu[LUMA_32x32].sa8d = PFX(pixel_sa8d_32x32_avx2); +#endif + if (cpuMask & X265_CPU_BMI2) { p.scanPosLast = PFX(scanPosLast_avx2_bmi2); @@ -2434,6 +2459,8 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_ssse3); p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_ssse3); p.findPosFirstLast = PFX(findPosFirstLast_ssse3); + p.fix8Unpack = PFX(cutree_fix8_unpack_ssse3); + p.fix8Pack = PFX(cutree_fix8_pack_ssse3); } if (cpuMask & X265_CPU_SSE4) { @@ -2529,8 +2556,10 @@ #if X86_64 p.pelFilterLumaStrong[0] = PFX(pelFilterLumaStrong_V_sse4); p.pelFilterLumaStrong[1] = PFX(pelFilterLumaStrong_H_sse4); + p.pelFilterChroma[0] = 
PFX(pelFilterChroma_V_sse4); + p.pelFilterChroma[1] = PFX(pelFilterChroma_H_sse4); - p.saoCuStatsBO = PFX(saoCuStatsBO_sse4); +// p.saoCuStatsBO = PFX(saoCuStatsBO_sse4); p.saoCuStatsE0 = PFX(saoCuStatsE0_sse4); p.saoCuStatsE1 = PFX(saoCuStatsE1_sse4); p.saoCuStatsE2 = PFX(saoCuStatsE2_sse4); @@ -2932,6 +2961,7 @@ p.cu[BLOCK_8x8].intra_pred[14] = PFX(intra_pred_ang8_14_avx2); p.cu[BLOCK_8x8].intra_pred[15] = PFX(intra_pred_ang8_15_avx2); p.cu[BLOCK_8x8].intra_pred[16] = PFX(intra_pred_ang8_16_avx2); + p.cu[BLOCK_8x8].intra_pred[17] = PFX(intra_pred_ang8_17_avx2); p.cu[BLOCK_8x8].intra_pred[20] = PFX(intra_pred_ang8_20_avx2); p.cu[BLOCK_8x8].intra_pred[21] = PFX(intra_pred_ang8_21_avx2); p.cu[BLOCK_8x8].intra_pred[22] = PFX(intra_pred_ang8_22_avx2); @@ -3651,7 +3681,6 @@ p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ps = PFX(blockcopy_ps_32x64_avx2); p.cu[BLOCK_64x64].copy_ps = PFX(blockcopy_ps_64x64_avx2); - p.planeClipAndMax = PFX(planeClipAndMax_avx2); p.pu[LUMA_32x8].sad_x3 = PFX(pixel_sad_x3_32x8_avx2); p.pu[LUMA_32x16].sad_x3 = PFX(pixel_sad_x3_32x16_avx2); @@ -3663,6 +3692,8 @@ p.pu[LUMA_64x48].sad_x3 = PFX(pixel_sad_x3_64x48_avx2); p.pu[LUMA_64x64].sad_x3 = PFX(pixel_sad_x3_64x64_avx2); p.pu[LUMA_48x64].sad_x3 = PFX(pixel_sad_x3_48x64_avx2); + p.fix8Unpack = PFX(cutree_fix8_unpack_avx2); + p.fix8Pack = PFX(cutree_fix8_pack_avx2); } #endif
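Besides registering the new pelFilterChroma, SAO-statistics and fix8 kernels, this hunk trims interp_8tap_hv_pp_cpu's scratch area by one row: an N-tap vertical filter producing H rows reads intermediate rows -(N/2 - 1) through (H - 1) + N/2, which is exactly H + N - 1 rows, matching what luma_hps emits with row extension enabled. The row arithmetic, checked at compile time:

    // Intermediate rows needed by an N-tap vertical pass over H rows.
    constexpr int N = 8;  // NTAPS_LUMA
    constexpr int H = 64; // MAX_CU_SIZE
    constexpr int rows = ((H - 1) + N / 2) - (-(N / 2 - 1)) + 1;
    static_assert(rows == H + N - 1, "8-tap over 64 rows needs 71 intermediate rows");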
View file
x265_1.9.tar.gz/source/common/x86/blockcopy8.asm -> x265_2.0.tar.gz/source/common/x86/blockcopy8.asm
Changed
@@ -28,8 +28,6 @@
 
 SECTION_RODATA 32
 
-tab_Vm: db 0, 2, 4, 6, 8, 10, 12, 14, 0, 0, 0, 0, 0, 0, 0, 0
-
 cextern pb_4
 cextern pb_1
 cextern pb_16
View file
x265_1.9.tar.gz/source/common/x86/const-a.asm -> x265_2.0.tar.gz/source/common/x86/const-a.asm
Changed
@@ -40,12 +40,16 @@
 const pb_8,        times 32 db 8
 const pb_15,       times 32 db 15
 const pb_16,       times 32 db 16
+const pb_31,       times 32 db 31
 const pb_32,       times 32 db 32
 const pb_64,       times 32 db 64
+const pb_124,      times 32 db 124
 const pb_128,      times 32 db 128
 const pb_a1,       times 16 db 0xa1
 const pb_01,       times 8 db 0, 1
+const pb_0123,     times 4 db 0, 1
+                   times 4 db 2, 3
 const hsub_mul,    times 16 db 1, -1
 const pw_swap,     times 2 db 6, 7, 4, 5, 2, 3, 0, 1
 const pb_unpackbd1, times 2 db 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
@@ -64,6 +68,8 @@
                    times 12 db 0x00
 const pb_000000000000000F, db 0xff
                    times 15 db 0x00
+const pb_shuf_off4, times 2 db 0, 4, 1, 5, 2, 6, 3, 7
+const pw_shuf_off4, times 1 db 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15
 
 ;; 16-bit constants
@@ -115,6 +121,8 @@
 const hmul_16p,    times 16 db 1
                    times 8 db 1, -1
 const pw_exp2_0_15, dw 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
+const pw_1_ffff,   times 4 dw 1
+                   times 4 dw 0xFFFF
 
 ;; 32-bit constants
@@ -146,10 +154,6 @@
 const pd_planar16_mul2, times 1 dd 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
 const trans8_shuf, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7
 
-const popcnt_table
-%assign x 0
-%rep 256
-; population count
-db ((x>>0)&1)+((x>>1)&1)+((x>>2)&1)+((x>>3)&1)+((x>>4)&1)+((x>>5)&1)+((x>>6)&1)+((x>>7)&1)
-%assign x x+1
-%endrep
+;; 64-bit constants
+
+const pq_1, times 1 dq 1
View file
x265_1.9.tar.gz/source/common/x86/intrapred8.asm -> x265_2.0.tar.gz/source/common/x86/intrapred8.asm
Changed
@@ -355,55 +355,55 @@ times 8 db (32-22), 22 times 8 db (32-11), 11 -const ang16_shuf_mode9, times 8 db 0, 1 - times 8 db 1, 2 +const ang16_shuf_mode9, times 8 db 0, 1 + times 8 db 1, 2 -const angHor_tab_9, db (32-2), 2, (32-4), 4, (32-6), 6, (32-8), 8, (32-10), 10, (32-12), 12, (32-14), 14, (32-16), 16 - db (32-18), 18, (32-20), 20, (32-22), 22, (32-24), 24, (32-26), 26, (32-28), 28, (32-30), 30, (32-32), 32 +const angHor_tab_9, db (32-2), 2, (32-4), 4, (32-6), 6, (32-8), 8, (32-10), 10, (32-12), 12, (32-14), 14, (32-16), 16 + db (32-18), 18, (32-20), 20, (32-22), 22, (32-24), 24, (32-26), 26, (32-28), 28, (32-30), 30, (32-32), 32 -const angHor_tab_11, db (32-30), 30, (32-28), 28, (32-26), 26, (32-24), 24, (32-22), 22, (32-20), 20, (32-18), 18, (32-16), 16 - db (32-14), 14, (32-12), 12, (32-10), 10, (32- 8), 8, (32- 6), 6, (32- 4), 4, (32- 2), 2, (32- 0), 0 +const angHor_tab_11, db (32-30), 30, (32-28), 28, (32-26), 26, (32-24), 24, (32-22), 22, (32-20), 20, (32-18), 18, (32-16), 16 + db (32-14), 14, (32-12), 12, (32-10), 10, (32- 8), 8, (32- 6), 6, (32- 4), 4, (32- 2), 2, (32- 0), 0 -const ang16_shuf_mode12, db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3 - db 1, 2, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2 +const ang16_shuf_mode12, db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3 + db 1, 2, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2 -const angHor_tab_12, db (32-27), 27, (32-22), 22, (32-17), 17, (32-12), 12, (32-7), 7, (32-2), 2, (32-29), 29, (32-24), 24 - db (32-19), 19, (32-14), 14, (32-9), 9, (32-4), 4, (32-31), 31, (32-26), 26, (32-21), 21, (32-16), 16 +const angHor_tab_12, db (32-27), 27, (32-22), 22, (32-17), 17, (32-12), 12, (32-7), 7, (32-2), 2, (32-29), 29, (32-24), 24 + db (32-19), 19, (32-14), 14, (32-9), 9, (32-4), 4, (32-31), 31, (32-26), 26, (32-21), 21, (32-16), 16 -const ang16_shuf_mode13, db 4, 5, 4, 5, 4, 5, 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 5, 6, 5, 6, 5, 6, 4, 5, 4, 5, 4, 5, 4, 5, 3, 4 - db 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 3, 4, 3, 4, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2 - db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0, 0 ,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0 +const ang16_shuf_mode13, db 4, 5, 4, 5, 4, 5, 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 5, 6, 5, 6, 5, 6, 4, 5, 4, 5, 4, 5, 4, 5, 3, 4 + db 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 3, 4, 3, 4, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0, 0 ,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0 -const angHor_tab_13, db (32-23), 23, (32-14), 14, (32-5), 5, (32-28), 28, (32-19), 19, (32-10), 10, (32-1), 1, (32-24), 24 - db (32-15), 15, (32-6), 6, (32-29), 29, (32-20), 20, (32-11), 11, (32-2), 2, (32-25), 25, (32-16), 16 +const angHor_tab_13, db (32-23), 23, (32-14), 14, (32-5), 5, (32-28), 28, (32-19), 19, (32-10), 10, (32-1), 1, (32-24), 24 + db (32-15), 15, (32-6), 6, (32-29), 29, (32-20), 20, (32-11), 11, (32-2), 2, (32-25), 25, (32-16), 16 -const ang16_shuf_mode14, db 6, 7, 6, 7, 5, 6, 5, 6, 4, 5, 4, 5, 4, 5, 3, 4, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6, 5, 6, 5, 6, 4, 5 - db 3, 4, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 0, 1, 4, 5, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2 - db 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0 +const ang16_shuf_mode14, db 6, 7, 6, 7, 5, 6, 5, 6, 4, 5, 4, 5, 4, 5, 3, 4, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6, 
5, 6, 5, 6, 4, 5 + db 3, 4, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 0, 1, 4, 5, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0 -const angHor_tab_14, db (32-19), 19, (32-6), 6, (32-25), 25, (32-12), 12, (32-31), 31, (32-18), 18, (32-5), 5, (32-24), 24 - db (32-11), 11, (32-30), 30, (32-17), 17, (32-4), 4, (32-23), 23, (32-10), 10, (32-29), 29, (32-16), 16 +const angHor_tab_14, db (32-19), 19, (32-6), 6, (32-25), 25, (32-12), 12, (32-31), 31, (32-18), 18, (32-5), 5, (32-24), 24 + db (32-11), 11, (32-30), 30, (32-17), 17, (32-4), 4, (32-23), 23, (32-10), 10, (32-29), 29, (32-16), 16 -const ang16_shuf_mode15, db 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6, 5, 6, 4, 5, 9, 10, 8, 9, 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6 - db 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 5, 6, 4, 5, 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2 - db 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0 +const ang16_shuf_mode15, db 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6, 5, 6, 4, 5, 9, 10, 8, 9, 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6 + db 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 5, 6, 4, 5, 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2 + db 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0 -const angHor_tab_15, db (32-15), 15, (32-30), 30, (32-13), 13, (32-28), 28, (32-11), 11, (32-26), 26, (32-9), 9, (32-24), 24 - db (32-7), 7, (32-22), 22, (32-5), 5, (32-20), 20, (32-3), 3, (32-18), 18, (32-1), 1, (32- 16), 16 +const angHor_tab_15, db (32-15), 15, (32-30), 30, (32-13), 13, (32-28), 28, (32-11), 11, (32-26), 26, (32-9), 9, (32-24), 24 + db (32-7), 7, (32-22), 22, (32-5), 5, (32-20), 20, (32-3), 3, (32-18), 18, (32-1), 1, (32- 16), 16 -const ang16_shuf_mode16, db 10, 11, 9, 10, 9, 10, 8, 9, 7, 8, 7, 8, 6, 7, 5, 6, 11, 12, 10, 11, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7 - db 5, 6, 4, 5, 3, 4, 3, 4, 2, 3, 1, 2, 1, 2, 0, 1, 6, 7, 5, 6, 4, 5, 4, 5, 3, 4, 2, 3, 2, 3, 1, 2 - db 0 ,0, 0, 0, 0, 15, 14, 12 , 11, 9, 8, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2, 0 +const ang16_shuf_mode16, db 10, 11, 9, 10, 9, 10, 8, 9, 7, 8, 7, 8, 6, 7, 5, 6, 11, 12, 10, 11, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7 + db 5, 6, 4, 5, 3, 4, 3, 4, 2, 3, 1, 2, 1, 2, 0, 1, 6, 7, 5, 6, 4, 5, 4, 5, 3, 4, 2, 3, 2, 3, 1, 2 + db 0 ,0, 0, 0, 0, 15, 14, 12 , 11, 9, 8, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2, 0 -const angHor_tab_16, db (32-11), 11, (32-22), 22, (32-1), 1, (32-12), 12, (32-23), 23, (32-2), 2, (32-13), 13, (32-24), 24 - db (32-3), 3, (32-14), 14, (32-25), 25, (32-4), 4, (32-15), 15, (32-26), 26, (32-5), 5, (32-16), 16 +const angHor_tab_16, db (32-11), 11, (32-22), 22, (32-1), 1, (32-12), 12, (32-23), 23, (32-2), 2, (32-13), 13, (32-24), 24 + db (32-3), 3, (32-14), 14, (32-25), 25, (32-4), 4, (32-15), 15, (32-26), 26, (32-5), 5, (32-16), 16 -const ang16_shuf_mode17, db 12, 13, 11, 12, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7, 13, 14, 12, 13, 11, 12, 10, 11, 9, 10, 9, 10, 8, 9, 7, 8 - db 5, 6, 4, 5, 4, 5, 3, 4, 2, 3, 1, 2, 0, 1, 0, 1, 6, 7, 5, 6, 5, 6, 4, 5, 3, 4, 2, 3, 1, 2, 1, 2 - db 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0 +const ang16_shuf_mode17, db 12, 13, 11, 12, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7, 13, 14, 12, 13, 11, 12, 10, 11, 9, 10, 9, 10, 8, 9, 7, 8 + db 5, 6, 4, 5, 4, 5, 3, 4, 2, 3, 1, 2, 0, 1, 0, 1, 6, 7, 5, 6, 5, 6, 4, 5, 3, 4, 2, 3, 1, 2, 1, 2 + db 0, 0, 0, 15, 14, 
12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0 -const angHor_tab_17, db (32- 6), 6, (32-12), 12, (32-18), 18, (32-24), 24, (32-30), 30, (32- 4), 4, (32-10), 10, (32-16), 16 - db (32-22), 22, (32-28), 28, (32- 2), 2, (32- 8), 8, (32-14), 14, (32-20), 20, (32-26), 26, (32- 0), 0 +const angHor_tab_17, db (32- 6), 6, (32-12), 12, (32-18), 18, (32-24), 24, (32-30), 30, (32- 4), 4, (32-10), 10, (32-16), 16 + db (32-22), 22, (32-28), 28, (32- 2), 2, (32- 8), 8, (32-14), 14, (32-20), 20, (32-26), 26, (32- 0), 0 ; Intrapred_angle32x32, modes 1 to 33 constants const ang32_shuf_mode9, times 8 db 0, 1 @@ -467,6 +467,39 @@ dd 0, 0, 2, 3, 0, 0, 7, 1 dd 0, 0, 5, 6, 0, 0, 0, 0 +; Intrapred_angle8x8, modes 1 to 33 constants +const ang8_shuf_mode3, db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 4, 5, 5, 6, 6, 7, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 5, 6, 6, 7, 7, 8 +const ang8_shuf_mode4, db 0, 1, 1, 2, 1, 2, 2, 3, 3, 4, 3, 4, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 4, 5, 4, 5, 5, 6, 6, 7 +const ang8_shuf_mode5, db 0, 1, 1, 2, 1, 2, 2, 3, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 3, 4, 4, 5, 4, 5, 5, 6 +const ang8_shuf_mode6, db 0, 1, 0, 1, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 3, 4, 1, 2, 1, 2, 2, 3, 2, 3, 3, 4, 3, 4, 3, 4, 4, 5 +const ang8_shuf_mode7, db 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 1, 2, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 3, 3, 4 +const ang8_shuf_mode8, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 2, 3 +const ang8_shuf_mode9, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 +const ang8_shuf_mode12, db 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 6, 7, 6, 7, 8, 9, 8, 9, 8, 9, 8, 9, 8, 9, 8, 9, 7, 8, 7, 8 +const ang8_shuf_mode13, db 8, 9, 8, 9, 8, 9, 7, 8, 7, 8, 7, 8, 7, 8, 6, 7, 9, 10, 9, 10, 9, 10, 8, 9, 8, 9, 8, 9, 8, 9, 7, 8 +const ang8_shuf_mode14, db 9, 10, 9, 10, 8, 9, 8, 9, 7, 8, 7, 8, 7, 8, 6, 7, 10, 11, 10, 11, 9, 10, 9, 10, 8, 9, 8, 9, 8, 9, 7, 8 +const ang8_shuf_mode15, db 10, 11, 9, 10, 9, 10, 8, 9, 8, 9, 7, 8, 7, 8, 6, 7, 11, 12, 10, 11, 10, 11, 9, 10, 9, 10, 8, 9, 8, 9, 7, 8 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 6, 4, 2, 0 +const ang8_shuf_mode16, db 11, 12, 10, 11, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7, 12, 13, 11, 12, 11, 12, 10, 11, 9, 10, 9, 10, 8, 9, 7, 8 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 6, 5, 3, 2, 0 +const ang8_shuf_mode17, db 12, 13, 11, 12, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7, 13, 14, 12, 13, 11, 12, 10, 11, 9, 10, 9, 10, 8, 9, 7, 8 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 6, 5, 4, 2, 1, 0 + +const ang8_fact_mode3, db (32-26), 26, (32-20), 20, (32-14), 14, (32- 8), 8, (32- 2), 2, (32-28), 28, (32-22), 22, (32-16), 16 +const ang8_fact_mode4, db (32-21), 21, (32-10), 10, (32-31), 31, (32-20), 20, (32- 9), 9, (32-30), 30, (32-19), 19, (32- 8), 8 +const ang8_fact_mode5, db (32-17), 17, (32- 2), 2, (32-19), 19, (32- 4), 4, (32-21), 21, (32- 6), 6, (32-23), 23, (32- 8), 8 +const ang8_fact_mode6, db (32-13), 13, (32-26), 26, (32- 7), 7, (32-20), 20, (32- 1), 1, (32-14), 14, (32-27), 27, (32- 8), 8 +const ang8_fact_mode7, db (32- 9), 9, (32-18), 18, (32-27), 27, (32- 4), 4, (32-13), 13, (32-22), 22, (32-31), 31, (32- 8), 8 +const ang8_fact_mode8, db (32- 5), 5, (32-10), 10, (32-15), 15, (32-20), 20, (32-25), 25, (32-30), 30, (32- 3), 3, (32- 8), 8 +const ang8_fact_mode9, db (32- 2), 2, (32- 4), 4, (32- 6), 6, (32- 8), 8, 
(32-10), 10, (32-12), 12, (32-14), 14, (32-16), 16 +const ang8_fact_mode11, db (32-30), 30, (32-28), 28, (32-26), 26, (32-24), 24, (32-22), 22, (32-20), 20, (32-18), 18, (32-16), 16 +const ang8_fact_mode12, db (32-27), 27, (32-22), 22, (32-17), 17, (32-12), 12, (32- 7), 7, (32- 2), 2, (32-29), 29, (32-24), 24 +const ang8_fact_mode13, db (32-23), 23, (32-14), 14, (32- 5), 5, (32-28), 28, (32-19), 19, (32-10), 10, (32- 1), 1, (32-24), 24 +const ang8_fact_mode14, db (32-19), 19, (32- 6), 6, (32-25), 25, (32-12), 12, (32-31), 31, (32-18), 18, (32- 5), 5, (32-24), 24 +const ang8_fact_mode15, db (32-15), 15, (32-30), 30, (32-13), 13, (32-28), 28, (32-11), 11, (32-26), 26, (32- 9), 9, (32-24), 24 +const ang8_fact_mode16, db (32-11), 11, (32-22), 22, (32- 1), 1, (32-12), 12, (32-23), 23, (32- 2), 2, (32-13), 13, (32-24), 24 +const ang8_fact_mode17, db (32- 6), 6, (32-12), 12, (32-18), 18, (32-24), 24, (32-30), 30, (32- 4), 4, (32-10), 10, (32-16), 16 + const ang_table %assign x 0 %rep 32 @@ -490,6 +523,7 @@ SECTION .text cextern pb_1 +cextern pb_2 cextern pw_2 cextern pw_3 cextern pw_4 @@ -18582,48 +18616,48 @@ ; void intraPredAng8(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) ;----------------------------------------------------------------------------------------- INIT_YMM avx2 -cglobal intra_pred_ang8_3, 3,4,5 - mova m3, [pw_1024] +%macro ang8_store8x8 0 + lea r3, [3 * r1] + vextracti128 xm2, m1, 1 + vextracti128 xm5, m4, 1 + movq [r0], xm1 + movq [r0 + r1], xm2 + movhps [r0 + 2 * r1], xm1 + movhps [r0 + r3], xm2 + lea r0, [r0 + 4 * r1] + movq [r0], xm4 + movq [r0 + r1], xm5 + movhps [r0 + 2 * r1], xm4 + movhps [r0 + r3], xm5 +%endmacro + +cglobal intra_pred_ang8_3, 3,4,6 vbroadcasti128 m0, [r2 + 17] + mova m5, [ang8_shuf_mode3] + mova m3, [pb_2] - pshufb m1, m0, [c_ang8_src1_9_2_10] - pshufb m2, m0, [c_ang8_src3_11_4_12] - pshufb m4, m0, [c_ang8_src5_13_5_13] - pshufb m0, [c_ang8_src6_14_7_15] + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 - pmaddubsw m1, [c_ang8_26_20] + vbroadcasti128 m5, [ang8_fact_mode3] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, [c_ang8_14_8] pmulhrsw m2, m3 - pmaddubsw m4, [c_ang8_2_28] pmulhrsw m4, m3 - pmaddubsw m0, [c_ang8_22_16] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -18662,48 +18696,33 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang8_4, 3,4,5 - mova m3, [pw_1024] +cglobal intra_pred_ang8_4, 3,4,6 vbroadcasti128 m0, [r2 + 17] + mova m5, [ang8_shuf_mode4] + mova m3, [pb_2] - pshufb m1, m0, [c_ang8_src1_9_2_10] - pshufb m2, m0, [c_ang8_src2_10_3_11] - pshufb m4, m0, [c_ang8_src4_12_4_12] - pshufb m0, [c_ang8_src5_13_6_14] + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 - pmaddubsw m1, [c_ang8_21_10] + vbroadcasti128 m5, [ang8_fact_mode4] + mova m3, [pw_1024] + pmaddubsw m1, m5 + 
pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, [c_ang8_31_20] pmulhrsw m2, m3 - pmaddubsw m4, [c_ang8_9_30] pmulhrsw m4, m3 - pmaddubsw m0, [c_ang8_19_8] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -18743,48 +18762,33 @@ INIT_YMM avx2 -cglobal intra_pred_ang8_5, 3, 4, 5 - mova m3, [pw_1024] +cglobal intra_pred_ang8_5, 3, 4, 6 vbroadcasti128 m0, [r2 + 17] + mova m5, [ang8_shuf_mode5] + mova m3, [pb_2] - pshufb m1, m0, [c_ang8_src1_9_2_10] - pshufb m2, m0, [c_ang8_src2_10_3_11] - pshufb m4, m0, [c_ang8_src3_11_4_12] - pshufb m0, [c_ang8_src4_12_5_13] + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 - pmaddubsw m1, [c_ang8_17_2] + vbroadcasti128 m5, [ang8_fact_mode5] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, [c_ang8_19_4] pmulhrsw m2, m3 - pmaddubsw m4, [c_ang8_21_6] pmulhrsw m4, m3 - pmaddubsw m0, [c_ang8_23_8] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -18824,48 +18828,33 @@ INIT_YMM avx2 -cglobal intra_pred_ang8_6, 3, 4, 5 - mova m3, [pw_1024] +cglobal intra_pred_ang8_6, 3, 4, 6 vbroadcasti128 m0, [r2 + 17] + mova m5, [ang8_shuf_mode6] + mova m3, [pb_2] - pshufb m1, m0, [intra_pred_shuff_0_8] - pshufb m2, m0, [c_ang8_src2_10_2_10] - pshufb m4, m0, [c_ang8_src3_11_3_11] - pshufb m0, [c_ang8_src3_11_4_12] + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 - pmaddubsw m1, [c_ang8_13_26] + vbroadcasti128 m5, [ang8_fact_mode6] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, [c_ang8_7_20] pmulhrsw m2, m3 - pmaddubsw m4, [c_ang8_1_14] pmulhrsw m4, m3 - pmaddubsw m0, [c_ang8_27_8] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -18905,46 +18894,33 @@ INIT_YMM avx2 
-cglobal intra_pred_ang8_9, 3, 5, 5 - mova m3, [pw_1024] +cglobal intra_pred_ang8_9, 3, 5, 6 vbroadcasti128 m0, [r2 + 17] + mova m5, [ang8_shuf_mode9] + mova m3, [pb_2] - pshufb m0, [intra_pred_shuff_0_8] + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 - lea r4, [c_ang8_mode_27] - pmaddubsw m1, m0, [r4] + vbroadcasti128 m5, [ang8_fact_mode9] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, m0, [r4 + mmsize] pmulhrsw m2, m3 - pmaddubsw m4, m0, [r4 + 2 * mmsize] pmulhrsw m4, m3 - pmaddubsw m0, [r4 + 3 * mmsize] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -19015,48 +18991,33 @@ INIT_YMM avx2 -cglobal intra_pred_ang8_7, 3, 4, 5 - mova m3, [pw_1024] +cglobal intra_pred_ang8_7, 3, 4, 6 vbroadcasti128 m0, [r2 + 17] + mova m5, [ang8_shuf_mode7] + mova m3, [pb_2] - pshufb m1, m0, [intra_pred_shuff_0_8] - pshufb m2, m0, [c_ang8_src1_9_2_10] - pshufb m4, m0, [c_ang8_src2_10_2_10] - pshufb m0, [c_ang8_src2_10_3_11] + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 - pmaddubsw m1, [c_ang8_9_18] + vbroadcasti128 m5, [ang8_fact_mode7] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, [c_ang8_27_4] pmulhrsw m2, m3 - pmaddubsw m4, [c_ang8_13_22] pmulhrsw m4, m3 - pmaddubsw m0, [c_ang8_31_8] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -19097,48 +19058,32 @@ INIT_YMM avx2 cglobal intra_pred_ang8_8, 3, 4, 6 - mova m3, [pw_1024] vbroadcasti128 m0, [r2 + 17] - mova m5, [intra_pred_shuff_0_8] + mova m5, [ang8_shuf_mode8] + mova m3, [pb_2] pshufb m1, m0, m5 + paddb m5, m3 pshufb m2, m0, m5 + paddb m5, m3 pshufb m4, m0, m5 - pshufb m0, [c_ang8_src2_10_2_10] + paddb m5, m3 + pshufb m0, m5 - pmaddubsw m1, [c_ang8_5_10] + vbroadcasti128 m5, [ang8_fact_mode8] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, [c_ang8_15_20] pmulhrsw m2, m3 - pmaddubsw m4, [c_ang8_25_30] pmulhrsw m4, m3 - pmaddubsw m0, [c_ang8_3_8] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, 
[3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 @@ -19179,163 +19124,139 @@ INIT_YMM avx2 -cglobal intra_pred_ang8_11, 3, 5, 5 - mova m3, [pw_1024] +cglobal intra_pred_ang8_11, 3, 5, 6 + mova m3, [pw_1024] movu xm1, [r2 + 16] pinsrb xm1, [r2], 0 - pshufb xm1, [intra_pred_shuff_0_8] - vinserti128 m0, m1, xm1, 1 + vinserti128 m0, m1, xm1, 1 - lea r4, [c_ang8_mode_25] - pmaddubsw m1, m0, [r4] + mova m5, [ang8_shuf_mode9] + mova m3, [pb_2] + + pshufb m1, m0, m5 + paddb m5, m3 + pshufb m2, m0, m5 + paddb m5, m3 + pshufb m4, m0, m5 + paddb m5, m3 + pshufb m0, m5 + + vbroadcasti128 m5, [ang8_fact_mode11] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, m0, [r4 + mmsize] pmulhrsw m2, m3 - pmaddubsw m4, m0, [r4 + 2 * mmsize] pmulhrsw m4, m3 - pmaddubsw m0, [r4 + 3 * mmsize] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 cglobal intra_pred_ang8_15, 3, 6, 6 - mova m3, [pw_1024] - movu xm5, [r2 + 16] - pinsrb xm5, [r2], 0 - lea r5, [intra_pred_shuff_0_8] - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 2], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] + vbroadcasti128 m1, [r2 + 17] + vbroadcasti128 m2, [r2] + mova m3, [ang8_shuf_mode15 + mmsize] + pshufb m2, m3 + palignr m1, m2, 11 + + mova m5, [ang8_shuf_mode15] + mova m3, [pb_2] + pshufb m0, m1, m5 + psubb m5, m3 + pshufb m4, m1, m5 + psubb m5, m3 + pshufb m2, m1, m5 + psubb m5, m3 + pshufb m1, m5 - lea r4, [c_ang8_mode_15] - pmaddubsw m1, m0, [r4] + vbroadcasti128 m5, [ang8_fact_mode15] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 4], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] - pmaddubsw m2, m0, [r4 + mmsize] pmulhrsw m2, m3 - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 6], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] - pmaddubsw m4, m0, [r4 + 2 * mmsize] pmulhrsw m4, m3 - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 8], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] - pmaddubsw m0, [r4 + 3 * mmsize] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 -cglobal intra_pred_ang8_16, 3,4,7 - lea r0, [r0 + r1 * 8] - sub r0, r1 - neg r1 - 
lea r3, [r1 * 3] - vbroadcasti128 m0, [angHor8_tab_16] ; m0 = factor - mova m1, [intra_pred8_shuff16] ; m1 = 4 of Row shuffle - movu m2, [intra_pred8_shuff16 + 8] ; m2 = 4 of Row shuffle +cglobal intra_pred_ang8_16, 3,4,6 + vbroadcasti128 m1, [r2 + 17] + vbroadcasti128 m2, [r2] + mova m3, [ang8_shuf_mode16 + mmsize] + pshufb m2, m3 + palignr m1, m2, 10 + + mova m5, [ang8_shuf_mode16] + mova m3, [pb_2] + pshufb m0, m1, m5 + psubb m5, m3 + pshufb m4, m1, m5 + psubb m5, m3 + pshufb m2, m1, m5 + psubb m5, m3 + pshufb m1, m5 - ; prepare reference pixel - movq xm3, [r2 + 16 + 1] ; m3 = [-1 -2 -3 -4 -5 -6 -7 -8 x x x x x x x x] - movhps xm3, [r2 + 2] ; m3 = [-1 -2 -3 -4 -5 -6 -7 -8 2 3 x 5 6 x 8 x] - pslldq xm3, 1 - pinsrb xm3, [r2], 0 ; m3 = [ 0 -1 -2 -3 -4 -5 -6 -7 -8 2 3 x 5 6 x 8] - pshufb xm3, [c_ang8_mode_16] - vinserti128 m3, m3, xm3, 1 ; m3 = [-8 -7 -6 -5 -4 -3 -2 -1 0 2 3 5 6 8] + vbroadcasti128 m5, [ang8_fact_mode16] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 + pmulhrsw m1, m3 + pmulhrsw m2, m3 + pmulhrsw m4, m3 + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb m4, m0 - ; process 4 rows - pshufb m4, m3, m1 - pshufb m5, m3, m2 - psrldq m3, 4 - punpcklbw m6, m5, m4 - punpckhbw m5, m4 - pmaddubsw m6, m0 - pmulhrsw m6, [pw_1024] - pmaddubsw m5, m0 - pmulhrsw m5, [pw_1024] - packuswb m6, m5 - vextracti128 xm5, m6, 1 - movq [r0], xm6 - movhps [r0 + r1], xm6 - movq [r0 + r1 * 2], xm5 - movhps [r0 + r3], xm5 + ang8_store8x8 + RET - ; process 4 rows - lea r0, [r0 + r1 * 4] - pshufb m4, m3, m1 - pshufb m5, m3, m2 - punpcklbw m6, m5, m4 - punpckhbw m5, m4 - pmaddubsw m6, m0 - pmulhrsw m6, [pw_1024] - pmaddubsw m5, m0 - pmulhrsw m5, [pw_1024] - packuswb m6, m5 - vextracti128 xm5, m6, 1 - movq [r0], xm6 - movhps [r0 + r1], xm6 - movq [r0 + r1 * 2], xm5 - movhps [r0 + r3], xm5 +INIT_YMM avx2 +cglobal intra_pred_ang8_17, 3,4,6 + vbroadcasti128 m1, [r2 + 17] + vbroadcasti128 m2, [r2] + mova m3, [ang8_shuf_mode17 + mmsize] + pshufb m2, m3 + palignr m1, m2, 9 + + mova m5, [ang8_shuf_mode17] + mova m3, [pb_2] + pshufb m0, m1, m5 + psubb m5, m3 + pshufb m4, m1, m5 + psubb m5, m3 + pshufb m2, m1, m5 + psubb m5, m3 + pshufb m1, m5 + + vbroadcasti128 m5, [ang8_fact_mode17] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 + pmulhrsw m1, m3 + pmulhrsw m2, m3 + pmulhrsw m4, m3 + pmulhrsw m0, m3 + packuswb m1, m2 + packuswb m4, m0 + + ang8_store8x8 RET %if 1 @@ -19548,113 +19469,73 @@ INIT_YMM avx2 cglobal intra_pred_ang8_14, 3, 6, 6 - mova m3, [pw_1024] - movu xm5, [r2 + 16] - pinsrb xm5, [r2], 0 - lea r5, [intra_pred_shuff_0_8] - vinserti128 m0, m5, xm5, 1 - pshufb m0, [r5] + movu xm1, [r2 + 13] + vinserti128 m1, m1, xm1, 1 - lea r4, [c_ang8_mode_14] - pmaddubsw m1, m0, [r4] + pinsrb xm1, [r2 + 0], 3 + pinsrb xm1, [r2 + 2], 2 + pinsrb xm1, [r2 + 5], 1 + pinsrb xm1, [r2 + 7], 0 + vinserti128 m1, m1, xm1, 1 + + mova m5, [ang8_shuf_mode14] + mova m3, [pb_2] + pshufb m0, m1, m5 + psubb m5, m3 + pshufb m4, m1, m5 + psubb m5, m3 + pshufb m2, m1, m5 + psubb m5, m3 + pshufb m1, m5 + + vbroadcasti128 m5, [ang8_fact_mode14] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pslldq xm5, 1 - pinsrb xm5, [r2 + 2], 0 - vinserti128 m0, m5, xm5, 1 - pshufb m0, [r5] - pmaddubsw m2, m0, [r4 + mmsize] pmulhrsw m2, m3 - pslldq xm5, 1 - pinsrb xm5, [r2 + 5], 0 - vinserti128 m0, m5, xm5, 1 - pshufb m0, [r5] - pmaddubsw m4, m0, [r4 + 2 * mmsize] pmulhrsw m4, m3 - pslldq 
xm5, 1 - pinsrb xm5, [r2 + 7], 0 - pshufb xm5, [r5] - vinserti128 m0, m0, xm5, 1 - pmaddubsw m0, [r4 + 3 * mmsize] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2 cglobal intra_pred_ang8_13, 3, 6, 6 - mova m3, [pw_1024] - movu xm5, [r2 + 16] - pinsrb xm5, [r2], 0 - lea r5, [intra_pred_shuff_0_8] - vinserti128 m0, m5, xm5, 1 - pshufb m0, [r5] + movu xm1, [r2 + 14] + pinsrb xm1, [r2 + 0], 2 + pinsrb xm1, [r2 + 4], 1 + pinsrb xm1, [r2 + 7], 0 + vinserti128 m1, m1, xm1, 1 + + mova m5, [ang8_shuf_mode13] + mova m3, [pb_2] + pshufb m0, m1, m5 + psubb m5, m3 + pshufb m4, m1, m5 + psubb m5, m3 + pshufb m2, m1, m5 + psubb m5, m3 + pshufb m1, m5 - lea r4, [c_ang8_mode_13] - pmaddubsw m1, m0, [r4] + vbroadcasti128 m5, [ang8_fact_mode13] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pslldq xm5, 1 - pinsrb xm5, [r2 + 4], 0 - pshufb xm4, xm5, [r5] - vinserti128 m0, m0, xm4, 1 - pmaddubsw m2, m0, [r4 + mmsize] pmulhrsw m2, m3 - vinserti128 m0, m0, xm4, 0 - pmaddubsw m4, m0, [r4 + 2 * mmsize] pmulhrsw m4, m3 - pslldq xm5, 1 - pinsrb xm5, [r2 + 7], 0 - pshufb xm5, [r5] - vinserti128 m0, m0, xm5, 1 - pmaddubsw m0, [r4 + 3 * mmsize] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET @@ -19703,51 +19584,36 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang8_12, 3, 5, 5 - mova m3, [pw_1024] - movu xm1, [r2 + 16] - pinsrb xm1, [r2], 0 - pshufb xm1, [intra_pred_shuff_0_8] - vinserti128 m0, m1, xm1, 1 +cglobal intra_pred_ang8_12, 3, 5, 6 + movu xm1, [r2 + 15] + pinsrb xm1, [r2 + 0], 1 + pinsrb xm1, [r2 + 6], 0 + vinserti128 m1, m1, xm1, 1 + + mova m5, [ang8_shuf_mode12] + mova m3, [pb_2] + pshufb m0, m1, m5 + psubb m5, m3 + pshufb m4, m1, m5 + psubb m5, m3 + pshufb m2, m1, m5 + psubb m5, m3 + pshufb m1, m5 - lea r4, [c_ang8_mode_24] - pmaddubsw m1, m0, [r4] + vbroadcasti128 m5, [ang8_fact_mode12] + mova m3, [pw_1024] + pmaddubsw m1, m5 + pmaddubsw m2, m5 + pmaddubsw m4, m5 + pmaddubsw m0, m5 pmulhrsw m1, m3 - pmaddubsw m2, m0, [r4 + mmsize] pmulhrsw m2, m3 - pmaddubsw m4, m0, [r4 + 2 * mmsize] pmulhrsw m4, m3 - pslldq xm0, 2 - pinsrb xm0, [r2 + 6], 0 - pinsrb xm0, [r2 + 0], 1 - vinserti128 m0, m0, xm0, 1 - pmaddubsw m0, [r4 + 3 * mmsize] pmulhrsw m0, m3 packuswb m1, m2 packuswb m4, m0 - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 - - lea r3, [3 * r1] - 
movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 + ang8_store8x8 RET INIT_YMM avx2
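Note: each rewritten intra_pred_ang8 mode above now builds its reference row with pshufb shuffles and performs the whole angular blend with one pmaddubsw/pmulhrsw pair per row group. A scalar sketch of the arithmetic these kernels implement follows; the helper and table names are illustrative, not x265 identifiers.

#include <cstdint>

// pmaddubsw against a packed (32-fact, fact) byte pair computes the HEVC
// two-tap blend; pmulhrsw against pw_1024 is exactly (x + 16) >> 5, since
// (x * 1024 + 0x4000) >> 15 == (x + 16) >> 5.
static inline uint8_t angBlend(const uint8_t* ref, int idx, int fact)
{
    int x = (32 - fact) * ref[idx] + fact * ref[idx + 1];
    return (uint8_t)((x + 16) >> 5);
}

// 8x8 angular prediction with one reference index/fraction per row
// (rowIdx/rowFact stand in for the ang8_shuf/ang8_fact tables)
void intraAng8Ref(uint8_t* dst, intptr_t stride, const uint8_t* ref,
                  const int rowIdx[8], const int rowFact[8])
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = angBlend(ref, rowIdx[y] + x, rowFact[y]);
}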
x265_1.9.tar.gz/source/common/x86/ipfilter16.asm -> x265_2.0.tar.gz/source/common/x86/ipfilter16.asm
Changed
@@ -116,6 +116,7 @@ dw -1, 4, -11, 40, 40, -11, 4, -1 dw 0, 1, -5, 17, 58, -10, 4, -1 +ALIGN 32 tab_LumaCoeffV: times 4 dw 0, 0 times 4 dw 0, 64 times 4 dw 0, 0 @@ -161,9 +162,8 @@ const interp8_hpp_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 -const pb_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 - db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 - +const interp8_hpp_shuf_new, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9 + db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13 SECTION .text cextern pd_8 @@ -10407,7 +10407,7 @@ vpbroadcastq m0, [tab_LumaCoeff + r4] vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10475,7 +10475,7 @@ vpbroadcastq m0, [tab_LumaCoeff + r4] vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10536,16 +10536,16 @@ add r3d, r3d mov r4d, r4m mov r5d, r5m - shl r4d, 4 + shl r4d, 6 %ifdef PIC - lea r6, [tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4] - vpbroadcastq m1, [r6 + r4 + 8] + lea r6, [tab_LumaCoeffV] + movu m0, [r6 + r4] + movu m1, [r6 + r4 + mmsize] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + movu m0, [tab_LumaCoeffV + r4] + movu m1, [tab_LumaCoeffV + r4 + mmsize] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf_new] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10554,7 +10554,7 @@ sub r0, 6 test r5d, r5d mov r4d, %2 - jz .loop0 + jz .loop0 lea r6, [r1*3] sub r0, r6 add r4d, 7 @@ -10563,64 +10563,64 @@ %assign x 0 %rep %1/16 vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] + vbroadcasti128 m5, [r0 + 4 * SIZEOF_PIXEL + x] pshufb m4, m3 - pshufb m7, m5, m3 + pshufb m5, m3 pmaddwd m4, m0 - pmaddwd m7, m1 + pmaddwd m7, m5, m1 paddd m4, m7 + vextracti128 xm7, m4, 1 + paddd xm4, xm7 + paddd xm4, xm2 + psrad xm4, INTERP_SHIFT_PS vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m7, m6, m3 + pshufb m6, m3 pmaddwd m5, m0 - pmaddwd m7, m1 + pmaddwd m7, m6, m1 paddd m5, m7 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5,m4, 1 - psrad xm4, INTERP_SHIFT_PS + vextracti128 xm7, m5, 1 + paddd xm5, xm7 + paddd xm5, xm2 psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 + packssdw xm4, xm5 movu [r2 + x], xm4 vbroadcasti128 m5, [r0 + 24 + x] - pshufb m6, m3 - pshufb m7, m5, m3 + pshufb m5, m3 pmaddwd m6, m0 - pmaddwd m7, m1 + pmaddwd m7, m5, m1 paddd m6, m7 + vextracti128 xm7, m6, 1 + paddd xm6, xm7 + paddd xm6, xm2 + psrad xm6, INTERP_SHIFT_PS vbroadcasti128 m7, [r0 + 32 + x] - pshufb m5, m3 pshufb m7, m3 pmaddwd m5, m0 pmaddwd m7, m1 paddd m5, m7 - - phaddd m6, m5 - vpermq m6, m6, q3120 - paddd m6, m2 - vextracti128 xm5,m6, 1 - psrad xm6, INTERP_SHIFT_PS + vextracti128 xm7, m5, 1 + paddd xm5, xm7 + paddd xm5, xm2 psrad xm5, INTERP_SHIFT_PS - packssdw xm6, xm5 + packssdw xm6, xm5 movu [r2 + 16 + x], xm6 - %assign x x+32 - %endrep + +%assign x x+32 +%endrep add r2, r3 add r0, r1 dec r4d - jnz .loop0 + jnz .loop0 RET %endif %endmacro @@ -10656,7 +10656,7 @@ vpbroadcastq m0, [tab_LumaCoeff + r4] vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10749,7 +10749,7 @@ vpbroadcastq m0, [tab_LumaCoeff + r4] vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] %endif - mova m3, 
[pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10824,7 +10824,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10883,7 +10883,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -10956,7 +10956,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -11038,7 +11038,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -11103,7 +11103,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -11204,7 +11204,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -11357,7 +11357,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map @@ -11477,7 +11477,7 @@ %else vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] %endif - mova m3, [pb_shuf] + mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map
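Note: the ipfilter16 rewrite above drops the phaddd/vpermq horizontal reduction in favour of vextracti128 plus paddd, and indexes the newly 32-byte-aligned tab_LumaCoeffV with shl r4d, 6 (64 bytes per coefficient set). A scalar model of the 8-tap _ps path these kernels vectorize follows; the prototype and the shift/offset parameters (standing in for INTERP_SHIFT_PS / INTERP_OFFSET_PS) are illustrative.

#include <cstdint>

// HEVC 8-tap luma filter sets (the dw tables visible in the diff above)
static const int16_t lumaCoeff[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

// Horizontal _ps variant: 16-bit input, 16-bit intermediate output, no clip
void interp8HorizPS(const uint16_t* src, intptr_t srcStride,
                    int16_t* dst, intptr_t dstStride,
                    int width, int height, int coeffIdx, int shift, int offset)
{
    const int16_t* c = lumaCoeff[coeffIdx];
    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += c[k] * (int)src[x + k - 3];   // centered taps, cf. 'sub r0, 6'
            dst[x] = (int16_t)((sum + offset) >> shift);
        }
}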
x265_1.9.tar.gz/source/common/x86/loopfilter.asm -> x265_2.0.tar.gz/source/common/x86/loopfilter.asm
Changed
@@ -29,9 +29,6 @@ %include "x86util.asm" SECTION_RODATA 32 -pb_31: times 32 db 31 -pb_124: times 32 db 124 -pb_15: times 32 db 15 SECTION .text cextern pb_1 @@ -39,6 +36,10 @@ cextern pb_3 cextern pb_4 cextern pb_01 +cextern pb_0123 +cextern pb_15 +cextern pb_31 +cextern pb_124 cextern pb_128 cextern pw_1 cextern pw_n1 @@ -48,7 +49,9 @@ cextern pb_movemask cextern pb_movemask_32 cextern hmul_16p - +cextern pw_1_ffff +cextern pb_shuf_off4 +cextern pw_shuf_off4 ;============================================================================================================ ; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride) @@ -154,7 +157,9 @@ sub r4d, 16 jnz .loopH RET -%else ; HIGH_BIT_DEPTH + +%else ; HIGH_BIT_DEPTH == 1 + cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride mov r4d, r4m @@ -240,7 +245,7 @@ sub r4d, 16 jnz .loopH RET -%endif +%endif ; HIGH_BIT_DEPTH == 0 INIT_YMM avx2 %if HIGH_BIT_DEPTH @@ -2061,6 +2066,117 @@ ; saoCuStatsE0(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) ;----------------------------------------------------------------------------------------------------------------------- %if ARCH_X86_64 + +%if HIGH_BIT_DEPTH == 1 +INIT_XMM sse4 +cglobal saoCuStatsE0, 3,10,8, 0-32 + mov r3d, r3m + mov r4d, r4m + mov r9, r5mp + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m4, [pw_1] + mova m5, [pb_2] + xor r7d, r7d + + ; correct stride for diff[] and rec + mov r6d, r3d + and r6d, ~15 + sub r2, r6 + lea r8, [(r6 - 64) * 2] ; 64 = MAX_CU_SIZE + + FIX_STRIDES r2 + +.loopH: + mov r5d, r3d + + ; calculate signLeft + mov r7w, [r1] + sub r7w, [r1 - SIZEOF_PIXEL] + seta r7b + setb r6b + sub r7b, r6b + neg r7b + pinsrb m0, r7d, 15 + +.loopL: + + movu m3, [r1] + movu m2, [r1 + SIZEOF_PIXEL] + pcmpgtw m6, m3, m2 + pcmpgtw m2, m3 + pand m6, m4 + por m2, m6 + + movu m3, [r1 + mmsize] + movu m6, [r1 + mmsize + SIZEOF_PIXEL] + pcmpgtw m7, m3, m6 + pcmpgtw m6, m3 + pand m7, m4 + por m7, m6 + + packsswb m2, m7 ; signRight + + palignr m3, m2, m0, 15 + + pxor m6, m6 + psubb m6, m3 ; signLeft + + mova m0, m2 + paddb m2, m6 + paddb m2, m5 ; edgeType + + ; stats[edgeType] +%assign x 0 +%rep 16 + pextrb r7d, m2, x + + movsx r6d, word [r0 + x * 2] + inc word [rsp + r7 * 2] ; tmp_count[edgeType]++ + add [rsp + 5 * 2 + r7 * 4], r6d ; tmp_stats[edgeType] += (fenc[x] - rec[x]) + dec r5d + jz .next +%assign x x+1 +%endrep + + add r0, 16*2 + add r1, 16 * SIZEOF_PIXEL + jmp .loopL + +.next: + sub r0, r8 + add r1, r2 + + dec r4d + jnz .loopH + + ; sum to global buffer + mov r0, r6mp + + ; s_eoTable = {1, 2, 0, 3, 4} + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r9] + paddd m0, m1 + movu [r9], m0 + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r9 + 4 * 4], r6d + RET +%endif ; HIGH_BIT_DEPTH=1 + + +%if HIGH_BIT_DEPTH == 0 INIT_XMM sse4 cglobal saoCuStatsE0, 3,10,6, 0-32 mov r3d, r3m @@ -2086,7 +2202,7 @@ ; calculate signLeft mov r7b, [r1] - sub r7b, [r1 - 1] + sub r7b, [r1 - SIZEOF_PIXEL] seta r7b setb r6b sub r7b, r6b @@ -2095,13 +2211,14 @@ .loopL: movu m3, [r1] - movu m2, [r1 + 1] + movu m2, [r1 + SIZEOF_PIXEL] pxor m1, m3, m4 pxor m2, m4 pcmpgtb m3, m1, m2 pcmpgtb m2, m1 pand m3, [pb_1] + por m2, m3 ; signRight palignr m3, m2, m0, 15 @@ -2125,7 +2242,7 @@ 
%endrep add r0, 16*2 - add r1, 16 + add r1, 16 * SIZEOF_PIXEL jmp .loopL .next: @@ -2155,6 +2272,7 @@ mov r6d, [rsp + 5 * 2 + 4 * 4] add [r9 + 4 * 4], r6d RET +%endif ; HIGH_BIT_DEPTH=0 ;----------------------------------------------------------------------------------------------------------------------- @@ -2341,6 +2459,112 @@ ; saoCuStatsE1_c(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) ;------------------------------------------------------------------------------------------------------------------------------------------- %if ARCH_X86_64 + +%if HIGH_BIT_DEPTH +INIT_XMM sse4 +cglobal saoCuStatsE1, 4,12,8,0-32 ; Stack: 5 of stats and 5 of count + mov r5d, r5m + mov r4d, r4m + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m5, [pw_1] + mova m6, [pb_2] + movh m7, [r3 + r4] + + FIX_STRIDES r2d + +.loopH: + mov r6d, r4d + mov r9, r0 + mov r10, r1 + mov r11, r3 + +.loopW: + ; signDown + movu m1, [r10] + movu m2, [r10 + r2] + pcmpgtw m3, m1, m2 + pcmpgtw m2, m1 + pand m3, m5 + por m2, m3 + + movu m3, [r10 + mmsize] + movu m4, [r10 + mmsize + r2] + pcmpgtw m0, m3, m4 + pcmpgtw m4, m3 + pand m0, m5 + por m4, m0 + packsswb m2, m4 + + pxor m3, m3 + psubb m3, m2 ; -signDown + + ; edgeType + movu m4, [r11] + paddb m4, m6 + paddb m2, m4 + + ; update upBuff1 + movu [r11], m3 + + ; 16 pixels +%assign x 0 +%rep 16 + pextrb r7d, m2, x + inc word [rsp + r7 * 2] + + ; stats[edgeType] + movsx r8d, word [r9 + x * 2] + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r9, mmsize * 2 + add r10, mmsize * SIZEOF_PIXEL + add r11, mmsize + jmp .loopW + +.next: + ; restore pointer upBuff1 + add r0, 64*2 ; MAX_CU_SIZE + add r1, r2 + + dec r5d + jg .loopH + + ; restore unavailable pixels + movh [r3 + r4], m7 + + ; sum to global buffer + mov r1, r6m + mov r0, r7m + + ; s_eoTable = {1,2,0,3,4} + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r1] + paddd m0, m1 + movu [r1], m0 + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET + +%else ; HIGH_BIT_DEPTH == 1 + INIT_XMM sse4 cglobal saoCuStatsE1, 4,12,8,0-32 ; Stack: 5 of stats and 5 of count mov r5d, r5m @@ -2435,6 +2659,7 @@ mov r6d, [rsp + 5 * 2 + 4 * 4] add [r1 + 4 * 4], r6d RET +%endif ; HIGH_BIT_DEPTH == 0 INIT_YMM avx2 @@ -2650,6 +2875,129 @@ ;} %if ARCH_X86_64 + +%if HIGH_BIT_DEPTH == 1 +INIT_XMM sse4 +cglobal saoCuStatsE2, 5,9,7,0-32 ; Stack: 5 of stats and 5 of count + mov r5d, r5m + FIX_STRIDES r2d + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m5, [pw_1] + mova m6, [pb_2] + +.loopH: + ; TODO: merge into SIMD in below + ; get upBuffX[0] + mov r6w, [r1 + r2] + sub r6w, [r1 - 1 * SIZEOF_PIXEL] + seta r6b + setb r7b + sub r6b, r7b + mov [r4], r6b + + ; backup unavailable pixels + movh m0, [r4 + r5 + 1] + + mov r6d, r5d +.loopW: + ; signDown + ; stats[edgeType] + ; edgeType + movu m1, [r1] + movu m2, [r1 + r2 + 1 * SIZEOF_PIXEL] + pcmpgtw m3, m1, m2 + pcmpgtw m2, m1 + pand m2, m5 + por m3, m2 + + movu m1, [r1 + mmsize] + movu m2, [r1 + r2 + 1 * SIZEOF_PIXEL + mmsize] + pcmpgtw m4, m1, m2 + pcmpgtw m2, m1 + pand m2, m5 + por m4, m2 + packsswb m3, m4 + + movu m4, [r3] + paddb m4, m6 + psubb m4, m3 + + ; update upBuff1 + movu [r4 + 1], m3 + + ; 16 pixels 
+%assign x 0 +%rep 16 + pextrb r7d, m4, x + inc word [rsp + r7 * 2] + + movsx r8d, word [r0 + x * 2] + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r0, mmsize * 2 + add r1, mmsize * SIZEOF_PIXEL + add r3, mmsize + add r4, mmsize + jmp .loopW + +.next: + xchg r3, r4 + + ; restore pointer upBuff1 + mov r6d, r5d + and r6d, ~15 + neg r6 ; MUST BE 64-bits, it is Negtive + + ; move to next row + + ; move back to start point + add r3, r6 + add r4, r6 + + ; adjust with stride + lea r0, [r0 + (r6 + 64) * 2] ; 64 = MAX_CU_SIZE + add r1, r2 + lea r1, [r1 + r6 * SIZEOF_PIXEL] + + ; restore unavailable pixels + movh [r3 + r5 + 1], m0 + + dec byte r6m + jg .loopH + + ; sum to global buffer + mov r1, r7m + mov r0, r8m + + ; s_eoTable = {1,2,0,3,4} + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r1] + paddd m0, m1 + movu [r1], m0 + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET + +%else ; HIGH_BIT_DEPTH == 1 + ; TODO: x64 only because I need temporary register r7,r8, easy portab to x86 INIT_XMM sse4 cglobal saoCuStatsE2, 5,9,8,0-32 ; Stack: 5 of stats and 5 of count @@ -2767,6 +3115,7 @@ add [r1 + 4 * 4], r6d RET +%endif ; HIGH_BIT_DEPTH == 0 INIT_YMM avx2 cglobal saoCuStatsE2, 5,10,16 ; Stack: 5 of stats and 5 of count @@ -2994,6 +3343,119 @@ ;} %if ARCH_X86_64 + +%if HIGH_BIT_DEPTH == 1 +INIT_XMM sse4 +cglobal saoCuStatsE3, 4,9,8,0-32 ; Stack: 5 of stats and 5 of count + mov r4d, r4m + mov r5d, r5m + FIX_STRIDES r2d + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + ;mova m0, [pb_128] + mova m5, [pw_1] + mova m6, [pb_2] + movh m7, [r3 + r4] + +.loopH: + mov r6d, r4d + +.loopW: + ; signDown + movu m1, [r1] + movu m2, [r1 + r2 - 1 * SIZEOF_PIXEL] + pcmpgtw m3, m1, m2 + pcmpgtw m2, m1 + pand m2, m5 + por m3, m2 + + movu m1, [r1 + mmsize] + movu m2, [r1 + r2 - 1 * SIZEOF_PIXEL + mmsize] + pcmpgtw m4, m1, m2 + pcmpgtw m2, m1 + pand m2, m5 + por m4, m2 + packsswb m3, m4 + + ; edgeType + movu m4, [r3] + paddb m4, m6 + psubb m4, m3 + + ; update upBuff1 + movu [r3 - 1], m3 + + ; stats[edgeType] + pxor m1, m0 + + ; 16 pixels +%assign x 0 +%rep 16 + pextrb r7d, m4, x + inc word [rsp + r7 * 2] + + movsx r8d, word [r0 + x * 2] + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r0, 16 * 2 + add r1, 16 * SIZEOF_PIXEL + add r3, 16 + jmp .loopW + +.next: + ; restore pointer upBuff1 + mov r6d, r4d + and r6d, ~15 + neg r6 ; MUST BE 64-bits, it is Negtive + + ; move to next row + + ; move back to start point + add r3, r6 + + ; adjust with stride + lea r0, [r0 + (r6 + 64) * 2] ; 64 = MAX_CU_SIZE + add r1, r2 + lea r1, [r1 + r6 * SIZEOF_PIXEL] + + dec r5d + jg .loopH + + ; restore unavailable pixels + movh [r3 + r4], m7 + + ; sum to global buffer + mov r1, r6m + mov r0, r7m + + ; s_eoTable = {1,2,0,3,4} + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r1] + paddd m0, m1 + movu [r1], m0 + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET + +%else ; HIGH_BIT_DEPTH == 1 + INIT_XMM sse4 cglobal saoCuStatsE3, 4,9,8,0-32 ; Stack: 5 of stats and 5 of count mov r4d, r4m @@ -3099,6 +3561,7 @@ add [r1 + 4 * 4], r6d RET +%endif ; 
HIGH_BIT_DEPTH == 0 INIT_YMM avx2 cglobal saoCuStatsE3, 4,10,16 ; Stack: 5 of stats and 5 of count @@ -3297,6 +3760,9 @@ INIT_XMM sse4 cglobal pelFilterLumaStrong_H, 5,7,10 +%if HIGH_BIT_DEPTH + add r2d, r2d +%endif mov r1, r2 neg r3d neg r4d @@ -3305,6 +3771,16 @@ lea r5, [r2 * 3] lea r6, [r1 * 3] +%if HIGH_BIT_DEPTH + movu m4, [r0] ; src[0] + movu m3, [r0 + r1] ; src[-offset] + movu m2, [r0 + r1 * 2] ; src[-offset * 2] + movu m1, [r0 + r6] ; src[-offset * 3] + movu m0, [r0 + r1 * 4] ; src[-offset * 4] + movu m5, [r0 + r2] ; src[offset] + movu m6, [r0 + r2 * 2] ; src[offset * 2] + movu m7, [r0 + r5] ; src[offset * 3] +%else pmovzxbw m4, [r0] ; src[0] pmovzxbw m3, [r0 + r1] ; src[-offset] pmovzxbw m2, [r0 + r1 * 2] ; src[-offset * 2] @@ -3313,6 +3789,7 @@ pmovzxbw m5, [r0 + r2] ; src[offset] pmovzxbw m6, [r0 + r2 * 2] ; src[offset * 2] pmovzxbw m7, [r0 + r5] ; src[offset * 3] +%endif paddw m0, m0 ; m0*2 mova m8, m2 @@ -3380,6 +3857,15 @@ paddw m0, m1 paddw m3, m4 paddw m9, m5 + +%if HIGH_BIT_DEPTH + movh [r0 + r6], m0 + movhps [r0 + r1], m0 + movh [r0], m3 + movhps [r0 + r2 * 2], m3, + movh [r0 + r2 * 1], m9 + movhps [r0 + r1 * 2], m9 +%else packuswb m0, m0 packuswb m3, m9 @@ -3389,14 +3875,41 @@ pextrd [r0 + r2 * 2], m3, 1 pextrd [r0 + r2 * 1], m3, 2 pextrd [r0 + r1 * 2], m3, 3 +%endif RET INIT_XMM sse4 cglobal pelFilterLumaStrong_V, 5,5,10 +%if HIGH_BIT_DEPTH + add r1d, r1d +%endif neg r3d neg r4d lea r2, [r1 * 3] +%if HIGH_BIT_DEPTH + movu m0, [r0 - 8] ; src[-offset * 4] row 0 + movu m1, [r0 + r1 * 1 - 8] ; src[-offset * 4] row 1 + movu m2, [r0 + r1 * 2 - 8] ; src[-offset * 4] row 2 + movu m3, [r0 + r2 * 1 - 8] ; src[-offset * 4] row 3 + + punpckhwd m4, m0, m1 ; [m4 m4 m5 m5 m6 m6 m7 m7] + punpcklwd m0, m1 ; [m0 m0 m1 m1 m2 m2 m3 m3] + + punpckhwd m5, m2, m3 ; [m4 m4 m5 m5 m6 m6 m7 m7] + punpcklwd m2, m3 ; [m0 m0 m1 m1 m2 m2 m3 m3] + + punpckhdq m3, m0, m2 ; [m2 m2 m2 m2 m3 m3 m3 m3] + punpckldq m0, m2 ; [m0 m0 m0 m0 m1 m1 m1 m1] + psrldq m1, m0, 8 ; [m1 m1 m1 m1 x x x x] + mova m2, m3 ; [m2 m2 m2 m2 x x x x] + punpckhqdq m3, m3 ; [m3 m3 m3 m3 x x x x] + + punpckhdq m6, m4, m5 ; [m6 m6 m6 m6 m7 m7 m7 m7] + punpckldq m4, m5 ; [m4 m4 m4 m4 m5 m5 m5 m5] + psrldq m7, m6, 8 + psrldq m5, m4, 8 +%else movh m0, [r0 - 4] ; src[-offset * 4] row 0 movh m1, [r0 + r1 * 1 - 4] ; src[-offset * 4] row 1 movh m2, [r0 + r1 * 2 - 4] ; src[-offset * 4] row 2 @@ -3429,6 +3942,7 @@ pmovzxbw m5, m5 pmovzxbw m6, m6 pmovzxbw m7, m7 +%endif paddw m0, m0 ; m0*2 mova m8, m2 @@ -3496,6 +4010,35 @@ paddw m0, m1 paddw m3, m4 paddw m9, m5 + +%if HIGH_BIT_DEPTH + ; 4x6 output rows - + ; m0 - col 0 + ; m3 - col 3 + + psrldq m1, m0, 8 + psrldq m2, m3, 8 + + mova m4, m9 + psrldq m5, m9, 8 + + ; transpose 4x6 to 6x4 + punpcklwd m0, m5 + punpcklwd m1, m3 + punpcklwd m4, m2 + + punpckldq m9, m0, m1 + punpckhdq m0, m1 + + movh [r0 + r1 * 0 - 6], m9 + movhps [r0 + r1 * 1 - 6], m9 + movh [r0 + r1 * 2 - 6], m0 + movhps [r0 + r2 * 1 - 6], m0 + pextrd [r0 + r1 * 0 + 2], m4, 0 + pextrd [r0 + r1 * 1 + 2], m4, 1 + pextrd [r0 + r1 * 2 + 2], m4, 2 + pextrd [r0 + r2 * 1 + 2], m4, 3 +%else packuswb m0, m0 packuswb m3, m9 @@ -3525,5 +4068,143 @@ pextrw [r0 + r1 * 1 + 1], m4, 1 pextrw [r0 + r1 * 2 + 1], m4, 2 pextrw [r0 + r2 * 1 + 1], m4, 3 +%endif + RET +%endif ; ARCH_X86_64 + +%if ARCH_X86_64 +INIT_XMM sse4 +cglobal pelFilterChroma_H, 6,6,5 +%if HIGH_BIT_DEPTH + add r2d, r2d +%endif + mov r1, r2 + neg r3d + neg r1 + +%if HIGH_BIT_DEPTH + movu m4, [r0] ; src[0] + movu m3, [r0 + r1] ; src[-offset] + movu m0, [r0 + r2] ; src[offset] + movu 
m2, [r0 + r1 * 2] ; src[-offset * 2] +%else + pmovzxbw m4, [r0] ; src[0] + pmovzxbw m3, [r0 + r1] ; src[-offset] + pmovzxbw m0, [r0 + r2] ; src[offset] + pmovzxbw m2, [r0 + r1 * 2] ; src[-offset * 2] +%endif + + psubw m1, m4, m3 ; m4 - m3 + psubw m2, m0 ; m2 - m5 + paddw m2, [pw_4] + psllw m1, 2 ; (m4 - m3) * 4 + paddw m1, m2 + psraw m1, 3 + + movd m0, r3d + pshufb m0, [pb_01] ; -tc + + pmaxsw m1, m0 + psignw m0, [pw_n1] + pminsw m1, m0 ; delta + punpcklqdq m1, m1 + + shl r5d, 16 + or r5w, r4w + punpcklqdq m3, m4 + mova m2, [pw_1_ffff] + + movd m0, r5d + pshufb m0, [pb_0123] + + pand m0, m1 ; (delta & maskP) (delta & maskQ) + psignw m0, m2 + paddw m3, m0 + + pxor m0, m0 + pmaxsw m3, m0 + pminsw m3, [pw_pixel_max] + +%if HIGH_BIT_DEPTH + movh [r0 + r1], m3 + movhps [r0], m3 +%else + packuswb m3, m3 + movd [r0 + r1], m3 + pextrd [r0], m3, 1 +%endif + RET + +INIT_XMM sse4 +cglobal pelFilterChroma_V, 6,6,5 +%if HIGH_BIT_DEPTH + add r1d, r1d +%endif + neg r3d + lea r2, [r1 * 3] + +%if HIGH_BIT_DEPTH + movu m4, [r0 + r1 * 0 - 4] ; src[-offset*2, -offset, 0, offset] [m2 m3 m4 m5] + movu m3, [r0 + r1 * 1 - 4] + movu m0, [r0 + r1 * 2 - 4] + movu m2, [r0 + r2 * 1 - 4] +%else + pmovzxbw m4, [r0 + r1 * 0 - 2] ; src[-offset*2, -offset, 0, offset] [m2 m3 m4 m5] + pmovzxbw m3, [r0 + r1 * 1 - 2] + pmovzxbw m0, [r0 + r1 * 2 - 2] + pmovzxbw m2, [r0 + r2 * 1 - 2] +%endif + punpcklwd m4, m3 + punpcklwd m0, m2 + punpckldq m2, m4, m0 ; [m2 m2 m2 m2 m3 m3 m3 m3] + punpckhdq m4, m0 ; [m4 m4 m4 m4 m5 m5 m5 m5] + psrldq m3, m2, 8 + psrldq m0, m4, 8 + + psubw m1, m4, m3 ; m4 - m3 + psubw m2, m0 ; m2 - m5 + paddw m2, [pw_4] + psllw m1, 2 ; (m4 - m3) * 4 + paddw m1, m2 + psraw m1, 3 + + movd m0, r3d + pshufb m0, [pb_01] ; -tc + + pmaxsw m1, m0 + psignw m0, [pw_n1] + pminsw m1, m0 ; delta + punpcklqdq m1, m1 + + shl r5d, 16 + or r5w, r4w + punpcklqdq m3, m4 + mova m2, [pw_1_ffff] + + movd m0, r5d + pshufb m0, [pb_0123] + + pand m0, m1 ; (delta & maskP) (delta & maskQ) + psignw m0, m2 + paddw m3, m0 + + pxor m0, m0 + pmaxsw m3, m0 + pminsw m3, [pw_pixel_max] + +%if HIGH_BIT_DEPTH + pshufb m3, [pw_shuf_off4] + pextrd [r0 + r1 * 0 - 2], m3, 0 + pextrd [r0 + r1 * 1 - 2], m3, 1 + pextrd [r0 + r1 * 2 - 2], m3, 2 + pextrd [r0 + r2 * 1 - 2], m3, 3 +%else + packuswb m3, m3 + pshufb m3, [pb_shuf_off4] + pextrw [r0 + r1 * 0 - 1], m3, 0 + pextrw [r0 + r1 * 1 - 1], m3, 1 + pextrw [r0 + r1 * 2 - 1], m3, 2 + pextrw [r0 + r2 * 1 - 1], m3, 3 +%endif RET %endif ; ARCH_X86_64
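Note: all four new HIGH_BIT_DEPTH saoCuStats kernels above share one classification rule: edgeType = sign(c - neighbor1) + sign(c - neighbor2) + 2, with a count and a diff sum accumulated per edge type; the per-call temporaries are then folded into the caller's buffers through the s_eoTable = {1, 2, 0, 3, 4} permutation visible in the epilogues. A scalar sketch of the E0 (horizontal) case, accumulating directly and skipping that final permutation:

#include <cstdint>

static inline int sign3(int x) { return (x > 0) - (x < 0); }

// diff[] holds fenc - rec; rows of diff advance by 64 (MAX_CU_SIZE), as the
// asm's stride fixup does.  The caller guarantees rec[-1] is readable.
void saoCuStatsE0Ref(const int16_t* diff, const uint16_t* rec, intptr_t stride,
                     int endX, int endY, int32_t stats[5], int32_t count[5])
{
    for (int y = 0; y < endY; y++)
    {
        int signLeft = sign3(rec[0] - rec[-1]);
        for (int x = 0; x < endX; x++)
        {
            int signRight = sign3(rec[x] - rec[x + 1]);
            int edgeType  = signRight + signLeft + 2;   // 0..4
            signLeft      = -signRight;                 // left sign for x+1
            stats[edgeType] += diff[x];
            count[edgeType]++;
        }
        diff += 64;
        rec  += stride;
    }
}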
x265_1.9.tar.gz/source/common/x86/loopfilter.h -> x265_2.0.tar.gz/source/common/x86/loopfilter.h
Changed
@@ -48,5 +48,7 @@
 void PFX(pelFilterLumaStrong_V_sse4)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ);
 void PFX(pelFilterLumaStrong_H_sse4)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ);
+void PFX(pelFilterChroma_V_sse4)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ);
+void PFX(pelFilterChroma_H_sse4)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ);
 
 #endif // ifndef X265_LOOPFILTER_H
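Note: the pelFilterChroma entry points declared above implement the HEVC normal chroma deblocking filter. A scalar sketch matching their math follows; the asm expands maskP/maskQ into per-lane masks via pb_0123, while they are treated as plain booleans here, and maxPel stands in for pw_pixel_max. For the _H variant step is 1 and offset is the stride; for _V it is the reverse.

#include <algorithm>
#include <cstdint>

void pelFilterChromaRef(uint16_t* src, intptr_t step, intptr_t offset,
                        int tc, bool maskP, bool maskQ, int maxPel)
{
    for (int i = 0; i < 4; i++, src += step)    // four positions along the edge
    {
        int p1 = src[-2 * offset], p0 = src[-offset];
        int q0 = src[0],           q1 = src[offset];
        int delta = (((q0 - p0) << 2) + p1 - q1 + 4) >> 3;
        delta = std::min(std::max(delta, -tc), tc);
        if (maskP)
            src[-offset] = (uint16_t)std::min(std::max(p0 + delta, 0), maxPel);
        if (maskQ)
            src[0] = (uint16_t)std::min(std::max(q0 - delta, 0), maxPel);
    }
}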
x265_1.9.tar.gz/source/common/x86/mc-a.asm -> x265_2.0.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -53,7 +53,6 @@
 times 8 db 2
 times 8 db 4
 times 8 db 6
-sq_1: times 1 dq 1
 
 SECTION .text
 
@@ -74,6 +73,7 @@
 cextern pw_pixel_max
 cextern pd_32
 cextern pd_64
+cextern pq_1
 
 ;====================================================================================================================
 ;void addAvg (int16_t* src0, int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride)
@@ -3638,7 +3638,7 @@
     mova    m3, [r4+16]
     movd    m2, [r4+32]     ; denom
     mova    m4, [pw_pixel_max]
-    paddw   m2, [sq_1]      ; denom+1
+    paddw   m2, [pq_1]      ; denom+1
 %endmacro

 ; src1, src2
x265_1.9.tar.gz/source/common/x86/mc-a2.asm -> x265_2.0.tar.gz/source/common/x86/mc-a2.asm
Changed
@@ -43,11 +43,11 @@ deinterleave_shuf32a: db 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 deinterleave_shuf32b: db 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31 %endif -pw_1024: times 16 dw 1024 -pd_16: times 4 dd 16 -pd_0f: times 4 dd 0xffff -pf_inv256: times 8 dd 0.00390625 +cutree_fix8_unpack_shuf: db -1,-1, 0, 1,-1,-1, 2, 3,-1,-1, 4, 5,-1,-1, 6, 7 + db -1,-1, 8, 9,-1,-1,10,11,-1,-1,12,13,-1,-1,14,15 + +const pq_256, times 4 dq 256.0 const pd_inv256, times 4 dq 0.00390625 const pd_0_5, times 4 dq 0.5 @@ -59,9 +59,11 @@ cextern pw_32 cextern pw_512 cextern pw_00ff +cextern pw_1024 cextern pw_3fff cextern pw_pixel_max cextern pd_ffff +cextern pd_16 ;The hpel_filter routines use non-temporal writes for output. ;The following defines may be uncommented for testing. @@ -1215,3 +1217,121 @@ INIT_YMM avx2 MBTREE_AVX + + +%macro CUTREE_FIX8 0 +;----------------------------------------------------------------------------- +; void cutree_fix8_pack( uint16_t *dst, double *src, int count ) +;----------------------------------------------------------------------------- +cglobal cutree_fix8_pack, 3, 4, 5 + movapd m2, [pq_256] + sub r2d, mmsize / 2 + movsxdifnidn r2, r2d + lea r1, [r1 + 8 * r2] + lea r0, [r0 + 2 * r2] + neg r2 + jg .skip_loop +.loop: + mulpd m0, m2, [r1 + 8 * r2] + mulpd m1, m2, [r1 + 8 * r2 + mmsize] + mulpd m3, m2, [r1 + 8 * r2 + 2 * mmsize] + mulpd m4, m2, [r1 + 8 * r2 + 3 * mmsize] + cvttpd2dq xm0, m0 + cvttpd2dq xm1, m1 + cvttpd2dq xm3, m3 + cvttpd2dq xm4, m4 +%if mmsize == 32 + vinserti128 m0, m0, xm3, 1 + vinserti128 m1, m1, xm4, 1 + packssdw m0, m1 +%else + punpcklqdq m0, m1 + punpcklqdq m3, m4 + packssdw m0, m3 +%endif + mova [r0 + 2 * r2], m0 + add r2, mmsize / 2 + jle .loop +.skip_loop: + sub r2, mmsize / 2 + jz .end + ; Do the remaining values in scalar in order to avoid overreading src. +.scalar: + movq xm0, [r1 + 8 * r2 + 4 * mmsize] + mulsd xm0, xm2 + cvttsd2si r3d, xm0 + mov [r0 + 2 * r2 + mmsize], r3w + inc r2 + jl .scalar +.end: + RET + +;----------------------------------------------------------------------------- +; void cutree_fix8_unpack( double *dst, uint16_t *src, int count ) +;----------------------------------------------------------------------------- +cglobal cutree_fix8_unpack, 3, 4, 7 +%if mmsize != 32 + mova m4, [cutree_fix8_unpack_shuf+16] +%endif + movapd m2, [pd_inv256] + mova m3, [cutree_fix8_unpack_shuf] + sub r2d, mmsize / 2 + movsxdifnidn r2, r2d + lea r1, [r1 + 2 * r2] + lea r0, [r0 + 8 * r2] + neg r2 + jg .skip_loop +.loop: +%if mmsize == 32 + vbroadcasti128 m0, [r1 + 2 * r2] + vbroadcasti128 m1, [r1 + 2 * r2 + 16] + pshufb m0, m3 + pshufb m1, m3 +%else + mova m1, [r1 + 2 * r2] + pshufb m0, m1, m3 + pshufb m1, m4 +%endif + psrad m0, 16 ; sign-extend + psrad m1, 16 + cvtdq2pd m5, xm0 + cvtdq2pd m6, xm1 +%if mmsize == 32 + vpermq m0, m0, q1032 + vpermq m1, m1, q1032 +%else + psrldq m0, 8 + psrldq m1, 8 +%endif + cvtdq2pd m0, xm0 + cvtdq2pd m1, xm1 + mulpd m0, m2 + mulpd m1, m2 + mulpd m5, m2 + mulpd m6, m2 + movapd [r0 + 8 * r2], m5 + movapd [r0 + 8 * r2 + mmsize], m0 + movapd [r0 + 8 * r2 + mmsize * 2], m6 + movapd [r0 + 8 * r2 + mmsize * 3], m1 + add r2, mmsize / 2 + jle .loop +.skip_loop: + sub r2, mmsize / 2 + jz .end +.scalar: + movzx r3d, word [r1 + 2 * r2 + mmsize] + movsx r3d, r3w + cvtsi2sd xm0, r3d + mulsd xm0, xm2 + movsd [r0 + 8 * r2 + 4 * mmsize], xm0 + inc r2 + jl .scalar +.end: + RET +%endmacro + +INIT_XMM ssse3 +CUTREE_FIX8 + +INIT_YMM avx2 +CUTREE_FIX8
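Note: the CUTREE_FIX8 macro above converts cutree propagate-cost arrays between double and 8.8 fixed point, using the pq_256 and pd_inv256 constants (256.0 and 1/256 = 0.00390625). Scalar equivalents of the two entry points, with the packssdw saturation made explicit:

#include <algorithm>
#include <cstdint>

void cutreeFix8Pack(uint16_t* dst, const double* src, int count)
{
    for (int i = 0; i < count; i++)
    {
        int v = (int)(src[i] * 256.0);             // cvttpd2dq: truncate toward zero
        v = std::min(std::max(v, -32768), 32767);  // packssdw: signed saturation
        dst[i] = (uint16_t)(int16_t)v;
    }
}

void cutreeFix8Unpack(double* dst, const uint16_t* src, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = (int16_t)src[i] * (1.0 / 256.0);  // psrad 16 sign-extend, * pd_inv256
}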
x265_1.9.tar.gz/source/common/x86/mc.h -> x265_2.0.tar.gz/source/common/x86/mc.h
Changed
@@ -46,4 +46,20 @@
 
 #undef PROPAGATE_COST
 
+#define FIX8UNPACK(cpu) \
+    void PFX(cutree_fix8_unpack_ ## cpu)(double *dst, uint16_t *src, int count);
+
+FIX8UNPACK(ssse3)
+FIX8UNPACK(avx2)
+
+#undef FIX8UNPACK
+
+#define FIX8PACK(cpu) \
+    void PFX(cutree_fix8_pack_## cpu)(uint16_t *dst, double *src, int count);
+
+FIX8PACK(ssse3)
+FIX8PACK(avx2)
+
+#undef FIX8PACK
+
 #endif // ifndef X265_MC_H
x265_1.9.tar.gz/source/common/x86/pixel-a.asm -> x265_2.0.tar.gz/source/common/x86/pixel-a.asm
Changed
@@ -50,9 +50,6 @@ transd_shuf1: SHUFFLE_MASK_W 0, 8, 2, 10, 4, 12, 6, 14 transd_shuf2: SHUFFLE_MASK_W 1, 9, 3, 11, 5, 13, 7, 15 -sw_f0: dq 0xfff0, 0 -pd_f0: times 4 dd 0xffff0000 - SECTION .text cextern pb_0 @@ -67,7 +64,6 @@ cextern pw_pmpmpmpm cextern pw_pmmpzzzz cextern pd_1 -cextern popcnt_table cextern pd_2 cextern hmul_16p cextern pb_movemask @@ -13803,3 +13799,589 @@ movzx eax, al RET %endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0 + + +%if HIGH_BIT_DEPTH == 1 && BIT_DEPTH == 10 +%macro LOAD_DIFF_AVX2 4 + movu %1, %3 + movu %2, %4 + psubw %1, %2 +%endmacro + +%macro LOAD_DIFF_8x4P_AVX2 6-8 r0,r2 ; 4x dest, 2x temp, 2x pointer + LOAD_DIFF_AVX2 xm%1, xm%5, [%7], [%8] + LOAD_DIFF_AVX2 xm%2, xm%6, [%7+r1], [%8+r3] + LOAD_DIFF_AVX2 xm%3, xm%5, [%7+2*r1], [%8+2*r3] + LOAD_DIFF_AVX2 xm%4, xm%6, [%7+r4], [%8+r5] + + ;lea %7, [%7+4*r1] + ;lea %8, [%8+4*r3] +%endmacro + +INIT_YMM avx2 +cglobal pixel_satd_8x8, 4,4,7 + + FIX_STRIDES r1, r3 + pxor xm6, xm6 + + ; load_diff 0 & 4 + movu xm0, [r0] + movu xm1, [r2] + vinserti128 m0, m0, [r0 + r1 * 4], 1 + vinserti128 m1, m1, [r2 + r3 * 4], 1 + psubw m0, m1 + add r0, r1 + add r2, r3 + + ; load_diff 1 & 5 + movu xm1, [r0] + movu xm2, [r2] + vinserti128 m1, m1, [r0 + r1 * 4], 1 + vinserti128 m2, m2, [r2 + r3 * 4], 1 + psubw m1, m2 + add r0, r1 + add r2, r3 + + ; load_diff 2 & 6 + movu xm2, [r0] + movu xm3, [r2] + vinserti128 m2, m2, [r0 + r1 * 4], 1 + vinserti128 m3, m3, [r2 + r3 * 4], 1 + psubw m2, m3 + add r0, r1 + add r2, r3 + + ; load_diff 3 & 7 + movu xm3, [r0] + movu xm4, [r2] + vinserti128 m3, m3, [r0 + r1 * 4], 1 + vinserti128 m4, m4, [r2 + r3 * 4], 1 + psubw m3, m4 + + SATD_8x4_SSE vertical, 0, 1, 2, 3, 4, 5, 6 + + vextracti128 xm0, m6, 1 + paddw xm6, xm0 + HADDUW xm6, xm0 + movd eax, xm6 + RET + +INIT_XMM avx2 +cglobal pixel_sa8d_8x8_internal + lea r6, [r0+4*r1] + lea r7, [r2+4*r3] + LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + ;HADAMARD2_2D 0, 1, 2, 8, 6, wd + ;HADAMARD2_2D 4, 5, 3, 9, 6, wd + ;HADAMARD2_2D 0, 2, 1, 8, 6, dq + ;HADAMARD2_2D 4, 3, 5, 9, 6, dq + ;HADAMARD2_2D 0, 4, 2, 3, 6, qdq, amax + ;HADAMARD2_2D 1, 5, 8, 9, 6, qdq, amax + + paddw m0, m1 + paddw m0, m2 + paddw m0, m8 + SAVE_MM_PERMUTATION + ret + + +INIT_XMM avx2 +cglobal pixel_sa8d_8x8, 4,8,12 + FIX_STRIDES r1, r3 + lea r4, [3*r1] + lea r5, [3*r3] + call pixel_sa8d_8x8_internal + HADDUW m0, m1 + movd eax, m0 + add eax, 1 + shr eax, 1 + RET + + +INIT_YMM avx2 +cglobal pixel_sa8d_16x16, 4,8,12 + FIX_STRIDES r1, r3 + lea r4, [3*r1] + lea r5, [3*r3] + lea r6, [r0+4*r1] + lea r7, [r2+4*r3] + vbroadcasti128 m7, [pw_1] + + ; Top 16x8 + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] ; 10 bits + movu m5, [r2] + psubw m0, m5 ; 11 bits + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax ; 16 bits + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m10, m0, m2 + + lea r0, [r0+8*r1] + lea r2, [r2+8*r3] + lea r6, [r6+8*r1] + lea r7, [r7+8*r3] + + ; Bottom 16x8 + 
;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m10, m0 + paddd m10, m2 + + HADDD m10, m0 + + movd eax, xm10 + add eax, 1 + shr eax, 1 + RET + + +; TODO: optimize me, need more 2 of YMM registers because C model get partial result every 16x16 block +INIT_YMM avx2 +cglobal pixel_sa8d_32x32, 4,8,14 + FIX_STRIDES r1, r3 + lea r4, [3*r1] + lea r5, [3*r3] + lea r6, [r0+4*r1] + lea r7, [r2+4*r3] + vbroadcasti128 m7, [pw_1] + + + ;SA8D[16x8] ; pix[0] + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m10, m0, m2 + + + ; SA8D[16x8] ; pix[16] + add r0, mmsize + add r2, mmsize + add r6, mmsize + add r7, mmsize + + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m12, m0, m2 + + + ; SA8D[16x8] ; pix[8*stride+16] + lea r0, [r0+8*r1] + lea r2, [r2+8*r3] + lea r6, [r6+8*r1] + lea r7, [r7+8*r3] + + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m12, m0 + paddd m12, m2 + + ; sum[1] + HADDD m12, m0 + + + ; SA8D[16x8] ; 
pix[8*stride] + sub r0, mmsize + sub r2, mmsize + sub r6, mmsize + sub r7, mmsize + + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m10, m0 + paddd m10, m2 + + ; sum[0] + HADDD m10, m0 + punpckldq xm10, xm12 + + + ;SA8D[16x8] ; pix[16*stridr] + lea r0, [r0+8*r1] + lea r2, [r2+8*r3] + lea r6, [r6+8*r1] + lea r7, [r7+8*r3] + + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m12, m0, m2 + + + ; SA8D[16x8] ; pix[16*stride+16] + add r0, mmsize + add r2, mmsize + add r6, mmsize + add r7, mmsize + + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m13, m0, m2 + + + ; SA8D[16x8] ; pix[24*stride+16] + lea r0, [r0+8*r1] + lea r2, [r2+8*r3] + lea r6, [r6+8*r1] + lea r7, [r7+8*r3] + + ;LOAD_DIFF_8x4P_AVX2 0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m13, m0 + paddd m13, m2 + + ; sum[3] + HADDD m13, m0 + + + ; SA8D[16x8] ; pix[24*stride] + sub r0, mmsize + sub r2, mmsize + sub r6, mmsize + sub r7, mmsize + + ;LOAD_DIFF_8x4P_AVX2 
0, 1, 2, 8, 5, 6, r0, r2 + movu m0, [r0] + movu m5, [r2] + psubw m0, m5 + movu m1, [r0 + r1] + movu m6, [r2 + r3] + psubw m1, m6 + movu m2, [r0 + r1 * 2] + movu m5, [r2 + r3 * 2] + psubw m2, m5 + movu m8, [r0 + r4] + movu m6, [r2 + r5] + psubw m8, m6 + + ;LOAD_DIFF_8x4P_AVX2 4, 5, 3, 9, 11, 6, r6, r7 + movu m4, [r6] + movu m11, [r7] + psubw m4, m11 + movu m5, [r6 + r1] + movu m6, [r7 + r3] + psubw m5, m6 + movu m3, [r6 + r1 * 2] + movu m11, [r7 + r3 * 2] + psubw m3, m11 + movu m9, [r6 + r4] + movu m6, [r7 + r5] + psubw m9, m6 + + HADAMARD8_2D 0, 1, 2, 8, 4, 5, 3, 9, 6, amax + pmaddwd m0, m7 + pmaddwd m1, m7 + pmaddwd m2, m7 + pmaddwd m8, m7 + paddd m0, m1 + paddd m2, m8 + paddd m12, m0 + paddd m12, m2 + + ; sum[2] + HADDD m12, m0 + punpckldq xm12, xm13 + + ; SA8D + punpcklqdq xm0, xm10, xm12 + paddd xm0, [pd_1] + psrld xm0, 1 + HADDD xm0, xm1 + + movd eax, xm0 + RET + +%endif ; HIGH_BIT_DEPTH == 1 && BIT_DEPTH == 10
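Note: the 10-bit satd/sa8d kernels above compute, per 8x8 block, the sum of absolute values of the 2-D Hadamard transform of the pixel differences. The reference scaling for sa8d_8x8 is (sum + 2) >> 2 on the unnormalized transform; the avx2 path folds one halving into the final amax stage (|a+b| + |a-b| == 2*max(|a|, |b|)), which is why only the trailing add eax, 1 / shr eax, 1 remains. A naive, self-contained model, assumed to match up to that scaling convention and written for clarity rather than speed:

#include <cstdint>
#include <cstdlib>

int sa8d8x8Ref(const uint16_t* pix1, intptr_t s1, const uint16_t* pix2, intptr_t s2)
{
    int d[8][8], sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            d[y][x] = (int)pix1[y * s1 + x] - (int)pix2[y * s2 + x];

    // fast Walsh-Hadamard on rows (pass 0) then columns (pass 1)
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < 8; i++)
            for (int step = 1; step < 8; step <<= 1)
                for (int j = 0; j < 8; j += step << 1)
                    for (int k = j; k < j + step; k++)
                    {
                        int a = pass ? d[k][i] : d[i][k];
                        int b = pass ? d[k + step][i] : d[i][k + step];
                        if (pass) { d[k][i] = a + b; d[k + step][i] = a - b; }
                        else      { d[i][k] = a + b; d[i][k + step] = a - b; }
                    }

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += abs(d[y][x]);
    return (sum + 2) >> 2;
}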
x265_1.9.tar.gz/source/common/yuv.cpp -> x265_2.0.tar.gz/source/common/yuv.cpp
Changed
@@ -163,14 +163,19 @@
     }
 }
 
-void Yuv::addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL)
+void Yuv::addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL, int picCsp)
 {
     primitives.cu[log2SizeL - 2].add_ps(m_buf[0], m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size);
-    if (m_csp != X265_CSP_I400)
+    if (m_csp != X265_CSP_I400 && picCsp != X265_CSP_I400)
     {
         primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize);
         primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
     }
+    if (picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)
+    {
+        primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv0.m_csize);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv0.m_csize);
+    }
 }
 
 void Yuv::addAvg(const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t absPartIdx, uint32_t width, uint32_t height, bool bLuma, bool bChroma)
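Note: the addClip() change above threads the source picture's csp into reconstruction, so gray (I400) input encoded under a chroma-capable internal csp gets its chroma planes copied straight from the prediction instead of pred + residual; call sites pass m_frame->m_fencPic->m_picCsp. A minimal scalar model of the per-plane behaviour, using plain 8-bit buffers instead of x265's Yuv/ShortYuv types:

#include <algorithm>
#include <cstdint>

// recon = clip(pred + resi): the job of the add_ps primitive
void addClipPlane(uint8_t* recon, const uint8_t* pred, const int16_t* resi,
                  int count, int maxPel)
{
    for (int i = 0; i < count; i++)
        recon[i] = (uint8_t)std::min(std::max(pred[i] + resi[i], 0), maxPel);
}

// Chroma dispatch mirroring the diff above
void addClipChroma(uint8_t* recon, const uint8_t* pred, const int16_t* resi,
                   int count, int maxPel, bool internalHasChroma, bool picHasChroma)
{
    if (!internalHasChroma)
        return;                               // internal csp is X265_CSP_I400
    if (picHasChroma)
        addClipPlane(recon, pred, resi, count, maxPel);
    else
        std::copy(pred, pred + count, recon); // gray input: the copy_pp path
}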
x265_1.9.tar.gz/source/common/yuv.h -> x265_2.0.tar.gz/source/common/yuv.h
Changed
@@ -73,7 +73,7 @@
     void copyPartToYuv(Yuv& dstYuv, uint32_t absPartIdx) const;
 
     // Clip(srcYuv0 + srcYuv1) -> m_buf .. aka recon = clip(pred + residual)
-    void addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL);
+    void addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL, int picCsp);
 
     // (srcYuv0 + srcYuv1)/2 for YUV partition (bidir averaging)
     void addAvg(const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t absPartIdx, uint32_t width, uint32_t height, bool bLuma, bool bChroma);
x265_1.9.tar.gz/source/compat/msvc/stdint.h -> x265_2.0.tar.gz/source/compat/msvc/stdint.h
Changed
@@ -8,6 +8,7 @@
 #if !defined(UINT64_MAX)
 #include <limits.h>
 #define UINT64_MAX _UI64_MAX
+#define INT64_MAX _I64_MAX
 #define INT16_MAX _I16_MAX
 #endif
x265_1.9.tar.gz/source/encoder/analysis.cpp -> x265_2.0.tar.gz/source/encoder/analysis.cpp
Changed
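Note: among other things, the analysis.cpp diff below adds a per-CU QP refinement pass (qprdRefine, driven by the new bEnableRdRefine path): starting from the base QP it probes QP+2, QP+4, ... and then QP-2, QP-4, ..., re-encoding the CU each time against the cached RD cost and abandoning a direction after one non-improving step. A sketch of that hill climb with an abstract recode callback; the callback and parameter names are illustrative:

#include <cstdint>
#include <functional>

int qpRefineSketch(int qp, int qpMin, int qpMax, uint64_t origCost,
                   const std::function<uint64_t(int)>& recodeCost)
{
    int bestQP = qp;
    uint64_t bestCost = origCost;
    for (int dir = 2; dir >= -2; dir -= 4)      // +2 steps, then -2 steps
    {
        uint64_t prevCost = origCost;
        int failures = 0;
        const int threshold = 1;
        for (int modQP = qp + dir; modQP >= qpMin && modQP <= qpMax; modQP += dir)
        {
            uint64_t cost = recodeCost(modQP);  // recodeCU() + rdCost in x265
            if (cost < bestCost) { bestCost = cost; bestQP = modQP; }
            failures = (cost < prevCost) ? 0 : failures + 1;
            if (failures > threshold)
                break;
            prevCost = cost;
        }
    }
    return bestQP;
}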
@@ -74,14 +74,18 @@ { m_reuseInterDataCTU = NULL; m_reuseRef = NULL; - m_reuseBestMergeCand = NULL; - m_reuseMv = NULL; + m_bHD = false; } bool Analysis::create(ThreadLocalData *tld) { m_tld = tld; m_bTryLossless = m_param->bCULossless && !m_param->bLossless && m_param->rdLevel >= 2; - m_bChromaSa8d = m_param->rdLevel >= 3; + + int costArrSize = 1; + uint32_t maxDQPDepth = g_log2Size[m_param->maxCUSize] - g_log2Size[m_param->rc.qgSize]; + for (uint32_t i = 1; i <= maxDQPDepth; i++) + costArrSize += (1 << (i * 2)); + cacheCost = X265_MALLOC(uint64_t, costArrSize); int csp = m_param->internalCsp; uint32_t cuSize = g_maxCUSize; @@ -102,6 +106,8 @@ md.pred[j].fencYuv = &md.fencYuv; } } + if (m_param->sourceHeight >= 1080) + m_bHD = true; return ok; } @@ -119,12 +125,14 @@ m_modeDepth[i].pred[j].reconYuv.destroy(); } } + X265_FREE(cacheCost); } Mode& Analysis::compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext) { m_slice = ctu.m_slice; m_frame = &frame; + m_bChromaSa8d = m_param->rdLevel >= 3; #if _DEBUG || CHECKED_BUILD invalidateContexts(0); @@ -142,8 +150,13 @@ int numPredDir = m_slice->isInterP() ? 1 : 2; m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; - m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS]; - m_reuseMv = &m_reuseInterDataCTU->mv[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; + m_reuseDepth = &m_reuseInterDataCTU->depth[ctu.m_cuAddr * ctu.m_numPartitions]; + m_reuseModes = &m_reuseInterDataCTU->modes[ctu.m_cuAddr * ctu.m_numPartitions]; + m_reusePartSize = &m_reuseInterDataCTU->partSize[ctu.m_cuAddr * ctu.m_numPartitions]; + m_reuseMergeFlag = &m_reuseInterDataCTU->mergeFlag[ctu.m_cuAddr * ctu.m_numPartitions]; + if (m_param->analysisMode == X265_ANALYSIS_SAVE) + for (int i = 0; i < X265_MAX_PRED_MODE_PER_CTU * numPredDir; i++) + m_reuseRef[i] = -1; } ProfileCUScope(ctu, totalCTUTime, totalCTUs); @@ -158,14 +171,6 @@ memcpy(ctu.m_chromaIntraDir, &intraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); } compressIntraCU(ctu, cuGeom, qp); - if (m_param->analysisMode == X265_ANALYSIS_SAVE && intraDataCTU) - { - CUData* bestCU = &m_modeDepth[0].bestMode->cu; - memcpy(&intraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); - memcpy(&intraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition); - memcpy(&intraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition); - memcpy(&intraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], bestCU->m_chromaIntraDir, sizeof(uint8_t) * numPartition); - } } else { @@ -189,18 +194,12 @@ else if (m_param->rdLevel <= 4) compressInterCU_rd0_4(ctu, cuGeom, qp); else - { - uint32_t zOrder = 0; - compressInterCU_rd5_6(ctu, cuGeom, zOrder, qp); - if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.interData) - { - CUData* bestCU = &m_modeDepth[0].bestMode->cu; - memcpy(&m_reuseInterDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); - memcpy(&m_reuseInterDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_predMode, sizeof(uint8_t) * numPartition); - } - } + compressInterCU_rd5_6(ctu, cuGeom, qp); } + if (m_param->bEnableRdRefine) + qprdRefine(ctu, cuGeom, qp, qp); + return *m_modeDepth[0].bestMode; } @@ 
-229,6 +228,61 @@ } } +void Analysis::qprdRefine(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp, int32_t lqp) +{ + uint32_t depth = cuGeom.depth; + ModeDepth& md = m_modeDepth[depth]; + md.bestMode = NULL; + + bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth; + + int bestCUQP = qp; + int lambdaQP = lqp; + + bool doQPRefine = (bDecidedDepth && depth <= m_slice->m_pps->maxCuDQPDepth) || (!bDecidedDepth && depth == m_slice->m_pps->maxCuDQPDepth); + + if (doQPRefine) + { + uint64_t bestCUCost, origCUCost, cuCost, cuPrevCost; + + int cuIdx = (cuGeom.childOffset - 1) / 3; + bestCUCost = origCUCost = cacheCost[cuIdx]; + + for (int dir = 2; dir >= -2; dir -= 4) + { + int threshold = 1; + int failure = 0; + cuPrevCost = origCUCost; + + int modCUQP = qp + dir; + while (modCUQP >= QP_MIN && modCUQP <= QP_MAX_SPEC) + { + recodeCU(parentCTU, cuGeom, modCUQP, qp); + cuCost = md.bestMode->rdCost; + + COPY2_IF_LT(bestCUCost, cuCost, bestCUQP, modCUQP); + if (cuCost < cuPrevCost) + failure = 0; + else + failure++; + + if (failure > threshold) + break; + + cuPrevCost = cuCost; + modCUQP += dir; + } + } + lambdaQP = bestCUQP; + } + + recodeCU(parentCTU, cuGeom, bestCUQP, lambdaQP); + + /* Copy best data to encData CTU and recon */ + md.bestMode->cu.copyToPic(depth); + md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx); +} + void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) { uint32_t depth = cuGeom.depth; @@ -334,6 +388,12 @@ checkBestMode(*splitPred, depth); } + if (m_param->bEnableRdRefine && depth <= m_slice->m_pps->maxCuDQPDepth) + { + int cuIdx = (cuGeom.childOffset - 1) / 3; + cacheCost[cuIdx] = md.bestMode->rdCost; + } + /* Copy best data to encData CTU and recon */ md.bestMode->cu.copyToPic(depth); if (md.bestMode != &md.pred[PRED_SPLIT]) @@ -377,6 +437,7 @@ slave.m_slice = m_slice; slave.m_frame = m_frame; slave.m_param = m_param; + slave.m_bChromaSa8d = m_param->rdLevel >= 3; slave.setLambdaFromQP(md.pred[PRED_2Nx2N].cu, m_rdCost.m_qp); slave.invalidateContexts(0); slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur); @@ -555,7 +616,7 @@ if (m_param->rdLevel <= 4) checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); else - checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false); + checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); } bool bNoSplit = false; @@ -827,8 +888,11 @@ bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); uint32_t minDepth = topSkipMinDepth(parentCTU, cuGeom); - bool earlyskip = false; + bool skipModes = false; /* Skip any remaining mode analyses at current depth */ + bool skipRecursion = false; /* Skip recursion */ bool splitIntra = true; + bool skipRectAmp = false; + bool chooseMerge = false; SplitData splitData[4]; splitData[0].initSplitCUData(); @@ -844,27 +908,56 @@ md.pred[PRED_2Nx2N].sa8dCost = 0; } - /* Step 1. 
Evaluate Merge/Skip candidates for likely early-outs */ - if (mightNotSplit && depth >= minDepth) + if (m_param->analysisMode == X265_ANALYSIS_LOAD) + { + if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx]) + { + if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP) + { + md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); + md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); + checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); + + skipRecursion = !!m_param->bEnableRecursionSkip && md.bestMode; + if (m_param->rdLevel) + skipModes = m_param->bEnableEarlySkip && md.bestMode; + } + if (m_reusePartSize[cuGeom.absPartIdx] == SIZE_2Nx2N) + { + if (m_reuseModes[cuGeom.absPartIdx] != MODE_INTRA && m_reuseModes[cuGeom.absPartIdx] != 4) + { + skipRectAmp = true && !!md.bestMode; + chooseMerge = !!m_reuseMergeFlag[cuGeom.absPartIdx] && !!md.bestMode; + } + } + } + } + + /* Step 1. Evaluate Merge/Skip candidates for likely early-outs, if skip mode was not set above */ + if (mightNotSplit && depth >= minDepth && !md.bestMode) /* TODO: Re-evaluate if analysis load/save still works */ { /* Compute Merge Cost */ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); if (m_param->rdLevel) - earlyskip = m_param->bEnableEarlySkip && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth + skipModes = m_param->bEnableEarlySkip && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth } - bool bNoSplit = false; - if (md.bestMode) + if (md.bestMode && m_param->bEnableRecursionSkip) { - bNoSplit = md.bestMode->cu.isSkipped(0); - if (mightSplit && depth && depth >= minDepth && !bNoSplit) - bNoSplit = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode); + skipRecursion = md.bestMode->cu.isSkipped(0); + if (mightSplit && depth >= minDepth && !skipRecursion) + { + if (depth) + skipRecursion = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode); + if (m_bHD && !skipRecursion && m_param->rdLevel == 2 && md.fencYuv.m_size != MAX_CU_SIZE) + skipRecursion = complexityCheckCU(*md.bestMode); + } } /* Step 2. 
Evaluate each of the 4 split sub-blocks in series */ - if (mightSplit && !bNoSplit) + if (mightSplit && !skipRecursion) { Mode* splitPred = &md.pred[PRED_SPLIT]; splitPred->initCosts(); @@ -926,7 +1019,7 @@ if (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth && m_slice->m_pps->maxCuDQPDepth != 0) setLambdaFromQP(parentCTU, qp); - if (!earlyskip) + if (!skipModes) { uint32_t refMasks[2]; refMasks[0] = allSplitRefs; @@ -947,158 +1040,161 @@ } Mode *bestInter = &md.pred[PRED_2Nx2N]; - if (m_param->bEnableRectInter) + if (!skipRectAmp) { - uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; - uint32_t threshold_2NxN, threshold_Nx2N; - - if (m_slice->m_sliceType == P_SLICE) - { - threshold_2NxN = splitData[0].mvCost[0] + splitData[1].mvCost[0]; - threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; - } - else - { - threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0] - + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1; - threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] - + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1; - } - - int try_2NxN_first = threshold_2NxN < threshold_Nx2N; - if (try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN) - { - refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); - if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_2NxN]; - } - - if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_Nx2N) - { - refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */ - refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */ - md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks); - if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_Nx2N]; - } - - if (!try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN) - { - refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); - if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_2NxN]; - } - } - - if (m_slice->m_sps->maxAMPDepth > depth) - { - uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; - uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N; - - if (m_slice->m_sliceType == P_SLICE) + if (m_param->bEnableRectInter) { - threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0]; - threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0]; + uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; + uint32_t threshold_2NxN, threshold_Nx2N; - threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; - threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0]; - } - else - { - threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0] + if (m_slice->m_sliceType == P_SLICE) + { + threshold_2NxN = splitData[0].mvCost[0] + 
splitData[1].mvCost[0]; + threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; + } + else + { + threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0] + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1; - threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0] - + splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; - - threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] + threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1; - threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0] - + splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; - } - - bool bHor = false, bVer = false; - if (bestInter->cu.m_partSize[0] == SIZE_2NxN) - bHor = true; - else if (bestInter->cu.m_partSize[0] == SIZE_Nx2N) - bVer = true; - else if (bestInter->cu.m_partSize[0] == SIZE_2Nx2N && - md.bestMode && md.bestMode->cu.getQtRootCbf(0)) - { - bHor = true; - bVer = true; - } + } - if (bHor) - { - int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU; - if (try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD) + int try_2NxN_first = threshold_2NxN < threshold_Nx2N; + if (try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN) { - refMasks[0] = allSplitRefs; /* 75% top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ - md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); - if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_2NxnD]; + refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); + if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_2NxN]; } - if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnU) + if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_Nx2N) { - refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */ - refMasks[1] = allSplitRefs; /* 75% bot */ - md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks); - if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_2NxnU]; + refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */ + refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */ + md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks); + if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_Nx2N]; } - if (!try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD) + if (!try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN) { - refMasks[0] = allSplitRefs; /* 75% top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ - md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); - if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_2NxnD]; + refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 
bot */ + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); + if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_2NxN]; } } - if (bVer) + + if (m_slice->m_sps->maxAMPDepth > depth) { - int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N; - if (try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N) + uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; + uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N; + + if (m_slice->m_sliceType == P_SLICE) { - refMasks[0] = allSplitRefs; /* 75% left */ - refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); - if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_nRx2N]; + threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0]; + threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0]; + + threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; + threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0]; + } + else + { + threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0] + + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1; + threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0] + + splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; + + threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] + + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1; + threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0] + + splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; } - if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nLx2N) + bool bHor = false, bVer = false; + if (bestInter->cu.m_partSize[0] == SIZE_2NxN) + bHor = true; + else if (bestInter->cu.m_partSize[0] == SIZE_Nx2N) + bVer = true; + else if (bestInter->cu.m_partSize[0] == SIZE_2Nx2N && + md.bestMode && md.bestMode->cu.getQtRootCbf(0)) { - refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left */ - refMasks[1] = allSplitRefs; /* 75% right */ - md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks); - if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_nLx2N]; + bHor = true; + bVer = true; } - if (!try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N) + if (bHor) { - refMasks[0] = allSplitRefs; /* 75% left */ - refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); - if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost) - bestInter = &md.pred[PRED_nRx2N]; + int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU; + if (try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD) + { + refMasks[0] = allSplitRefs; /* 75% top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); + if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_2NxnD]; + } + + if (splitCost < 
md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnU) + { + refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */ + refMasks[1] = allSplitRefs; /* 75% bot */ + md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks); + if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_2NxnU]; + } + + if (!try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD) + { + refMasks[0] = allSplitRefs; /* 75% top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); + if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_2NxnD]; + } + } + if (bVer) + { + int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N; + if (try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N) + { + refMasks[0] = allSplitRefs; /* 75% left */ + refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); + if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_nRx2N]; + } + + if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nLx2N) + { + refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left */ + refMasks[1] = allSplitRefs; /* 75% right */ + md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks); + if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_nLx2N]; + } + + if (!try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N) + { + refMasks[0] = allSplitRefs; /* 75% left */ + refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); + if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost) + bestInter = &md.pred[PRED_nRx2N]; + } } } } @@ -1106,7 +1202,7 @@ if (m_param->rdLevel >= 3) { /* Calculate RD cost of best inter option */ - if (!m_bChromaSa8d && (m_csp != X265_CSP_I400)) /* When m_bChromaSa8d is enabled, chroma MC has already been done */ + if ((!m_bChromaSa8d && (m_csp != X265_CSP_I400)) || (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)) /* When m_bChromaSa8d is enabled, chroma MC has already been done */ { uint32_t numPU = bestInter->cu.getNumPartInter(0); for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) @@ -1115,15 +1211,26 @@ motionCompensation(bestInter->cu, pu, bestInter->predYuv, false, true); } } - encodeResAndCalcRdInterCU(*bestInter, cuGeom); - checkBestMode(*bestInter, depth); - /* If BIDIR is available and within 17/16 of best inter option, choose by RDO */ - if (m_slice->m_sliceType == B_SLICE && md.pred[PRED_BIDIR].sa8dCost != MAX_INT64 && - md.pred[PRED_BIDIR].sa8dCost * 16 <= bestInter->sa8dCost * 17) + if (!chooseMerge) { - encodeResAndCalcRdInterCU(md.pred[PRED_BIDIR], cuGeom); - checkBestMode(md.pred[PRED_BIDIR], depth); + encodeResAndCalcRdInterCU(*bestInter, cuGeom); + checkBestMode(*bestInter, depth); + + /* If BIDIR is available and within 17/16 of best inter option, choose by RDO */ + if (m_slice->m_sliceType == B_SLICE && md.pred[PRED_BIDIR].sa8dCost != 
MAX_INT64 && + md.pred[PRED_BIDIR].sa8dCost * 16 <= bestInter->sa8dCost * 17) + { + uint32_t numPU = md.pred[PRED_BIDIR].cu.getNumPartInter(0); + if (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400) + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) + { + PredictionUnit pu(md.pred[PRED_BIDIR].cu, cuGeom, puIdx); + motionCompensation(md.pred[PRED_BIDIR].cu, pu, md.pred[PRED_BIDIR].predYuv, true, true); + } + encodeResAndCalcRdInterCU(md.pred[PRED_BIDIR], cuGeom); + checkBestMode(md.pred[PRED_BIDIR], depth); + } } if ((bTryIntra && md.bestMode->cu.getQtRootCbf(0)) || @@ -1198,10 +1305,10 @@ uint32_t tuDepthRange[2]; cu.getInterTUQtDepthRange(tuDepthRange, 0); - m_rqt[cuGeom.depth].tmpResiYuv.subtract(*md.bestMode->fencYuv, md.bestMode->predYuv, cuGeom.log2CUSize); + m_rqt[cuGeom.depth].tmpResiYuv.subtract(*md.bestMode->fencYuv, md.bestMode->predYuv, cuGeom.log2CUSize, m_frame->m_fencPic->m_picCsp); residualTransformQuantInter(*md.bestMode, cuGeom, 0, 0, tuDepthRange); if (cu.getQtRootCbf(0)) - md.bestMode->reconYuv.addClip(md.bestMode->predYuv, m_rqt[cuGeom.depth].tmpResiYuv, cu.m_log2CUSize[0]); + md.bestMode->reconYuv.addClip(md.bestMode->predYuv, m_rqt[cuGeom.depth].tmpResiYuv, cu.m_log2CUSize[0], m_frame->m_fencPic->m_picCsp); else { md.bestMode->reconYuv.copyFromYuv(md.bestMode->predYuv); @@ -1241,7 +1348,7 @@ addSplitFlagCost(*md.bestMode, cuGeom.depth); } - if (mightSplit && !bNoSplit) + if (mightSplit && !skipRecursion) { Mode* splitPred = &md.pred[PRED_SPLIT]; if (!md.bestMode) @@ -1279,9 +1386,8 @@ splitCUData.sa8dCost = md.pred[PRED_2Nx2N].sa8dCost; } - if (mightNotSplit) + if (mightNotSplit && md.bestMode->cu.isSkipped(0)) { - /* early-out statistics */ FrameData& curEncData = *m_frame->m_encData; FrameData::RCStatCU& cuStat = curEncData.m_cuStat[parentCTU.m_cuAddr]; uint64_t temp = cuStat.avgCost[depth] * cuStat.count[depth]; @@ -1297,7 +1403,7 @@ return splitCUData; } -SplitData Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp) +SplitData Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) { uint32_t depth = cuGeom.depth; ModeDepth& md = m_modeDepth[depth]; @@ -1305,8 +1411,10 @@ bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); - bool foundSkip = false; + bool skipRecursion = false; + bool skipModes = false; bool splitIntra = true; + bool skipRectAmp = false; // avoid uninitialize value in below reference if (m_param->limitModes) @@ -1316,41 +1424,55 @@ md.pred[PRED_2Nx2N].rdCost = 0; } - if (m_param->analysisMode == X265_ANALYSIS_LOAD) - { - uint8_t* reuseDepth = &m_reuseInterDataCTU->depth[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; - uint8_t* reuseModes = &m_reuseInterDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; - if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx && reuseModes[zOrder] == MODE_SKIP) - { - md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); - md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); - checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, true); - - // increment zOrder offset to point to next best depth in sharedDepth buffer - zOrder += g_depthInc[g_maxCUDepth - 1][reuseDepth[zOrder]]; - - foundSkip = true; - } - } - SplitData splitData[4]; splitData[0].initSplitCUData(); splitData[1].initSplitCUData(); splitData[2].initSplitCUData(); splitData[3].initSplitCUData(); + uint32_t 
allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs | splitData[2].splitRefs | splitData[3].splitRefs; + uint32_t refMasks[2]; + if (m_param->analysisMode == X265_ANALYSIS_LOAD) + { + if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx]) + { + if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP) + { + md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); + md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); + checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); + skipModes = !!m_param->bEnableEarlySkip && md.bestMode; + refMasks[0] = allSplitRefs; + md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, refMasks); + checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth); + + if (m_param->bEnableRecursionSkip && depth && m_modeDepth[depth - 1].bestMode) + skipRecursion = md.bestMode && !md.bestMode->cu.getQtRootCbf(0); + } + if (m_reusePartSize[cuGeom.absPartIdx] == SIZE_2Nx2N) + skipRectAmp = true && !!md.bestMode; + } + } /* Step 1. Evaluate Merge/Skip candidates for likely early-outs */ - if (mightNotSplit && !foundSkip) + if (mightNotSplit && !md.bestMode) { md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); - checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false); - foundSkip = md.bestMode && !md.bestMode->cu.getQtRootCbf(0); + checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); + skipModes = m_param->bEnableEarlySkip && md.bestMode && !md.bestMode->cu.getQtRootCbf(0); + refMasks[0] = allSplitRefs; + md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, refMasks); + checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth); + + if (m_param->bEnableRecursionSkip && depth && m_modeDepth[depth - 1].bestMode) + skipRecursion = md.bestMode && !md.bestMode->cu.getQtRootCbf(0); } // estimate split cost /* Step 2. Evaluate each of the 4 split sub-blocks in series */ - if (mightSplit && !foundSkip) + if (mightSplit && !skipRecursion) { Mode* splitPred = &md.pred[PRED_SPLIT]; splitPred->initCosts(); @@ -1375,7 +1497,7 @@ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); - splitData[subPartIdx] = compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP); + splitData[subPartIdx] = compressInterCU_rd5_6(parentCTU, childGeom, nextQP); // Save best CU and pred data for this sub CU splitIntra |= nd.bestMode->cu.isIntra(0); @@ -1387,7 +1509,6 @@ else { splitCU->setEmptyPart(childGeom, subPartIdx); - zOrder += g_depthInc[g_maxCUDepth - 1][nextDepth]; } } nextContext->store(splitPred->contexts); @@ -1402,20 +1523,16 @@ /* Split CUs * 0 1 * 2 3 */ - uint32_t allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs | splitData[2].splitRefs | splitData[3].splitRefs; + allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs | splitData[2].splitRefs | splitData[3].splitRefs; /* Step 3. 
Evaluate ME (2Nx2N, rect, amp) and intra modes at current depth */ if (mightNotSplit) { if (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth && m_slice->m_pps->maxCuDQPDepth != 0) setLambdaFromQP(parentCTU, qp); - if (!(foundSkip && m_param->bEnableEarlySkip)) + if (!skipModes) { - uint32_t refMasks[2]; refMasks[0] = allSplitRefs; - md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, refMasks); - checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth); if (m_param->limitReferences & X265_REF_LIMIT_CU) { @@ -1430,155 +1547,165 @@ checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], cuGeom); if (md.pred[PRED_BIDIR].sa8dCost < MAX_INT64) { + uint32_t numPU = md.pred[PRED_BIDIR].cu.getNumPartInter(0); + if (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400) + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) + { + PredictionUnit pu(md.pred[PRED_BIDIR].cu, cuGeom, puIdx); + motionCompensation(md.pred[PRED_BIDIR].cu, pu, md.pred[PRED_BIDIR].predYuv, true, true); + } encodeResAndCalcRdInterCU(md.pred[PRED_BIDIR], cuGeom); checkBestMode(md.pred[PRED_BIDIR], cuGeom.depth); } } - if (m_param->bEnableRectInter) + if (!skipRectAmp) { - uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; - uint32_t threshold_2NxN, threshold_Nx2N; - - if (m_slice->m_sliceType == P_SLICE) - { - threshold_2NxN = splitData[0].mvCost[0] + splitData[1].mvCost[0]; - threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; - } - else + if (m_param->bEnableRectInter) { - threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0] - + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1; - threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] - + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1; - } - - int try_2NxN_first = threshold_2NxN < threshold_Nx2N; - if (try_2NxN_first && splitCost < md.bestMode->rdCost + threshold_2NxN) - { - refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); - checkBestMode(md.pred[PRED_2NxN], cuGeom.depth); - } + uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; + uint32_t threshold_2NxN, threshold_Nx2N; - if (splitCost < md.bestMode->rdCost + threshold_Nx2N) - { - refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */ - refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */ - md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks); - checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth); - } - - if (!try_2NxN_first && splitCost < md.bestMode->rdCost + threshold_2NxN) - { - refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ - md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); - checkBestMode(md.pred[PRED_2NxN], cuGeom.depth); - } - } - - // Try AMP (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N) - if (m_slice->m_sps->maxAMPDepth > depth) - { - uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + 
splitData[3].sa8dCost; - uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N; - - if (m_slice->m_sliceType == P_SLICE) - { - threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0]; - threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0]; - - threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; - threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0]; - } - else - { - threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0] + if (m_slice->m_sliceType == P_SLICE) + { + threshold_2NxN = splitData[0].mvCost[0] + splitData[1].mvCost[0]; + threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; + } + else + { + threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0] + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1; - threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0] - + splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; - - threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] + threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1; - threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0] - + splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; - } - - bool bHor = false, bVer = false; - if (md.bestMode->cu.m_partSize[0] == SIZE_2NxN) - bHor = true; - else if (md.bestMode->cu.m_partSize[0] == SIZE_Nx2N) - bVer = true; - else if (md.bestMode->cu.m_partSize[0] == SIZE_2Nx2N && !md.bestMode->cu.m_mergeFlag[0]) - { - bHor = true; - bVer = true; - } + } - if (bHor) - { - int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU; - if (try_2NxnD_first && splitCost < md.bestMode->rdCost + threshold_2NxnD) + int try_2NxN_first = threshold_2NxN < threshold_Nx2N; + if (try_2NxN_first && splitCost < md.bestMode->rdCost + threshold_2NxN) { - refMasks[0] = allSplitRefs; /* 75% top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ - md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); - checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth); + refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); + checkBestMode(md.pred[PRED_2NxN], cuGeom.depth); } - if (splitCost < md.bestMode->rdCost + threshold_2NxnU) + if (splitCost < md.bestMode->rdCost + threshold_Nx2N) { - refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */ - refMasks[1] = allSplitRefs; /* 75% bot */ - md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks); - checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth); + refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */ + refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */ + md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks); + checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth); } - if (!try_2NxnD_first && splitCost < md.bestMode->rdCost + threshold_2NxnD) + if (!try_2NxN_first && splitCost < md.bestMode->rdCost + threshold_2NxN) { - refMasks[0] = allSplitRefs; /* 75% top */ - refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ - 
md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); - checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth); + refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */ + md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); + checkBestMode(md.pred[PRED_2NxN], cuGeom.depth); } } - if (bVer) + // Try AMP (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N) + if (m_slice->m_sps->maxAMPDepth > depth) { - int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N; - if (try_nRx2N_first && splitCost < md.bestMode->rdCost + threshold_nRx2N) + uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost; + uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N; + + if (m_slice->m_sliceType == P_SLICE) { - refMasks[0] = allSplitRefs; /* 75% left */ - refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); - checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth); + threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0]; + threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0]; + + threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0]; + threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0]; } + else + { + threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0] + + splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1; + threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0] + + splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; - if (splitCost < md.bestMode->rdCost + threshold_nLx2N) + threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0] + + splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1; + threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0] + + splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1; + } + + bool bHor = false, bVer = false; + if (md.bestMode->cu.m_partSize[0] == SIZE_2NxN) + bHor = true; + else if (md.bestMode->cu.m_partSize[0] == SIZE_Nx2N) + bVer = true; + else if (md.bestMode->cu.m_partSize[0] == SIZE_2Nx2N && !md.bestMode->cu.m_mergeFlag[0]) { - refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left */ - refMasks[1] = allSplitRefs; /* 75% right */ - md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks); - checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth); + bHor = true; + bVer = true; + } + + if (bHor) + { + int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU; + if (try_2NxnD_first && splitCost < md.bestMode->rdCost + threshold_2NxnD) + { + refMasks[0] = allSplitRefs; /* 75% top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); + checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth); + } + + if (splitCost < md.bestMode->rdCost + threshold_2NxnU) + { + refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */ + refMasks[1] = allSplitRefs; /* 75% bot */ + md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxnU], 
cuGeom, SIZE_2NxnU, refMasks); + checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth); + } + + if (!try_2NxnD_first && splitCost < md.bestMode->rdCost + threshold_2NxnD) + { + refMasks[0] = allSplitRefs; /* 75% top */ + refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */ + md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); + checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth); + } } - if (!try_nRx2N_first && splitCost < md.bestMode->rdCost + threshold_nRx2N) + if (bVer) { - refMasks[0] = allSplitRefs; /* 75% left */ - refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ - md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); - checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth); + int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N; + if (try_nRx2N_first && splitCost < md.bestMode->rdCost + threshold_nRx2N) + { + refMasks[0] = allSplitRefs; /* 75% left */ + refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); + checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth); + } + + if (splitCost < md.bestMode->rdCost + threshold_nLx2N) + { + refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left */ + refMasks[1] = allSplitRefs; /* 75% right */ + md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks); + checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth); + } + + if (!try_nRx2N_first && splitCost < md.bestMode->rdCost + threshold_nRx2N) + { + refMasks[0] = allSplitRefs; /* 75% left */ + refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */ + md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); + checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); + checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth); + } } } } @@ -1604,6 +1731,17 @@ ProfileCounter(parentCTU, skippedIntraCU[cuGeom.depth]); } } + if ((md.bestMode->cu.isInter(0) && !(md.bestMode->cu.m_mergeFlag[0] && md.bestMode->cu.m_partSize[0] == SIZE_2Nx2N)) && (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)) + { + uint32_t numPU = md.bestMode->cu.getNumPartInter(0); + + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) + { + PredictionUnit pu(md.bestMode->cu, cuGeom, puIdx); + motionCompensation(md.bestMode->cu, pu, md.bestMode->predYuv, false, m_csp != X265_CSP_I400); + } + encodeResAndCalcRdInterCU(*md.bestMode, cuGeom); + } } if (m_bTryLossless) @@ -1614,9 +1752,15 @@ } /* compare split RD cost against best cost */ - if (mightSplit && !foundSkip) + if (mightSplit && !skipRecursion) checkBestMode(md.pred[PRED_SPLIT], depth); + if (m_param->bEnableRdRefine && depth <= m_slice->m_pps->maxCuDQPDepth) + { + int cuIdx = (cuGeom.childOffset - 1) / 3; + cacheCost[cuIdx] = md.bestMode->rdCost; + } + /* determine which motion references the parent CU should search */ SplitData splitCUData; splitCUData.initSplitCUData(); @@ -1648,6 +1792,110 @@ return splitCUData; } +void Analysis::recodeCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp, int32_t lqp) +{ + uint32_t depth = cuGeom.depth; + ModeDepth& md = m_modeDepth[depth]; + md.bestMode = NULL; + + bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); + bool mightNotSplit = !(cuGeom.flags & 
CUGeom::SPLIT_MANDATORY); + bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth; + + if (bDecidedDepth) + { + setLambdaFromQP(parentCTU, qp, lqp); + + Mode& mode = md.pred[0]; + md.bestMode = &mode; + mode.cu.initSubCU(parentCTU, cuGeom, qp); + PartSize size = (PartSize)parentCTU.m_partSize[cuGeom.absPartIdx]; + if (parentCTU.isIntra(cuGeom.absPartIdx)) + { + memcpy(mode.cu.m_lumaIntraDir, parentCTU.m_lumaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + memcpy(mode.cu.m_chromaIntraDir, parentCTU.m_chromaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + checkIntra(mode, cuGeom, size); + } + else + { + mode.cu.copyFromPic(parentCTU, cuGeom, m_csp, false); + for (int part = 0; part < (int)parentCTU.getNumPartInter(cuGeom.absPartIdx); part++) + { + PredictionUnit pu(mode.cu, cuGeom, part); + motionCompensation(mode.cu, pu, mode.predYuv, true, true); + } + + if (parentCTU.isSkipped(cuGeom.absPartIdx)) + encodeResAndCalcRdSkipCU(mode); + else + encodeResAndCalcRdInterCU(mode, cuGeom); + + /* checkMerge2Nx2N function performs checkDQP after encoding residual, do the same */ + bool mergeInter2Nx2N = size == SIZE_2Nx2N && parentCTU.m_mergeFlag[cuGeom.absPartIdx]; + if (parentCTU.isSkipped(cuGeom.absPartIdx) || mergeInter2Nx2N) + checkDQP(mode, cuGeom); + } + + if (m_bTryLossless) + tryLossless(cuGeom); + + if (mightSplit) + addSplitFlagCost(*md.bestMode, cuGeom.depth); + } + else + { + Mode* splitPred = &md.pred[PRED_SPLIT]; + md.bestMode = splitPred; + splitPred->initCosts(); + CUData* splitCU = &splitPred->cu; + splitCU->initSubCU(parentCTU, cuGeom, qp); + + uint32_t nextDepth = depth + 1; + ModeDepth& nd = m_modeDepth[nextDepth]; + invalidateContexts(nextDepth); + Entropy* nextContext = &m_rqt[depth].cur; + int nextQP = qp; + + for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) + { + const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); + if (childGeom.flags & CUGeom::PRESENT) + { + m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); + m_rqt[nextDepth].cur.load(*nextContext); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + qprdRefine(parentCTU, childGeom, nextQP, lqp); + + // Save best CU and pred data for this sub CU + splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); + splitPred->addSubCosts(*nd.bestMode); + nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx); + nextContext = &nd.bestMode->contexts; + } + else + { + splitCU->setEmptyPart(childGeom, subPartIdx); + // Set depth of non-present CU to 0 to ensure that correct CU is fetched as reference to code deltaQP + memset(parentCTU.m_cuDepth + childGeom.absPartIdx, 0, childGeom.numPartitions); + } + } + nextContext->store(splitPred->contexts); + if (mightNotSplit) + addSplitFlagCost(*splitPred, cuGeom.depth); + else + updateModeCost(*splitPred); + + checkDQPForSplitPred(*splitPred, cuGeom); + + /* Copy best data to encData CTU and recon */ + md.bestMode->cu.copyToPic(depth); + md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx); + } +} + /* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */ void Analysis::checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom) { @@ -1705,11 +1953,11 @@ tempPred->cu.m_mv[1][0] = candMvField[i][1].mv; tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; 
tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; - motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400)); + motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)); tempPred->sa8dBits = getTUBits(i, numMergeCand); tempPred->distortion = primitives.cu[sizeIdx].sa8d(fencYuv->m_buf[0], fencYuv->m_size, tempPred->predYuv.m_buf[0], tempPred->predYuv.m_size); - if (m_bChromaSa8d && (m_csp != X265_CSP_I400)) + if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)) { tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[1], fencYuv->m_csize, tempPred->predYuv.m_buf[1], tempPred->predYuv.m_csize); tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[2], fencYuv->m_csize, tempPred->predYuv.m_buf[2], tempPred->predYuv.m_csize); @@ -1728,7 +1976,7 @@ return; /* calculate the motion compensation for chroma for the best mode selected */ - if (!m_bChromaSa8d && (m_csp != X265_CSP_I400)) /* Chroma MC was done above */ + if ((!m_bChromaSa8d && (m_csp != X265_CSP_I400)) || (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400)) /* Chroma MC was done above */ motionCompensation(bestPred->cu, pu, bestPred->predYuv, false, true); if (m_param->rdLevel) @@ -1766,7 +2014,7 @@ } /* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */ -void Analysis::checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isShareMergeCand) +void Analysis::checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom) { uint32_t depth = cuGeom.depth; @@ -1794,19 +2042,13 @@ bool triedPZero = false, triedBZero = false; bestPred->rdCost = MAX_INT64; - uint32_t first = 0, last = numMergeCand; - if (isShareMergeCand) - { - first = *m_reuseBestMergeCand; - last = first + 1; - } int safeX, maxSafeMv; if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE) { safeX = m_slice->m_refFrameList[0][0]->m_encData->m_pir.pirEndCol * g_maxCUSize - 3; maxSafeMv = (safeX - tempPred->cu.m_cuPelX) * 4; } - for (uint32_t i = first; i < last; i++) + for (uint32_t i = 0; i < numMergeCand; i++) { if (m_bFrameParallel && (candMvField[i][0].mv.y >= (m_param->searchRange + 1) * 4 || @@ -1845,8 +2087,7 @@ uint8_t hasCbf = true; bool swapped = false; - /* bypass encoding merge with residual if analysis-mode = load as only SKIP CUs enter this function */ - if (!foundCbf0Merge && !isShareMergeCand) + if (!foundCbf0Merge) { /* if the best prediction has CBF (not a skip) then try merge with residual */ @@ -1896,13 +2137,6 @@ bestPred->cu.setPURefIdx(1, (int8_t)candMvField[bestCand][1].refIdx, 0, 0); checkDQP(*bestPred, cuGeom); } - - if (m_param->analysisMode) - { - if (m_param->analysisMode == X265_ANALYSIS_SAVE) - *m_reuseBestMergeCand = bestPred->cu.m_mvpIdx[0][0]; - m_reuseBestMergeCand++; - } } void Analysis::checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refMask[2]) @@ -1914,28 +2148,25 @@ if (m_param->analysisMode == X265_ANALYSIS_LOAD && m_reuseInterDataCTU) { + int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; + int index = 0; + uint32_t numPU = interMode.cu.getNumPartInter(0); for (uint32_t part = 0; part < numPU; part++) { MotionData* bestME = interMode.bestME[part]; for (int32_t i = 0; i < numPredDir; i++) - { - bestME[i].ref = *m_reuseRef; - m_reuseRef++; - 
- bestME[i].mv = *m_reuseMv; - m_reuseMv++; - } + bestME[i].ref = m_reuseRef[refOffset + index++]; } } - predInterSearch(interMode, cuGeom, m_bChromaSa8d && (m_csp != X265_CSP_I400), refMask); + predInterSearch(interMode, cuGeom, m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400), refMask); /* predInterSearch sets interMode.sa8dBits */ const Yuv& fencYuv = *interMode.fencYuv; Yuv& predYuv = interMode.predYuv; int part = partitionFromLog2Size(cuGeom.log2CUSize); interMode.distortion = primitives.cu[part].sa8d(fencYuv.m_buf[0], fencYuv.m_size, predYuv.m_buf[0], predYuv.m_size); - if (m_bChromaSa8d && (m_csp != X265_CSP_I400)) + if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)) { interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, predYuv.m_buf[1], predYuv.m_csize); interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[2], fencYuv.m_csize, predYuv.m_buf[2], predYuv.m_csize); @@ -1944,20 +2175,15 @@ if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_reuseInterDataCTU) { + int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; + int index = 0; + uint32_t numPU = interMode.cu.getNumPartInter(0); for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { - PredictionUnit pu(interMode.cu, cuGeom, puIdx); MotionData* bestME = interMode.bestME[puIdx]; for (int32_t i = 0; i < numPredDir; i++) - { - if (bestME[i].ref >= 0) - *m_reuseMv = getLowresMV(interMode.cu, pu, i, bestME[i].ref); - - *m_reuseRef = bestME[i].ref; - m_reuseRef++; - m_reuseMv++; - } + m_reuseRef[refOffset + index++] = bestME[i].ref; } } } @@ -1971,41 +2197,33 @@ if (m_param->analysisMode == X265_ANALYSIS_LOAD && m_reuseInterDataCTU) { + int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; + int index = 0; + uint32_t numPU = interMode.cu.getNumPartInter(0); for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { MotionData* bestME = interMode.bestME[puIdx]; for (int32_t i = 0; i < numPredDir; i++) - { - bestME[i].ref = *m_reuseRef; - m_reuseRef++; - - bestME[i].mv = *m_reuseMv; - m_reuseMv++; - } + bestME[i].ref = m_reuseRef[refOffset + index++]; } } - predInterSearch(interMode, cuGeom, m_csp != X265_CSP_I400, refMask); + predInterSearch(interMode, cuGeom, m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400, refMask); /* predInterSearch sets interMode.sa8dBits, but this is ignored */ encodeResAndCalcRdInterCU(interMode, cuGeom); if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_reuseInterDataCTU) { + int refOffset = cuGeom.geomRecurId * 16 * numPredDir + partSize * numPredDir * 2; + int index = 0; + uint32_t numPU = interMode.cu.getNumPartInter(0); for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { - PredictionUnit pu(interMode.cu, cuGeom, puIdx); MotionData* bestME = interMode.bestME[puIdx]; for (int32_t i = 0; i < numPredDir; i++) - { - if (bestME[i].ref >= 0) - *m_reuseMv = getLowresMV(interMode.cu, pu, i, bestME[i].ref); - - *m_reuseRef = bestME[i].ref; - m_reuseRef++; - m_reuseMv++; - } + m_reuseRef[refOffset + index++] = bestME[i].ref; } } } @@ -2053,10 +2271,10 @@ cu.m_mvd[1][0] = bestME[1].mv - mvp1; PredictionUnit pu(cu, cuGeom, 0); - motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400)); + motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)); int sa8d = 
primitives.cu[partEnum].sa8d(fencYuv.m_buf[0], fencYuv.m_size, bidir2Nx2N.predYuv.m_buf[0], bidir2Nx2N.predYuv.m_size); - if (m_bChromaSa8d && (m_csp != X265_CSP_I400)) + if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)) { /* Add in chroma distortion */ sa8d += primitives.chroma[m_csp].cu[partEnum].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, bidir2Nx2N.predYuv.m_buf[1], bidir2Nx2N.predYuv.m_csize); @@ -2087,7 +2305,7 @@ int zsa8d; - if (m_bChromaSa8d && (m_csp != X265_CSP_I400)) + if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)) { cu.m_mv[0][0] = mvzero; cu.m_mv[1][0] = mvzero; @@ -2135,9 +2353,9 @@ if (m_bChromaSa8d) /* real MC was already performed */ bidir2Nx2N.predYuv.copyFromYuv(tmpPredYuv); else - motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_csp != X265_CSP_I400); + motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400); } - else if (m_bChromaSa8d && (m_csp != X265_CSP_I400)) + else if (m_bChromaSa8d && (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)) { /* recover overwritten motion vectors */ cu.m_mv[0][0] = bestME[0].mv; @@ -2183,7 +2401,7 @@ cu.getIntraTUQtDepthRange(tuDepthRange, 0); residualTransformQuantIntra(*bestMode, cuGeom, 0, 0, tuDepthRange); - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { getBestIntraModeChroma(*bestMode, cuGeom); residualQTIntraChroma(*bestMode, cuGeom, 0, 0); @@ -2207,7 +2425,7 @@ fencYuv.m_buf[0], predY, fencYuv.m_size, predYuv.m_size); - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { pixel* predU = predYuv.getCbAddr(absPartIdx); pixel* predV = predYuv.getCrAddr(absPartIdx); @@ -2237,7 +2455,7 @@ else primitives.cu[sizeIdx].copy_pp(reconPic.getLumaAddr(cu.m_cuAddr, absPartIdx), reconPic.m_stride, predY, predYuv.m_size); - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { pixel* predU = predYuv.getCbAddr(absPartIdx); pixel* predV = predYuv.getCrAddr(absPartIdx); @@ -2257,7 +2475,7 @@ } } - cu.updatePic(cuGeom.depth); + cu.updatePic(cuGeom.depth, m_frame->m_fencPic->m_picCsp); } void Analysis::addSplitFlagCost(Mode& mode, uint32_t depth) @@ -2390,6 +2608,30 @@ return false; } + +bool Analysis::complexityCheckCU(const Mode& bestMode) +{ + uint32_t mean = 0; + uint32_t homo = 0; + uint32_t cuSize = bestMode.fencYuv->m_size; + for (uint32_t y = 0; y < cuSize; y++) { + for (uint32_t x = 0; x < cuSize; x++) { + mean += (bestMode.fencYuv->m_buf[0][y * cuSize + x]); + } + } + mean = mean / (cuSize * cuSize); + for (uint32_t y = 0 ; y < cuSize; y++){ + for (uint32_t x = 0 ; x < cuSize; x++){ + homo += abs(int(bestMode.fencYuv->m_buf[0][y * cuSize + x] - mean)); + } + } + homo = homo / (cuSize * cuSize); + + if (homo < (.1 * mean)) + return true; + + return false; +} int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom, double baseQp) {
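Note on the new Analysis::complexityCheckCU() defined above: it is a simple flatness test — compute the mean of the CU's luma samples, then the mean absolute deviation from that mean, and report the CU as homogeneous when the deviation is under 10% of the mean. A standalone sketch of the same heuristic, assuming an 8-bit luma block whose stride equals its width (the diff reuses fencYuv->m_size for both):

    // Flatness test as in complexityCheckCU(): a block qualifies when the
    // mean absolute deviation of its samples is below 10% of their mean.
    #include <cstdint>
    #include <cstdlib>

    bool isHomogeneousBlock(const uint8_t* luma, uint32_t cuSize)
    {
        uint32_t mean = 0;
        for (uint32_t y = 0; y < cuSize; y++)
            for (uint32_t x = 0; x < cuSize; x++)
                mean += luma[y * cuSize + x];
        mean /= cuSize * cuSize;

        uint32_t homo = 0;
        for (uint32_t y = 0; y < cuSize; y++)
            for (uint32_t x = 0; x < cuSize; x++)
                homo += abs(int(luma[y * cuSize + x]) - int(mean));
        homo /= cuSize * cuSize;

        return homo < .1 * mean;   // same 10% cutoff as the diff
    }

Presumably this feeds the recursion-skip heuristics (bEnableRecursionSkip) seen earlier in the diff, which is why it is grouped with the other work-avoidance checks in the header below.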
View file
x265_1.9.tar.gz/source/encoder/analysis.h -> x265_2.0.tar.gz/source/encoder/analysis.h
Changed
@@ -108,6 +108,7 @@
     ModeDepth m_modeDepth[NUM_CU_DEPTH];
     bool m_bTryLossless;
     bool m_bChromaSa8d;
+    bool m_bHD;

     Analysis();

@@ -117,12 +118,19 @@
     Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext);

 protected:
-    /* Analysis data for load/save modes, keeps getting incremented as CTU analysis proceeds and data is consumed or read */
+    /* Analysis data for save/load mode, writes/reads data based on absPartIdx */
     analysis_inter_data* m_reuseInterDataCTU;
-    MV* m_reuseMv;
     int32_t* m_reuseRef;
-    uint32_t* m_reuseBestMergeCand;
+    uint8_t* m_reuseDepth;
+    uint8_t* m_reuseModes;
+    uint8_t* m_reusePartSize;
+    uint8_t* m_reuseMergeFlag;
+    uint32_t m_splitRefIdx[4];
+    uint64_t* cacheCost;
+
+    /* refine RD based on QP for rd-levels 5 and 6 */
+    void qprdRefine(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp, int32_t lqp);

     /* full analysis for an I-slice CU */
     void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
@@ -130,11 +138,13 @@
     /* full analysis for a P or B slice CU */
     uint32_t compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
     SplitData compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
-    SplitData compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
+    SplitData compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+
+    void recodeCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp, int32_t origqp = -1);

     /* measure merge and skip */
     void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom);
-    void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isShareMergeCand);
+    void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom);

     /* measure inter options */
     void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refmask[2]);
@@ -151,6 +161,7 @@
     /* work-avoidance heuristics for RD levels < 5 */
     uint32_t topSkipMinDepth(const CUData& parentCTU, const CUGeom& cuGeom);
     bool recursionDepthCheck(const CUData& parentCTU, const CUGeom& cuGeom, const Mode& bestMode);
+    bool complexityCheckCU(const Mode& bestMode);

     /* generate residual and recon pixels for an entire CTU recursively (RD0) */
     void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);
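Note how the reuse buffers changed shape here: the pointers that were incremented as analysis proceeded (m_reuseMv, m_reuseBestMergeCand) are gone, and the new per-partition arrays (m_reuseDepth, m_reuseModes, m_reusePartSize, m_reuseMergeFlag) are indexed directly by cuGeom.absPartIdx. On disk, though, the encoder.cpp hunks further down store only one record per coded CU (depthBytes of them) and expand to per-partition granularity at load time. A minimal sketch of that expansion, with names borrowed from the diff and the output buffers assumed allocated by the caller:

    #include <cstdint>
    #include <cstring>
    #include <cstddef>

    // depthBuf/modeBuf hold one entry per coded CU; the outputs hold one
    // entry per 4x4 partition (numPartitions per CTU). A CU at depth d
    // covers numPartitions >> (d * 2) partitions, so a single memset per
    // record rebuilds the full-resolution buffers.
    void expandPerCU(const uint8_t* depthBuf, const uint8_t* modeBuf,
                     uint32_t depthBytes, uint32_t numPartitions,
                     uint8_t* reuseDepth, uint8_t* reuseModes)
    {
        size_t count = 0;
        for (uint32_t d = 0; d < depthBytes; d++)
        {
            uint32_t bytes = numPartitions >> (depthBuf[d] * 2);
            memset(&reuseDepth[count], depthBuf[d], bytes);
            memset(&reuseModes[count], modeBuf[d], bytes);
            count += bytes;
        }
    }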
View file
x265_1.9.tar.gz/source/encoder/api.cpp -> x265_2.0.tar.gz/source/encoder/api.cpp
Changed
@@ -166,15 +166,20 @@
     x265_param save;
     Encoder* encoder = static_cast<Encoder*>(enc);
+    if (encoder->m_reconfigure) /* Reconfigure in progress */
+        return 1;
     memcpy(&save, encoder->m_latestParam, sizeof(x265_param));
     int ret = encoder->reconfigureParam(encoder->m_latestParam, param_in);
     if (ret)
+    {
         /* reconfigure failed, recover saved param set */
         memcpy(encoder->m_latestParam, &save, sizeof(x265_param));
+        ret = -1;
+    }
     else
     {
-        encoder->m_reconfigured = true;
-        x265_print_reconfigured_params(&save, encoder->m_latestParam);
+        encoder->m_reconfigure = true;
+        encoder->printReconfigureParams();
     }
     return ret;
 }
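After this change x265_encoder_reconfig() distinguishes three outcomes instead of two: 0 when the new parameters are accepted and staged, 1 when an earlier reconfigure is still being propagated to the frame encoders, and -1 when validation fails (the previous parameter set is restored). A hypothetical caller, assuming enc is an already-open encoder handle:

    #include <cstdio>
    #include <x265.h>

    // Hypothetical application-side handling of the three return values.
    void changeRefs(x265_encoder* enc)
    {
        x265_param* p = x265_param_alloc();
        x265_encoder_parameters(enc, p);   // start from the active settings
        p->maxNumReferences = 2;           // one of the reconfigurable fields

        int ret = x265_encoder_reconfig(enc, p);
        if (ret < 0)
            fprintf(stderr, "reconfig rejected; previous settings restored\n");
        else if (ret == 1)
            fprintf(stderr, "earlier reconfig still in flight; retry later\n");

        x265_param_free(p);
    }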
View file
x265_1.9.tar.gz/source/encoder/dpb.cpp -> x265_2.0.tar.gz/source/encoder/dpb.cpp
Changed
@@ -146,8 +146,8 @@
     // Mark pictures in m_piclist as unreferenced if they are not included in RPS
     applyReferencePictureSet(&slice->m_rps, pocCurr);

-    slice->m_numRefIdx[0] = X265_MIN(m_maxRefL0, slice->m_rps.numberOfNegativePictures); // Ensuring L0 contains just the -ve POC
-    slice->m_numRefIdx[1] = X265_MIN(m_maxRefL1, slice->m_rps.numberOfPositivePictures);
+    slice->m_numRefIdx[0] = X265_MIN(newFrame->m_param->maxNumReferences, slice->m_rps.numberOfNegativePictures); // Ensuring L0 contains just the -ve POC
+    slice->m_numRefIdx[1] = X265_MIN(newFrame->m_param->bBPyramid ? 2 : 1, slice->m_rps.numberOfPositivePictures);

     slice->setRefPicList(m_picList);

     X265_CHECK(slice->m_sliceType != B_SLICE || slice->m_numRefIdx[1], "B slice without L1 references (non-fatal)\n");
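This is the consumer side of the reconfigure plumbing: instead of freezing the list sizes at DPB construction (the m_maxRefL0/m_maxRefL1 members deleted from dpb.h just below), each slice now clamps them against the frame's own, possibly reconfigured, parameter set. A pure-function sketch of the clamp (X265_MIN in the diff is x265's min macro):

    // L0 follows the configured reference count; L1 needs at most two
    // forward references with a B-pyramid, otherwise one.
    int clampL0(int maxNumReferences, int negativePics)
    {
        return maxNumReferences < negativePics ? maxNumReferences : negativePics;
    }

    int clampL1(bool bBPyramid, int positivePics)
    {
        int cap = bBPyramid ? 2 : 1;
        return cap < positivePics ? cap : positivePics;
    }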
View file
x265_1.9.tar.gz/source/encoder/dpb.h -> x265_2.0.tar.gz/source/encoder/dpb.h
Changed
@@ -39,8 +39,6 @@
     int m_lastIDR;
     int m_pocCRA;
-    int m_maxRefL0;
-    int m_maxRefL1;
     int m_bOpenGOP;
     bool m_bRefreshPending;
     bool m_bTemporalSublayer;

@@ -54,8 +52,6 @@
         m_pocCRA = 0;
         m_bRefreshPending = false;
         m_frameDataFreeList = NULL;
-        m_maxRefL0 = param->maxNumReferences;
-        m_maxRefL1 = param->bBPyramid ? 2 : 1;
         m_bOpenGOP = param->bOpenGOP;
         m_bTemporalSublayer = !!param->bEnableTemporalSubLayers;
     }
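With these cached limits gone, dpb.h keeps no per-stream snapshot of the reference configuration at all. The encoder.cpp diff that follows shows the other half of the hand-off: the staged x265_param set (m_latestParam) is handed to each frame encoder in turn, and only after one full round-robin cycle has observed the flag does it become the live m_param. Condensed from that hunk (fragment, names as in the diff):

    // Once every frame encoder has picked up the staged reconfigure,
    // commit m_latestParam as the live parameter set.
    if (curEncoder->m_reconfigure)
    {
        for (int i = 0; i < m_param->frameNumThreads; i++)
            m_frameEncoder[i]->m_reconfigure = false;
        memcpy(m_param, m_latestParam, sizeof(x265_param));
        m_reconfigure = false;
    }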
View file
x265_1.9.tar.gz/source/encoder/encoder.cpp -> x265_2.0.tar.gz/source/encoder/encoder.cpp
Changed
@@ -55,7 +55,7 @@ Encoder::Encoder() { m_aborted = false; - m_reconfigured = false; + m_reconfigure = false; m_encodedFrameNum = 0; m_pocLast = -1; m_curEncoder = 0; @@ -361,7 +361,10 @@ } if (m_threadPool) - m_threadPool->stopWorkers(); + { + for (int i = 0; i < m_numPools; i++) + m_threadPool[i].stopWorkers(); + } } void Encoder::destroy() @@ -508,12 +511,6 @@ if (pic_in) { - if (pic_in->colorSpace != m_param->internalCsp) - { - x265_log(m_param, X265_LOG_ERROR, "Unsupported chroma subsampling (%d) on input\n", - pic_in->colorSpace); - return -1; - } if (pic_in->bitDepth < 8 || pic_in->bitDepth > 16) { x265_log(m_param, X265_LOG_ERROR, "Input bit depth (%d) must be between 8 and 16\n", @@ -525,7 +522,7 @@ if (m_dpb->m_freeList.empty()) { inFrame = new Frame; - x265_param* p = m_reconfigured? m_latestParam : m_param; + x265_param* p = m_reconfigure ? m_latestParam : m_param; if (inFrame->create(p, pic_in->quantOffsets)) { /* the first PicYuv created is asked to generate the CU and block unit offset @@ -535,7 +532,7 @@ { inFrame->m_fencPic->m_cuOffsetY = m_sps.cuOffsetY; inFrame->m_fencPic->m_buOffsetY = m_sps.buOffsetY; - if (pic_in->colorSpace != X265_CSP_I400) + if (m_param->internalCsp != X265_CSP_I400) { inFrame->m_fencPic->m_cuOffsetC = m_sps.cuOffsetC; inFrame->m_fencPic->m_buOffsetC = m_sps.buOffsetC; @@ -555,7 +552,7 @@ { m_sps.cuOffsetY = inFrame->m_fencPic->m_cuOffsetY; m_sps.buOffsetY = inFrame->m_fencPic->m_buOffsetY; - if (pic_in->colorSpace != X265_CSP_I400) + if (m_param->internalCsp != X265_CSP_I400) { m_sps.cuOffsetC = inFrame->m_fencPic->m_cuOffsetC; m_sps.cuOffsetY = inFrame->m_fencPic->m_cuOffsetY; @@ -591,7 +588,7 @@ inFrame->m_userData = pic_in->userData; inFrame->m_pts = pic_in->pts; inFrame->m_forceqp = pic_in->forceqp; - inFrame->m_param = m_reconfigured ? m_latestParam : m_param; + inFrame->m_param = m_reconfigure ? m_latestParam : m_param; if (pic_in->quantOffsets != NULL) { @@ -719,7 +716,7 @@ pic_out->analysisData.numPartitions = outFrame->m_analysisData.numPartitions; pic_out->analysisData.interData = outFrame->m_analysisData.interData; pic_out->analysisData.intraData = outFrame->m_analysisData.intraData; - writeAnalysisFile(&pic_out->analysisData); + writeAnalysisFile(&pic_out->analysisData, *outFrame->m_encData); freeAnalysis(&pic_out->analysisData); } } @@ -780,6 +777,27 @@ if (m_rateControl->writeRateControlFrameStats(outFrame, &curEncoder->m_rce)) m_aborted = true; + if (pic_out && m_param->rc.bStatWrite) + { + /* m_rcData is allocated for every frame */ + pic_out->rcData = outFrame->m_rcData; + outFrame->m_rcData->qpaRc = outFrame->m_encData->m_avgQpRc; + outFrame->m_rcData->qRceq = curEncoder->m_rce.qRceq; + outFrame->m_rcData->qpNoVbv = curEncoder->m_rce.qpNoVbv; + outFrame->m_rcData->coeffBits = outFrame->m_encData->m_frameStats.coeffBits; + outFrame->m_rcData->miscBits = outFrame->m_encData->m_frameStats.miscBits; + outFrame->m_rcData->mvBits = outFrame->m_encData->m_frameStats.mvBits; + outFrame->m_rcData->qScale = outFrame->m_rcData->newQScale = x265_qp2qScale(outFrame->m_encData->m_avgQpRc); + outFrame->m_rcData->poc = curEncoder->m_rce.poc; + outFrame->m_rcData->encodeOrder = curEncoder->m_rce.encodeOrder; + outFrame->m_rcData->sliceType = curEncoder->m_rce.sliceType; + outFrame->m_rcData->keptAsRef = curEncoder->m_rce.sliceType == B_SLICE && !IS_REFERENCED(outFrame) ? 
0 : 1; + outFrame->m_rcData->qpAq = outFrame->m_encData->m_avgQpAq; + outFrame->m_rcData->iCuCount = outFrame->m_encData->m_frameStats.percent8x8Intra * m_rateControl->m_ncu; + outFrame->m_rcData->pCuCount = outFrame->m_encData->m_frameStats.percent8x8Inter * m_rateControl->m_ncu; + outFrame->m_rcData->skipCuCount = outFrame->m_encData->m_frameStats.percent8x8Skip * m_rateControl->m_ncu; + } + /* Allow this frame to be recycled if no frame encoders are using it for reference */ if (!pic_out) { @@ -800,16 +818,32 @@ frameEnc = m_lookahead->getDecidedPicture(); if (frameEnc && !pass) { + if (curEncoder->m_reconfigure) + { + /* One round robin cycle of FE reconfigure is complete */ + /* Safe to copy m_latestParam to Encoder::m_param, encoder reconfigure complete */ + for (int frameEncId = 0; frameEncId < m_param->frameNumThreads; frameEncId++) + m_frameEncoder[frameEncId]->m_reconfigure = false; + memcpy (m_param, m_latestParam, sizeof(x265_param)); + m_reconfigure = false; + } + + /* Initiate reconfigure for this FE if necessary */ + curEncoder->m_param = m_reconfigure ? m_latestParam : m_param; + curEncoder->m_reconfigure = m_reconfigure; + /* give this frame a FrameData instance before encoding */ if (m_dpb->m_frameDataFreeList) { frameEnc->m_encData = m_dpb->m_frameDataFreeList; m_dpb->m_frameDataFreeList = m_dpb->m_frameDataFreeList->m_freeListNext; frameEnc->reinit(m_sps); + frameEnc->m_param = m_reconfigure ? m_latestParam : m_param; + frameEnc->m_encData->m_param = m_reconfigure ? m_latestParam : m_param; } else { - frameEnc->allocEncodeData(m_param, m_sps); + frameEnc->allocEncodeData(m_reconfigure ? m_latestParam : m_param, m_sps); Slice* slice = frameEnc->m_encData->m_slice; slice->m_sps = &m_sps; slice->m_pps = &m_pps; @@ -817,7 +851,7 @@ slice->m_endCUAddr = slice->realEndAddress(m_sps.numCUsInFrame * NUM_4x4_PARTITIONS); } - curEncoder->m_rce.encodeOrder = m_encodedFrameNum++; + curEncoder->m_rce.encodeOrder = frameEnc->m_encodeOrder = m_encodedFrameNum++; if (m_bframeDelay) { int64_t *prevReorderedPts = m_prevReorderedPts; @@ -867,28 +901,23 @@ int Encoder::reconfigureParam(x265_param* encParam, x265_param* param) { encParam->maxNumReferences = param->maxNumReferences; // never uses more refs than specified in stream headers - encParam->bEnableLoopFilter = param->bEnableLoopFilter; - encParam->deblockingFilterTCOffset = param->deblockingFilterTCOffset; - encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset; encParam->bEnableFastIntra = param->bEnableFastIntra; encParam->bEnableEarlySkip = param->bEnableEarlySkip; - encParam->bEnableTemporalMvp = param->bEnableTemporalMvp; - /* Scratch buffer prevents me_range from being increased for esa/tesa - if (param->searchMethod < X265_FULL_SEARCH || param->searchMethod < encParam->searchRange) - encParam->searchRange = param->searchRange; */ - encParam->noiseReductionInter = param->noiseReductionInter; - encParam->noiseReductionIntra = param->noiseReductionIntra; + encParam->bEnableRecursionSkip = param->bEnableRecursionSkip; + encParam->searchMethod = param->searchMethod; + /* Scratch buffer prevents me_range from being increased for esa/tesa */ + if (param->searchRange < encParam->searchRange) + encParam->searchRange = param->searchRange; /* We can't switch out of subme=0 during encoding. 
*/ if (encParam->subpelRefine) encParam->subpelRefine = param->subpelRefine; encParam->rdoqLevel = param->rdoqLevel; encParam->rdLevel = param->rdLevel; - encParam->bEnableTSkipFast = param->bEnableTSkipFast; - encParam->psyRd = param->psyRd; - encParam->psyRdoq = param->psyRdoq; - encParam->bEnableSignHiding = param->bEnableSignHiding; - encParam->bEnableFastIntra = param->bEnableFastIntra; - encParam->maxTUSize = param->maxTUSize; + encParam->bEnableRectInter = param->bEnableRectInter; + encParam->maxNumMergeCand = param->maxNumMergeCand; + encParam->bIntraInBFrames = param->bIntraInBFrames; + /* To add: Loop Filter/deblocking controls, transform skip, signhide require PPS to be resent */ + /* To add: SAO, temporal MVP, AMP, TU depths require SPS to be resent, at every CVS boundary */ return x265_check_params(encParam); } @@ -1214,12 +1243,6 @@ stats->maxCLL = m_analyzeAll.m_maxCLL; stats->maxFALL = (uint16_t)(m_analyzeAll.m_maxFALL / m_analyzeAll.m_numPics); - - if (m_emitCLLSEI) - { - m_param->maxCLL = stats->maxCLL; - m_param->maxFALL = stats->maxFALL; - } } /* If new statistics are added to x265_stats, we must check here whether the @@ -1304,7 +1327,7 @@ if (frameStats) { - const int picOrderCntLSB = (slice->m_poc - slice->m_lastIDR + (1 << BITS_FOR_POC)) % (1 << BITS_FOR_POC); + const int picOrderCntLSB = slice->m_poc - slice->m_lastIDR; frameStats->encoderOrder = m_outputCount++; frameStats->sliceType = c; @@ -1576,7 +1599,6 @@ void Encoder::configure(x265_param *p) { this->m_param = p; - if (p->keyframeMax < 0) { /* A negative max GOP size indicates the user wants only one I frame at @@ -1741,12 +1763,20 @@ x265_log(p, X265_LOG_WARNING, "Analysis load/save options incompatible with pmode/pme, Disabling pmode/pme\n"); p->bDistributeMotionEstimation = p->bDistributeModeAnalysis = 0; } + if (p->analysisMode && p->rc.cuTree) { x265_log(p, X265_LOG_WARNING, "Analysis load/save options works only with cu-tree off, Disabling cu-tree\n"); p->rc.cuTree = 0; } + if (p->rc.bEnableGrain) + { + x265_log(p, X265_LOG_WARNING, "Rc Grain removes qp fluctuations caused by aq/cutree, Disabling aq,cu-tree\n"); + p->rc.cuTree = 0; + p->rc.aqMode = 0; + } + if (p->bDistributeModeAnalysis && (p->limitReferences >> 1) && 1) { x265_log(p, X265_LOG_WARNING, "Limit reference options 2 and 3 are not supported with pmode. 
Disabling limit reference\n"); @@ -1815,20 +1845,10 @@ m_conformanceWindow.rightOffset = padsize; } - /* set pad size if height is not multiple of the minimum CU size */ - if (p->sourceHeight & (p->minCUSize - 1)) - { - uint32_t rem = p->sourceHeight & (p->minCUSize - 1); - uint32_t padsize = p->minCUSize - rem; - p->sourceHeight += padsize; - - m_conformanceWindow.bEnabled = true; - m_conformanceWindow.bottomOffset = padsize; - } - if (p->bDistributeModeAnalysis && p->analysisMode) + if (p->bEnableRdRefine && (p->rdLevel < 5 || !p->rc.aqMode)) { - p->analysisMode = X265_ANALYSIS_OFF; - x265_log(p, X265_LOG_WARNING, "Analysis save and load mode not supported for distributed mode analysis\n"); + p->bEnableRdRefine = false; + x265_log(p, X265_LOG_WARNING, "--rd-refine disabled, requires RD level > 4 and adaptive quant\n"); } bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; @@ -1848,6 +1868,112 @@ else m_param->rc.qgSize = p->maxCUSize; + if (p->uhdBluray) + { + p->bEnableAccessUnitDelimiters = 1; + p->vui.aspectRatioIdc = 1; + p->bEmitHRDSEI = 1; + int disableUhdBd = 0; + + if (p->levelIdc && p->levelIdc != 51) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: Wrong level specified, UHD Bluray mandates Level 5.1\n"); + } + p->levelIdc = 51; + + if (!p->bHighTier) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: Turning on high tier\n"); + p->bHighTier = 1; + } + + if (!p->bRepeatHeaders) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: Turning on repeat-headers\n"); + p->bRepeatHeaders = 1; + } + + if (p->bOpenGOP) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: Turning off open GOP\n"); + p->bOpenGOP = false; + } + + if (p->bIntraRefresh) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: turning off intra-refresh\n"); + p->bIntraRefresh = 0; + } + + if (p->keyframeMin != 1) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: keyframeMin is always 1\n"); + p->keyframeMin = 1; + } + + int fps = (p->fpsNum + p->fpsDenom - 1) / p->fpsDenom; + if (p->keyframeMax > fps) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: reducing keyframeMax to %d\n", fps); + p->keyframeMax = fps; + } + + if (p->maxNumReferences > 6) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: reducing references to 6\n"); + p->maxNumReferences = 6; + } + + if (p->bEnableTemporalSubLayers) + { + x265_log(p, X265_LOG_WARNING, "uhd-bd: Turning off temporal layering\n"); + p->bEnableTemporalSubLayers = 0; + } + + if (p->vui.colorPrimaries != 1 && p->vui.colorPrimaries != 9) + { + x265_log(p, X265_LOG_ERROR, "uhd-bd: colour primaries should be either BT.709 or BT.2020\n"); + disableUhdBd = 1; + } + else if (p->vui.colorPrimaries == 9) + { + p->vui.bEnableChromaLocInfoPresentFlag = 1; + p->vui.chromaSampleLocTypeTopField = 2; + p->vui.chromaSampleLocTypeBottomField = 2; + } + + if (p->vui.transferCharacteristics != 1 && p->vui.transferCharacteristics != 14 && p->vui.transferCharacteristics != 16) + { + x265_log(p, X265_LOG_ERROR, "uhd-bd: transfer characteristics supported are BT.709, BT.2020-10 or SMPTE ST.2084\n"); + disableUhdBd = 1; + } + if (p->vui.matrixCoeffs != 1 && p->vui.matrixCoeffs != 9) + { + x265_log(p, X265_LOG_ERROR, "uhd-bd: matrix coeffs supported are either BT.709 or BT.2020\n"); + disableUhdBd = 1; + } + if ((p->sourceWidth != 1920 && p->sourceWidth != 3840) || (p->sourceHeight != 1080 && p->sourceHeight != 2160)) + { + x265_log(p, X265_LOG_ERROR, "uhd-bd: Supported resolutions are 1920x1080 and 3840x2160\n"); + disableUhdBd = 1; + } + if (disableUhdBd) + { + p->uhdBluray = 0; + x265_log(p, X265_LOG_ERROR, "uhd-bd: 
Disabled\n"); + } + } + + /* set pad size if height is not multiple of the minimum CU size */ + if (p->sourceHeight & (p->minCUSize - 1)) + { + uint32_t rem = p->sourceHeight & (p->minCUSize - 1); + uint32_t padsize = p->minCUSize - rem; + p->sourceHeight += padsize; + m_conformanceWindow.bEnabled = true; + m_conformanceWindow.bottomOffset = padsize; + } + if (p->bLogCuStats) x265_log(p, X265_LOG_WARNING, "--cu-stats option is now deprecated\n"); @@ -1877,8 +2003,9 @@ CHECKED_MALLOC_ZERO(interData->ref, int32_t, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); CHECKED_MALLOC(interData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); CHECKED_MALLOC(interData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC_ZERO(interData->bestMergeCand, uint32_t, analysis->numCUsInFrame * CUGeom::MAX_GEOMS); - CHECKED_MALLOC_ZERO(interData->mv, MV, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); + CHECKED_MALLOC(interData->partSize, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(interData->mergeFlag, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(interData->wt, WeightParam, 3 * numDir); analysis->interData = interData; } return; @@ -1903,8 +2030,9 @@ X265_FREE(((analysis_inter_data*)analysis->interData)->ref); X265_FREE(((analysis_inter_data*)analysis->interData)->depth); X265_FREE(((analysis_inter_data*)analysis->interData)->modes); - X265_FREE(((analysis_inter_data*)analysis->interData)->bestMergeCand); - X265_FREE(((analysis_inter_data*)analysis->interData)->mv); + X265_FREE(((analysis_inter_data*)analysis->interData)->mergeFlag); + X265_FREE(((analysis_inter_data*)analysis->interData)->partSize); + X265_FREE(((analysis_inter_data*)analysis->interData)->wt); X265_FREE(analysis->interData); } } @@ -1923,10 +2051,12 @@ static uint64_t consumedBytes = 0; static uint64_t totalConsumedBytes = 0; + uint32_t depthBytes = 0; fseeko(m_analysisFile, totalConsumedBytes, SEEK_SET); int poc; uint32_t frameRecordSize; X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFile); + X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFile); X265_FREAD(&poc, sizeof(int), 1, m_analysisFile); uint64_t currentOffset = totalConsumedBytes; @@ -1937,6 +2067,7 @@ currentOffset += frameRecordSize; fseeko(m_analysisFile, currentOffset, SEEK_SET); X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFile); + X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFile); X265_FREAD(&poc, sizeof(int), 1, m_analysisFile); } @@ -1961,36 +2092,67 @@ if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { - X265_FREAD(((analysis_intra_data *)analysis->intraData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); + uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSizes = NULL; + + tempBuf = X265_MALLOC(uint8_t, depthBytes * 3); + X265_FREAD(tempBuf, sizeof(uint8_t), depthBytes * 3, m_analysisFile); + + depthBuf = tempBuf; + modeBuf = tempBuf + depthBytes; + partSizes = tempBuf + 2 * depthBytes; + + size_t count = 0; + for (uint32_t d = 0; d < depthBytes; d++) + { + int bytes = analysis->numPartitions >> (depthBuf[d] * 2); + memset(&((analysis_intra_data *)analysis->intraData)->depth[count], depthBuf[d], bytes); + memset(&((analysis_intra_data *)analysis->intraData)->chromaModes[count], modeBuf[d], bytes); + memset(&((analysis_intra_data *)analysis->intraData)->partSizes[count], 
partSizes[d], bytes); + count += bytes; + } X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FREAD(((analysis_intra_data *)analysis->intraData)->partSizes, sizeof(char), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FREAD(((analysis_intra_data *)analysis->intraData)->chromaModes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); + X265_FREE(tempBuf); analysis->sliceType = X265_TYPE_I; consumedBytes += frameRecordSize; } - else if (analysis->sliceType == X265_TYPE_P) - { - X265_FREAD(((analysis_inter_data *)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU, m_analysisFile); - consumedBytes += frameRecordSize; - totalConsumedBytes = consumedBytes; - } + else { - X265_FREAD(((analysis_inter_data *)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); - X265_FREAD(((analysis_inter_data *)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2, m_analysisFile); + uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSize = NULL, *mergeFlag = NULL; + + tempBuf = X265_MALLOC(uint8_t, depthBytes * 4); + X265_FREAD(tempBuf, sizeof(uint8_t), depthBytes * 4, m_analysisFile); + + depthBuf = tempBuf; + modeBuf = tempBuf + depthBytes; + partSize = modeBuf + depthBytes; + mergeFlag = partSize + depthBytes; + + size_t count = 0; + for (uint32_t d = 0; d < depthBytes; d++) + { + int bytes = analysis->numPartitions >> (depthBuf[d] * 2); + memset(&((analysis_inter_data *)analysis->interData)->depth[count], depthBuf[d], bytes); + memset(&((analysis_inter_data *)analysis->interData)->modes[count], modeBuf[d], bytes); + memset(&((analysis_inter_data *)analysis->interData)->partSize[count], partSize[d], bytes); + memset(&((analysis_inter_data *)analysis->interData)->mergeFlag[count], mergeFlag[d], bytes); + count += bytes; + } + + X265_FREE(tempBuf); + + int numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; + X265_FREAD(((analysis_inter_data *)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFile); + uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 
1 : 3; + X265_FREAD(((analysis_inter_data *)analysis->interData)->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFile); consumedBytes += frameRecordSize; + if (numDir == 1) + totalConsumedBytes = consumedBytes; } #undef X265_FREAD } -void Encoder::writeAnalysisFile(x265_analysis_data* analysis) +void Encoder::writeAnalysisFile(x265_analysis_data* analysis, FrameData &curEncData) { #define X265_FWRITE(val, size, writeSize, fileOffset)\ @@ -2002,26 +2164,82 @@ return;\ }\ - /* calculate frameRecordSize */ - analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(analysis->poc) + sizeof(analysis->sliceType) + - sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions) + sizeof(analysis->bScenecut) + sizeof(analysis->satdCost); + uint32_t depthBytes = 0; if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) - analysis->frameRecordSize += sizeof(uint8_t) * analysis->numCUsInFrame * analysis->numPartitions * 4; - else if (analysis->sliceType == X265_TYPE_P) { - analysis->frameRecordSize += sizeof(int32_t) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU; - analysis->frameRecordSize += sizeof(uint8_t) * analysis->numCUsInFrame * analysis->numPartitions * 2; - analysis->frameRecordSize += sizeof(uint32_t) * analysis->numCUsInFrame * CUGeom::MAX_GEOMS; - analysis->frameRecordSize += sizeof(MV) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU; + for (uint32_t cuAddr = 0; cuAddr < analysis->numCUsInFrame; cuAddr++) + { + uint8_t depth = 0; + uint8_t mode = 0; + uint8_t partSize = 0; + + CUData* ctu = curEncData.getPicCTU(cuAddr); + analysis_intra_data* intraDataCTU = (analysis_intra_data*)analysis->intraData; + + for (uint32_t absPartIdx = 0; absPartIdx < ctu->m_numPartitions; depthBytes++) + { + depth = ctu->m_cuDepth[absPartIdx]; + intraDataCTU->depth[depthBytes] = depth; + + mode = ctu->m_chromaIntraDir[absPartIdx]; + intraDataCTU->chromaModes[depthBytes] = mode; + + partSize = ctu->m_partSize[absPartIdx]; + intraDataCTU->partSizes[depthBytes] = partSize; + + absPartIdx += ctu->m_numPartitions >> (depth * 2); + } + memcpy(&intraDataCTU->modes[ctu->m_cuAddr * ctu->m_numPartitions], ctu->m_lumaIntraDir, sizeof(uint8_t)* ctu->m_numPartitions); + } } else { - analysis->frameRecordSize += sizeof(int32_t) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2; - analysis->frameRecordSize += sizeof(uint8_t) * analysis->numCUsInFrame * analysis->numPartitions * 2; - analysis->frameRecordSize += sizeof(uint32_t) * analysis->numCUsInFrame * CUGeom::MAX_GEOMS; - analysis->frameRecordSize += sizeof(MV) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2; + for (uint32_t cuAddr = 0; cuAddr < analysis->numCUsInFrame; cuAddr++) + { + uint8_t depth = 0; + uint8_t predMode = 0; + uint8_t partSize = 0; + uint8_t mergeFlag = 0; + + CUData* ctu = curEncData.getPicCTU(cuAddr); + analysis_inter_data* interDataCTU = (analysis_inter_data*)analysis->interData; + + for (uint32_t absPartIdx = 0; absPartIdx < ctu->m_numPartitions; depthBytes++) + { + depth = ctu->m_cuDepth[absPartIdx]; + interDataCTU->depth[depthBytes] = depth; + + predMode = ctu->m_predMode[absPartIdx]; + if (ctu->m_refIdx[1][absPartIdx] != -1) + predMode = 4; // used as indiacator if the block is coded as bidir + + interDataCTU->modes[depthBytes] = predMode; + + partSize = ctu->m_partSize[absPartIdx]; + interDataCTU->partSize[depthBytes] = partSize; + + mergeFlag = ctu->m_mergeFlag[absPartIdx]; + interDataCTU->mergeFlag[depthBytes] = mergeFlag; + + absPartIdx += 
ctu->m_numPartitions >> (depth * 2); + } + } + } + + /* calculate frameRecordSize */ + analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(depthBytes) + sizeof(analysis->poc) + sizeof(analysis->sliceType) + + sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions) + sizeof(analysis->bScenecut) + sizeof(analysis->satdCost); + if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) + analysis->frameRecordSize += sizeof(uint8_t)* analysis->numCUsInFrame * analysis->numPartitions + depthBytes * 3; + else + { + int numDir = (analysis->sliceType == X265_TYPE_P) ? 1 : 2; + analysis->frameRecordSize += depthBytes * 4; + analysis->frameRecordSize += sizeof(int32_t)* analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir; + analysis->frameRecordSize += sizeof(WeightParam)* 3 * numDir; } X265_FWRITE(&analysis->frameRecordSize, sizeof(uint32_t), 1, m_analysisFile); + X265_FWRITE(&depthBytes, sizeof(uint32_t), 1, m_analysisFile); X265_FWRITE(&analysis->poc, sizeof(int), 1, m_analysisFile); X265_FWRITE(&analysis->sliceType, sizeof(int), 1, m_analysisFile); X265_FWRITE(&analysis->bScenecut, sizeof(int), 1, m_analysisFile); @@ -2031,26 +2249,46 @@ if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->depth, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_intra_data*)analysis->intraData)->partSizes, sizeof(char), depthBytes, m_analysisFile); X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->partSizes, sizeof(char), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - } - else if (analysis->sliceType == X265_TYPE_P) - { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU, m_analysisFile); } else { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); - 
X265_FWRITE(((analysis_inter_data*)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2, m_analysisFile); + int numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; + X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->partSize, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFile); + uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 1 : 3; + X265_FWRITE(((analysis_inter_data*)analysis->interData)->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFile); } #undef X265_FWRITE } + +void Encoder::printReconfigureParams() +{ + if (!m_reconfigure) + return; + x265_param* oldParam = m_param; + x265_param* newParam = m_latestParam; + + x265_log(newParam, X265_LOG_INFO, "Reconfigured param options, input Frame: %d\n", m_pocLast + 1); + + char tmp[40]; +#define TOOLCMP(COND1, COND2, STR) if (COND1 != COND2) { sprintf(tmp, STR, COND1, COND2); x265_log(newParam, X265_LOG_INFO, tmp); } + TOOLCMP(oldParam->maxNumReferences, newParam->maxNumReferences, "ref=%d to %d\n"); + TOOLCMP(oldParam->bEnableFastIntra, newParam->bEnableFastIntra, "fast-intra=%d to %d\n"); + TOOLCMP(oldParam->bEnableEarlySkip, newParam->bEnableEarlySkip, "early-skip=%d to %d\n"); + TOOLCMP(oldParam->bEnableRecursionSkip, newParam->bEnableRecursionSkip, "rskip=%d to %d\n"); + TOOLCMP(oldParam->searchMethod, newParam->searchMethod, "me=%d to %d\n"); + TOOLCMP(oldParam->searchRange, newParam->searchRange, "merange=%d to %d\n"); + TOOLCMP(oldParam->subpelRefine, newParam->subpelRefine, "subme= %d to %d\n"); + TOOLCMP(oldParam->rdLevel, newParam->rdLevel, "rd=%d to %d\n"); + TOOLCMP(oldParam->rdoqLevel, newParam->rdoqLevel, "rdoq=%d to %d\n" ); + TOOLCMP(oldParam->bEnableRectInter, newParam->bEnableRectInter, "rect=%d to %d\n"); + TOOLCMP(oldParam->maxNumMergeCand, newParam->maxNumMergeCand, "max-merge=%d to %d\n"); + TOOLCMP(oldParam->bIntraInBFrames, newParam->bIntraInBFrames, "b-intra=%d to %d\n"); +}
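Note on the analysis-file refactor above: instead of dumping one byte per 4x4 partition, the new format stores one byte per coded block (depthBytes of them) and the reader re-expands each entry across the numPartitions >> (depth * 2) partitions it covers. A minimal standalone sketch of that expansion, with hypothetical buffer names:

#include <cstring>
#include <cstdint>
#include <cstddef>

// Expand depth-compressed analysis data: one byte per coded block in the file,
// replicated across every 4x4 partition that block covers in memory.
static size_t expandDepths(const uint8_t* packed, uint32_t depthBytes,
                           uint8_t* full, uint32_t numPartitions)
{
    size_t count = 0;
    for (uint32_t d = 0; d < depthBytes; d++)
    {
        uint32_t span = numPartitions >> (packed[d] * 2); // partitions per block
        memset(full + count, packed[d], span);
        count += span;
    }
    return count;
}

The same run-length idea is applied to modes, partition sizes and merge flags, which is why frameRecordSize now scales with depthBytes rather than with numCUsInFrame * numPartitions.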
View file
x265_1.9.tar.gz/source/encoder/encoder.h -> x265_2.0.tar.gz/source/encoder/encoder.h
Changed
@@ -74,6 +74,7 @@ class Lookahead; class RateControl; class ThreadPool; +class FrameData; class Encoder : public x265_encoder { @@ -110,7 +111,7 @@ Frame* m_exportedPic; FILE* m_analysisFile; x265_param* m_param; - x265_param* m_latestParam; + x265_param* m_latestParam; // Holds latest param during a reconfigure RateControl* m_rateControl; Lookahead* m_lookahead; @@ -129,7 +130,7 @@ bool m_emitCLLSEI; bool m_bZeroLatency; // x265_encoder_encode() returns NALs for the input picture, zero lag bool m_aborted; // fatal error detected - bool m_reconfigured; // reconfigure of encoder detected + bool m_reconfigure; // Encoder reconfigure in progress /* Begin intra refresh when one not in progress or else begin one as soon as the current * one is done. Requires bIntraRefresh to be set.*/ @@ -152,6 +153,8 @@ void printSummary(); + void printReconfigureParams(); + char* statsString(EncStats&, char*); void configure(x265_param *param); @@ -164,7 +167,7 @@ void readAnalysisFile(x265_analysis_data* analysis, int poc); - void writeAnalysisFile(x265_analysis_data* pic); + void writeAnalysisFile(x265_analysis_data* pic, FrameData &curEncData); void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, x265_frame_stats* frameStats, int inPoc);
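The new forward declaration works because writeAnalysisFile() only takes FrameData by reference, so the compiler never needs the type's size at the point of declaration. Illustrative pattern only, not x265 code:

class FrameData;                                   // declaration only
void writeAnalysisFile(int poc, FrameData& cur);   // OK: a reference needs no definition
// FrameData tmp;                                  // would not compile: full type required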
View file
x265_1.9.tar.gz/source/encoder/entropy.cpp -> x265_2.0.tar.gz/source/encoder/entropy.cpp
Changed
@@ -38,6 +38,189 @@ namespace X265_NS { +// initial probability for cu_transquant_bypass flag +static const uint8_t INIT_CU_TRANSQUANT_BYPASS_FLAG[3][NUM_TQUANT_BYPASS_FLAG_CTX] = +{ + { 154 }, + { 154 }, + { 154 }, +}; + +// initial probability for split flag +static const uint8_t INIT_SPLIT_FLAG[3][NUM_SPLIT_FLAG_CTX] = +{ + { 107, 139, 126, }, + { 107, 139, 126, }, + { 139, 141, 157, }, +}; + +static const uint8_t INIT_SKIP_FLAG[3][NUM_SKIP_FLAG_CTX] = +{ + { 197, 185, 201, }, + { 197, 185, 201, }, + { CNU, CNU, CNU, }, +}; + +static const uint8_t INIT_MERGE_FLAG_EXT[3][NUM_MERGE_FLAG_EXT_CTX] = +{ + { 154, }, + { 110, }, + { CNU, }, +}; + +static const uint8_t INIT_MERGE_IDX_EXT[3][NUM_MERGE_IDX_EXT_CTX] = +{ + { 137, }, + { 122, }, + { CNU, }, +}; + +static const uint8_t INIT_PART_SIZE[3][NUM_PART_SIZE_CTX] = +{ + { 154, 139, 154, 154 }, + { 154, 139, 154, 154 }, + { 184, CNU, CNU, CNU }, +}; + +static const uint8_t INIT_PRED_MODE[3][NUM_PRED_MODE_CTX] = +{ + { 134, }, + { 149, }, + { CNU, }, +}; + +static const uint8_t INIT_INTRA_PRED_MODE[3][NUM_ADI_CTX] = +{ + { 183, }, + { 154, }, + { 184, }, +}; + +static const uint8_t INIT_CHROMA_PRED_MODE[3][NUM_CHROMA_PRED_CTX] = +{ + { 152, 139, }, + { 152, 139, }, + { 63, 139, }, +}; + +static const uint8_t INIT_INTER_DIR[3][NUM_INTER_DIR_CTX] = +{ + { 95, 79, 63, 31, 31, }, + { 95, 79, 63, 31, 31, }, + { CNU, CNU, CNU, CNU, CNU, }, +}; + +static const uint8_t INIT_MVD[3][NUM_MV_RES_CTX] = +{ + { 169, 198, }, + { 140, 198, }, + { CNU, CNU, }, +}; + +static const uint8_t INIT_REF_PIC[3][NUM_REF_NO_CTX] = +{ + { 153, 153 }, + { 153, 153 }, + { CNU, CNU }, +}; + +static const uint8_t INIT_DQP[3][NUM_DELTA_QP_CTX] = +{ + { 154, 154, 154, }, + { 154, 154, 154, }, + { 154, 154, 154, }, +}; + +static const uint8_t INIT_QT_CBF[3][NUM_QT_CBF_CTX] = +{ + { 153, 111, 149, 92, 167, 154, 154 }, + { 153, 111, 149, 107, 167, 154, 154 }, + { 111, 141, 94, 138, 182, 154, 154 }, +}; + +static const uint8_t INIT_QT_ROOT_CBF[3][NUM_QT_ROOT_CBF_CTX] = +{ + { 79, }, + { 79, }, + { CNU, }, +}; + +static const uint8_t INIT_LAST[3][NUM_CTX_LAST_FLAG_XY] = +{ + { 125, 110, 124, 110, 95, 94, 125, 111, 111, 79, 125, 126, 111, 111, 79, + 108, 123, 93 }, + { 125, 110, 94, 110, 95, 79, 125, 111, 110, 78, 110, 111, 111, 95, 94, + 108, 123, 108 }, + { 110, 110, 124, 125, 140, 153, 125, 127, 140, 109, 111, 143, 127, 111, 79, + 108, 123, 63 }, +}; + +static const uint8_t INIT_SIG_CG_FLAG[3][2 * NUM_SIG_CG_FLAG_CTX] = +{ + { 121, 140, + 61, 154, }, + { 121, 140, + 61, 154, }, + { 91, 171, + 134, 141, }, +}; + +static const uint8_t INIT_SIG_FLAG[3][NUM_SIG_FLAG_CTX] = +{ + { 170, 154, 139, 153, 139, 123, 123, 63, 124, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 170, 153, 138, 138, 122, 121, 122, 121, 167, 151, 183, 140, 151, 183, 140, }, + { 155, 154, 139, 153, 139, 123, 123, 63, 153, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 166, 183, 140, 136, 153, 154, 170, 153, 123, 123, 107, 121, 107, 121, 167, 151, 183, 140, 151, 183, 140, }, + { 111, 111, 125, 110, 110, 94, 124, 108, 124, 107, 125, 141, 179, 153, 125, 107, 125, 141, 179, 153, 125, 107, 125, 141, 179, 153, 125, 140, 139, 182, 182, 152, 136, 152, 136, 153, 136, 139, 111, 136, 139, 111, }, +}; + +static const uint8_t INIT_ONE_FLAG[3][NUM_ONE_FLAG_CTX] = +{ + { 154, 196, 167, 167, 154, 152, 167, 182, 182, 134, 149, 136, 153, 121, 136, 122, 169, 208, 166, 167, 154, 152, 167, 182, }, + { 154, 196, 196, 167, 154, 152, 167, 182, 182, 134, 149, 136, 153, 121, 136, 
137, 169, 194, 166, 167, 154, 167, 137, 182, }, + { 140, 92, 137, 138, 140, 152, 138, 139, 153, 74, 149, 92, 139, 107, 122, 152, 140, 179, 166, 182, 140, 227, 122, 197, }, +}; + +static const uint8_t INIT_ABS_FLAG[3][NUM_ABS_FLAG_CTX] = +{ + { 107, 167, 91, 107, 107, 167, }, + { 107, 167, 91, 122, 107, 167, }, + { 138, 153, 136, 167, 152, 152, }, +}; + +static const uint8_t INIT_MVP_IDX[3][NUM_MVP_IDX_CTX] = +{ + { 168 }, + { 168 }, + { CNU }, +}; + +static const uint8_t INIT_SAO_MERGE_FLAG[3][NUM_SAO_MERGE_FLAG_CTX] = +{ + { 153, }, + { 153, }, + { 153, }, +}; + +static const uint8_t INIT_SAO_TYPE_IDX[3][NUM_SAO_TYPE_IDX_CTX] = +{ + { 160, }, + { 185, }, + { 200, }, +}; + +static const uint8_t INIT_TRANS_SUBDIV_FLAG[3][NUM_TRANS_SUBDIV_FLAG_CTX] = +{ + { 224, 167, 122, }, + { 124, 138, 94, }, + { 153, 138, 138, }, +}; + +static const uint8_t INIT_TRANSFORMSKIP_FLAG[3][2 * NUM_TRANSFORMSKIP_FLAG_CTX] = +{ + { 139, 139 }, + { 139, 139 }, + { 139, 139 }, +}; + Entropy::Entropy() { markValid(); @@ -306,7 +489,7 @@ { for (int sizeId = 0; sizeId < ScalingList::NUM_SIZES; sizeId++) { - for (int listId = 0; listId < ScalingList::NUM_LISTS; listId++) + for (int listId = 0; listId < ScalingList::NUM_LISTS; listId += (sizeId == 3) ? 3 : 1) { int predList = scalingList.checkPredMode(sizeId, listId); WRITE_FLAG(predList < 0, "scaling_list_pred_mode_flag"); @@ -334,12 +517,7 @@ for (int i = 0; i < coefNum; i++) { data = src[scan[i]] - nextCoef; - nextCoef = src[scan[i]]; - if (data > 127) - data = data - 256; - if (data < -128) - data = data + 256; - + nextCoef = (nextCoef + data + 256) % 256; WRITE_SVLC(data, "scaling_list_delta_coef"); } } @@ -726,16 +904,12 @@ bool bSmallChroma = (log2CurSize - hChromaShift) < 2; if (!curDepth || !bSmallChroma) { - if (!curDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_U, curDepth - 1)) + uint32_t parentIdx = absPartIdx & (0xFF << (log2CurSize + 1 - LOG2_UNIT_SIZE) * 2); + if (!curDepth || cu.getCbf(parentIdx, TEXT_CHROMA_U, curDepth - 1)) codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_U, curDepth, !subdiv); - if (!curDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_V, curDepth - 1)) + if (!curDepth || cu.getCbf(parentIdx, TEXT_CHROMA_V, curDepth - 1)) codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_V, curDepth, !subdiv); } - else - { - X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_U, curDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_U, curDepth - 1), "chroma xform size match failure\n"); - X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_V, curDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_V, curDepth - 1), "chroma xform size match failure\n"); - } if (subdiv) { @@ -758,7 +932,7 @@ X265_CHECK(cu.getCbf(absPartIdxC, TEXT_LUMA, 0), "CBF should have been set\n"); } else - codeQtCbfLuma(cu, absPartIdx, curDepth); + codeQtCbfLuma(cu.getCbf(absPartIdx, TEXT_LUMA, curDepth), curDepth); uint32_t cbfY = cu.getCbf(absPartIdx, TEXT_LUMA, curDepth); uint32_t cbfU = cu.getCbf(absPartIdxC, TEXT_CHROMA_U, curDepth); @@ -879,7 +1053,7 @@ X265_CHECK(cu.getCbf(absPartIdx, TEXT_LUMA, 0), "CBF should have been set\n"); } else - codeQtCbfLuma(cu, absPartIdx, curDepth); + codeQtCbfLuma(cu.getCbf(absPartIdx, TEXT_LUMA, curDepth), curDepth); uint32_t cbfY = cu.getCbf(absPartIdx, TEXT_LUMA, curDepth); @@ -1005,10 +1179,10 @@ enum { OFFSET_THRESH = 1 << X265_MIN(X265_DEPTH - 5, 5) }; if (typeIdx == SAO_BO) { - for (int i = 0; i < SAO_BO_LEN; i++) + for (int i = 0; i < SAO_NUM_OFFSET; i++) codeSaoMaxUvlc(abs(ctuParam.offset[i]), OFFSET_THRESH - 1); - for (int i = 0; i < SAO_BO_LEN; i++) + for (int i = 0; i < 
SAO_NUM_OFFSET; i++) if (ctuParam.offset[i] != 0) encodeBinEP(ctuParam.offset[i] < 0); @@ -1026,6 +1200,44 @@ } } +void Entropy::codeSaoOffsetEO(int *offset, int typeIdx, int plane) +{ + if (plane != 2) + { + encodeBin(1, m_contextState[OFF_SAO_TYPE_IDX_CTX]); + encodeBinEP(1); + } + + enum { OFFSET_THRESH = 1 << X265_MIN(X265_DEPTH - 5, 5) }; + + codeSaoMaxUvlc(offset[0], OFFSET_THRESH - 1); + codeSaoMaxUvlc(offset[1], OFFSET_THRESH - 1); + codeSaoMaxUvlc(-offset[2], OFFSET_THRESH - 1); + codeSaoMaxUvlc(-offset[3], OFFSET_THRESH - 1); + if (plane != 2) + encodeBinsEP((uint32_t)(typeIdx), 2); +} + +void Entropy::codeSaoOffsetBO(int *offset, int bandPos, int plane) +{ + if (plane != 2) + { + encodeBin(1, m_contextState[OFF_SAO_TYPE_IDX_CTX]); + encodeBinEP(0); + } + + enum { OFFSET_THRESH = 1 << X265_MIN(X265_DEPTH - 5, 5) }; + + for (int i = 0; i < SAO_NUM_OFFSET; i++) + codeSaoMaxUvlc(abs(offset[i]), OFFSET_THRESH - 1); + + for (int i = 0; i < SAO_NUM_OFFSET; i++) + if (offset[i] != 0) + encodeBinEP(offset[i] < 0); + + encodeBinsEP(bandPos, 5); +} + /** initialize context model with respect to QP and initialization value */ uint8_t sbacInit(int qp, int initValue) {
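Two details worth noting in the entropy changes: the scaling-list loop now steps listId by 3 at sizeId 3 because only two of the six lists are coded at 32x32, and the explicit ±128 clamp on scaling_list_delta_coef was folded into a single mod-256 accumulator. The round trip behind that second change, as a hedged standalone sketch:

// Encoder side: signal the coefficient difference wrapped into [-128, 127]
// (scaling_list_delta_coef); this is exactly what the removed clamp computed.
static int deltaCoef(int cur, int prev)
{
    int d = cur - prev;
    if (d > 127)  d -= 256;
    if (d < -128) d += 256;
    return d;
}

// Predictor side, matching the new one-liner: accumulate modulo 256.
static int nextCoef(int prev, int delta)
{
    return (prev + delta + 256) % 256;
}

For any prev and cur in [0, 255], nextCoef(prev, deltaCoef(cur, prev)) == cur.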
View file
x265_1.9.tar.gz/source/encoder/entropy.h -> x265_2.0.tar.gz/source/encoder/entropy.h
Changed
@@ -162,13 +162,13 @@ void codePartSize(const CUData& cu, uint32_t absPartIdx, uint32_t depth); void codePredInfo(const CUData& cu, uint32_t absPartIdx); - inline void codeQtCbfLuma(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth) { codeQtCbfLuma(cu.getCbf(absPartIdx, TEXT_LUMA, tuDepth), tuDepth); } void codeQtCbfChroma(const CUData& cu, uint32_t absPartIdx, TextType ttype, uint32_t tuDepth, bool lowestLevel); void codeCoeff(const CUData& cu, uint32_t absPartIdx, bool& bCodeDQP, const uint32_t depthRange[2]); void codeCoeffNxN(const CUData& cu, const coeff_t* coef, uint32_t absPartIdx, uint32_t log2TrSize, TextType ttype); inline void codeSaoMerge(uint32_t code) { encodeBin(code, m_contextState[OFF_SAO_MERGE_FLAG_CTX]); } + inline void codeSaoType(uint32_t code) { encodeBin(code, m_contextState[OFF_SAO_TYPE_IDX_CTX]); } inline void codeMVPIdx(uint32_t symbol) { encodeBin(symbol, m_contextState[OFF_MVP_IDX_CTX]); } inline void codeMergeFlag(const CUData& cu, uint32_t absPartIdx) { encodeBin(cu.m_mergeFlag[absPartIdx], m_contextState[OFF_MERGE_FLAG_EXT_CTX]); } inline void codeSkipFlag(const CUData& cu, uint32_t absPartIdx) { encodeBin(cu.isSkipped(absPartIdx), m_contextState[OFF_SKIP_FLAG_CTX + cu.getCtxSkipFlag(absPartIdx)]); } @@ -182,6 +182,8 @@ inline void codeTransformSkipFlags(uint32_t transformSkip, TextType ttype) { encodeBin(transformSkip, m_contextState[OFF_TRANSFORMSKIP_FLAG_CTX + (ttype ? NUM_TRANSFORMSKIP_FLAG_CTX : 0)]); } void codeDeltaQP(const CUData& cu, uint32_t absPartIdx); void codeSaoOffset(const SaoCtuParam& ctuParam, int plane); + void codeSaoOffsetEO(int *offset, int typeIdx, int plane); + void codeSaoOffsetBO(int *offset, int bandPos, int plane); /* RDO functions */ void estBit(EstBitsSbac& estBitsSbac, uint32_t log2TrSize, bool bIsLuma) const;
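The matching codeSaoOffsetEO() implementation negates offsets 2 and 3 before coding because edge-offset SAO constrains the first two category offsets to be non-negative and the last two non-positive; only magnitudes go into the bitstream and the sign is implied by the category. A small illustrative helper (hypothetical, not the x265 API):

// Pack the four EO offsets into the magnitudes that are actually signalled.
static void packEoMagnitudes(const int offset[4], unsigned mag[4])
{
    mag[0] = (unsigned)offset[0];     // first two categories: offset >= 0
    mag[1] = (unsigned)offset[1];
    mag[2] = (unsigned)(-offset[2]);  // last two categories: offset <= 0
    mag[3] = (unsigned)(-offset[3]);
}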
View file
x265_1.9.tar.gz/source/encoder/frameencoder.cpp -> x265_2.0.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -41,6 +41,7 @@ FrameEncoder::FrameEncoder() { m_prevOutputTime = x265_mdate(); + m_reconfigure = false; m_isFrameEncoder = true; m_threadActive = true; m_slicetypeWaitTime = 0; @@ -104,6 +105,7 @@ m_param = top->m_param; m_numRows = numRows; m_numCols = numCols; + m_reconfigure = false; m_filterRowDelay = ((m_param->bEnableSAO && m_param->bSaoNonDeblocked) || (!m_param->bEnableLoopFilter && m_param->bEnableSAO)) ? 2 : (m_param->bEnableSAO || m_param->bEnableLoopFilter ? 1 : 0); @@ -213,7 +215,6 @@ { m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime; m_frame = curFrame; - m_param = curFrame->m_param; m_sliceType = curFrame->m_lowres.sliceType; curFrame->m_encData->m_frameEncoderID = m_jpId; curFrame->m_encData->m_jobProvider = this; @@ -333,18 +334,40 @@ // Weighted Prediction parameters estimation. bool bUseWeightP = slice->m_sliceType == P_SLICE && slice->m_pps->bUseWeightPred; bool bUseWeightB = slice->m_sliceType == B_SLICE && slice->m_pps->bUseWeightedBiPred; + + WeightParam* reuseWP = NULL; + if (m_param->analysisMode && (bUseWeightP || bUseWeightB)) + reuseWP = ((analysis_inter_data*)m_frame->m_analysisData.interData)->wt; + if (bUseWeightP || bUseWeightB) { #if DETAILED_CU_STATS m_cuStats.countWeightAnalyze++; ScopedElapsedTime time(m_cuStats.weightAnalyzeTime); #endif - WeightAnalysis wa(*this); - if (m_pool && wa.tryBondPeers(*this, 1)) - /* use an idle worker for weight analysis */ - wa.waitForExit(); + if (m_param->analysisMode == X265_ANALYSIS_LOAD) + { + for (int list = 0; list < slice->isInterB() + 1; list++) + { + for (int plane = 0; plane < (m_param->internalCsp != X265_CSP_I400 ? 3 : 1); plane++) + { + for (int ref = 1; ref < slice->m_numRefIdx[list]; ref++) + SET_WEIGHT(slice->m_weightPredTable[list][ref][plane], false, 1 << reuseWP->log2WeightDenom, reuseWP->log2WeightDenom, 0); + slice->m_weightPredTable[list][0][plane] = *(reuseWP++); + } + } + } else - weightAnalyse(*slice, *m_frame, *m_param); + { + WeightAnalysis wa(*this); + if (m_pool && wa.tryBondPeers(*this, 1)) + /* use an idle worker for weight analysis */ + wa.waitForExit(); + else + weightAnalyse(*slice, *m_frame, *m_param); + + } + } else slice->disableWeights(); @@ -361,6 +384,12 @@ slice->m_refReconPicList[l][ref] = slice->m_refFrameList[l][ref]->m_reconPic; m_mref[l][ref].init(slice->m_refReconPicList[l][ref], w, *m_param); } + if (m_param->analysisMode == X265_ANALYSIS_SAVE && (bUseWeightP || bUseWeightB)) + { + for (int i = 0; i < (m_param->internalCsp != X265_CSP_I400 ? 3 : 1); i++) + *(reuseWP++) = slice->m_weightPredTable[l][0][i]; + } + } int numTLD; @@ -371,6 +400,7 @@ /* Get the QP for this frame from rate control. 
This call may block until * frames ahead of it in encode order have called rateControlEnd() */ + m_rce.encodeOrder = m_frame->m_encodeOrder; int qp = m_top->m_rateControl->rateControlStart(m_frame, &m_rce, m_top); m_rce.newQp = qp; @@ -409,7 +439,7 @@ m_initSliceContext.resetEntropy(*slice); - m_frameFilter.start(m_frame, m_initSliceContext, qp); + m_frameFilter.start(m_frame, m_initSliceContext); /* ensure all rows are blocked prior to initializing row CTU counters */ WaveFront::clearEnabledRowMask(); @@ -969,44 +999,48 @@ /* Deblock with idle threading */ if (m_param->bEnableLoopFilter | m_param->bEnableSAO) { - // TODO: Multiple Threading - // Delay ONE row to avoid Intra Prediction Conflict - if (m_pool && (row >= 1)) + // NOTE: in VBV mode, we may reencode anytime, so we can't do Deblock stage-Horizon and SAO + if (!bIsVbv) { - // Waitting last threading finish - m_frameFilter.m_parallelFilter[row - 1].waitForExit(); + // TODO: Multiple Threading + // Delay ONE row to avoid Intra Prediction Conflict + if (m_pool && (row >= 1)) + { + // Waitting last threading finish + m_frameFilter.m_parallelFilter[row - 1].waitForExit(); - // Processing new group - int allowCol = col; + // Processing new group + int allowCol = col; - // avoid race condition on last column - if (row >= 2) - { - allowCol = X265_MIN(((col == numCols - 1) ? m_frameFilter.m_parallelFilter[row - 2].m_lastDeblocked.get() - : m_frameFilter.m_parallelFilter[row - 2].m_lastCol.get()), (int)col); + // avoid race condition on last column + if (row >= 2) + { + allowCol = X265_MIN(((col == numCols - 1) ? m_frameFilter.m_parallelFilter[row - 2].m_lastDeblocked.get() + : m_frameFilter.m_parallelFilter[row - 2].m_lastCol.get()), (int)col); + } + m_frameFilter.m_parallelFilter[row - 1].m_allowedCol.set(allowCol); + m_frameFilter.m_parallelFilter[row - 1].tryBondPeers(*this, 1); } - m_frameFilter.m_parallelFilter[row - 1].m_allowedCol.set(allowCol); - m_frameFilter.m_parallelFilter[row - 1].tryBondPeers(*this, 1); - } - // Last Row may start early - if (m_pool && (row == m_numRows - 1)) - { - // Waiting for the last thread to finish - m_frameFilter.m_parallelFilter[row].waitForExit(); + // Last Row may start early + if (m_pool && (row == m_numRows - 1)) + { + // Waiting for the last thread to finish + m_frameFilter.m_parallelFilter[row].waitForExit(); - // Deblocking last row - int allowCol = col; + // Deblocking last row + int allowCol = col; - // avoid race condition on last column - if (row >= 2) - { - allowCol = X265_MIN(((col == numCols - 1) ? m_frameFilter.m_parallelFilter[row - 1].m_lastDeblocked.get() - : m_frameFilter.m_parallelFilter[row - 1].m_lastCol.get()), (int)col); + // avoid race condition on last column + if (row >= 2) + { + allowCol = X265_MIN(((col == numCols - 1) ? 
m_frameFilter.m_parallelFilter[row - 1].m_lastDeblocked.get() + : m_frameFilter.m_parallelFilter[row - 1].m_lastCol.get()), (int)col); + } + m_frameFilter.m_parallelFilter[row].m_allowedCol.set(allowCol); + m_frameFilter.m_parallelFilter[row].tryBondPeers(*this, 1); } - m_frameFilter.m_parallelFilter[row].m_allowedCol.set(allowCol); - m_frameFilter.m_parallelFilter[row].tryBondPeers(*this, 1); - } + } // end of !bIsVbv } // Both Loopfilter and SAO Disabled else @@ -1179,7 +1213,9 @@ uint32_t rowCount = 0; if (m_param->rc.rateControlMode == X265_RC_ABR || bIsVbv) { - if ((uint32_t)m_rce.encodeOrder <= 2 * (m_param->fpsNum / m_param->fpsDenom)) + if (!m_rce.encodeOrder) + rowCount = m_numRows - 1; + else if ((uint32_t)m_rce.encodeOrder <= 2 * (m_param->fpsNum / m_param->fpsDenom)) rowCount = X265_MIN((m_numRows + 1) / 2, m_numRows - 1); else rowCount = X265_MIN(m_refLagRows, m_numRows - 1);
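In analysis-load mode the weighted-prediction tables are restored from the saved WeightParam array rather than re-derived by weightAnalyse(): only reference 0 of each list/plane was stored, so the higher references are reset to the identity weight with the stored denominator. A simplified sketch with hypothetical field and table names:

struct WeightParam { bool bPresentFlag; int inputWeight; int log2WeightDenom; int inputOffset; };

static void loadWeights(WeightParam table[2][16][3], const WeightParam* saved,
                        int numLists, const int numRefs[2], int numPlanes)
{
    for (int l = 0; l < numLists; l++)
        for (int p = 0; p < numPlanes; p++)
        {
            for (int r = 1; r < numRefs[l]; r++)     // identity weight: w = 1 << denom
                table[l][r][p] = WeightParam{ false, 1 << saved->log2WeightDenom,
                                              saved->log2WeightDenom, 0 };
            table[l][0][p] = *saved++;               // ref 0 gets the stored weight
        }
}

The save path mirrors this by appending slice->m_weightPredTable[l][0][i] for each list and plane once the reference lists are set up.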
View file
x265_1.9.tar.gz/source/encoder/frameencoder.h -> x265_2.0.tar.gz/source/encoder/frameencoder.h
Changed
@@ -129,7 +129,7 @@ Event m_done; Event m_completionEvent; int m_localTldIdx; - + bool m_reconfigure; /* reconfigure in progress */ volatile bool m_threadActive; volatile bool m_bAllRowsStop; volatile int m_completionCount;
View file
x265_1.9.tar.gz/source/encoder/framefilter.cpp -> x265_2.0.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -54,7 +54,7 @@ void FrameFilter::init(Encoder *top, FrameEncoder *frame, int numRows, uint32_t numCols) { - m_param = top->m_param; + m_param = frame->m_param; m_frameEncoder = frame; m_numRows = numRows; m_numCols = numCols; @@ -103,7 +103,7 @@ } -void FrameFilter::start(Frame *frame, Entropy& initState, int qp) +void FrameFilter::start(Frame *frame, Entropy& initState) { m_frame = frame; @@ -113,7 +113,7 @@ for(int row = 0; row < m_numRows; row++) { if (m_param->bEnableSAO) - m_parallelFilter[row].m_sao.startSlice(frame, initState, qp); + m_parallelFilter[row].m_sao.startSlice(frame, initState); m_parallelFilter[row].m_lastCol.set(0); m_parallelFilter[row].m_allowedCol.set(0); @@ -198,14 +198,14 @@ } } -void FrameFilter::ParallelFilter::processSaoUnitCu(SAOParam *saoParam, int col) +void FrameFilter::ParallelFilter::processSaoCTU(SAOParam *saoParam, int col) { // TODO: apply SAO on CU and copy back soon, is it necessary? if (saoParam->bSaoFlag[0]) - m_sao.processSaoUnitCuLuma(saoParam->ctuParam[0], m_row, col); + m_sao.generateLumaOffsets(saoParam->ctuParam[0], m_row, col); if (saoParam->bSaoFlag[1]) - m_sao.processSaoUnitCuChroma(saoParam->ctuParam, m_row, col); + m_sao.generateChromaOffsets(saoParam->ctuParam, m_row, col); if (m_encData->m_slice->m_pps->bTransquantBypassEnabled) { @@ -320,11 +320,14 @@ const uint32_t* ctuGeomMap = m_frameFilter->m_frameEncoder->m_ctuGeomMap; PicYuv* reconPic = m_encData->m_reconPic; const int colStart = m_lastCol.get(); - // TODO: Waiting previous row finish or simple clip on it? - const int colEnd = m_allowedCol.get(); const int numCols = m_frameFilter->m_numCols; + // TODO: Waiting previous row finish or simple clip on it? + int colEnd = m_allowedCol.get(); // Avoid threading conflict + if (m_prevRow && colEnd > m_prevRow->m_lastDeblocked.get()) + colEnd = m_prevRow->m_lastDeblocked.get(); + if (colStart >= colEnd) return; @@ -368,7 +371,7 @@ if (m_row >= 1 && col >= 3) { // Must delay 1 row to avoid thread data race conflict - m_prevRow->processSaoUnitCu(saoParam, col - 3); + m_prevRow->processSaoCTU(saoParam, col - 3); m_prevRow->processPostCu(col - 3); } } @@ -409,19 +412,19 @@ // Process Previous Rows SAO CU if (m_row >= 1 && numCols >= 3) { - m_prevRow->processSaoUnitCu(saoParam, numCols - 3); + m_prevRow->processSaoCTU(saoParam, numCols - 3); m_prevRow->processPostCu(numCols - 3); } if (m_row >= 1 && numCols >= 2) { - m_prevRow->processSaoUnitCu(saoParam, numCols - 2); + m_prevRow->processSaoCTU(saoParam, numCols - 2); m_prevRow->processPostCu(numCols - 2); } if (m_row >= 1 && numCols >= 1) { - m_prevRow->processSaoUnitCu(saoParam, numCols - 1); + m_prevRow->processSaoCTU(saoParam, numCols - 1); m_prevRow->processPostCu(numCols - 1); } @@ -475,7 +478,7 @@ for(int col = 0; col < m_numCols; col++) { // NOTE: must use processSaoUnitCu(), it include TQBypass logic - m_parallelFilter[row].processSaoUnitCu(saoParam, col); + m_parallelFilter[row].processSaoCTU(saoParam, col); } } @@ -550,10 +553,10 @@ pixel *fenc = m_frame->m_fencPic->m_picOrg[0]; intptr_t stride1 = reconPic->m_stride; intptr_t stride2 = m_frame->m_fencPic->m_stride; - uint32_t bEnd = ((row + 1) == (this->m_numRows - 1)); + uint32_t bEnd = ((row) == (this->m_numRows - 1)); uint32_t bStart = (row == 0); uint32_t minPixY = row * g_maxCUSize - 4 * !bStart; - uint32_t maxPixY = (row + 1) * g_maxCUSize - 4 * !bEnd; + uint32_t maxPixY = X265_MIN((row + 1) * g_maxCUSize - 4 * !bEnd, (uint32_t)m_param->sourceHeight); uint32_t ssim_cnt; x265_emms(); @@ -723,7 +726,7 @@ { std::swap(sum0, 
sum1); for (uint32_t x = 0; x < width; x += 2) - primitives.ssim_4x4x2_core(&pix1[(4 * x + (z * stride1))], stride1, &pix2[(4 * x + (z * stride2))], stride2, &sum0[x]); + primitives.ssim_4x4x2_core(&pix1[4 * (x + (z * stride1))], stride1, &pix2[4 * (x + (z * stride2))], stride2, &sum0[x]); } for (uint32_t x = 0; x < width - 1; x += 4)
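The ssim_4x4x2_core call contained an addressing bug: each z step covers four pixel rows, so the row term needs the same factor of 4 as the column term. A quick check of the before/after offsets, using an example stride:

#include <cstdint>

int main()
{
    const intptr_t stride = 64;               // example row stride in pixels
    const int x = 1, z = 1;                   // second 4-wide column, second 4-row group
    intptr_t offOld = 4 * x + z * stride;     // = 68: z advanced only one row
    intptr_t offNew = 4 * (x + z * stride);   // = 260: four rows per z step (the fix)
    return (int)(offNew - offOld);            // 192 = 3 skipped rows * stride
}

The neighbouring change clamps maxPixY to sourceHeight, so the SSIM window of the last row no longer extends past the true picture height into padding.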
View file
x265_1.9.tar.gz/source/encoder/framefilter.h -> x265_2.0.tar.gz/source/encoder/framefilter.h
Changed
@@ -90,7 +90,7 @@ void processTasks(int workerThreadId); // Apply SAO on a CU in current row - void processSaoUnitCu(SAOParam *saoParam, int col); + void processSaoCTU(SAOParam *saoParam, int col); // Copy and Save SAO reference pixels for SAO Rdo decide void copySaoAboveRef(PicYuv* reconPic, uint32_t cuAddr, int col); @@ -127,7 +127,7 @@ void init(Encoder *top, FrameEncoder *frame, int numRows, uint32_t numCols); void destroy(); - void start(Frame *pic, Entropy& initState, int qp); + void start(Frame *pic, Entropy& initState); void processRow(int row); void processPostRow(int row);
View file
x265_1.9.tar.gz/source/encoder/level.cpp -> x265_2.0.tar.gz/source/encoder/level.cpp
Changed
@@ -131,6 +131,14 @@ vps.ptl.levelIdc = Level::LEVEL8_5; vps.ptl.tierFlag = Level::MAIN; } + else if (param.uhdBluray) + { + i = 8; + vps.ptl.levelIdc = levels[i].levelEnum; + vps.ptl.tierFlag = Level::HIGH; + vps.ptl.minCrForLevel = levels[i].minCompressionRatio; + vps.ptl.maxLumaSrForLevel = levels[i].maxLumaSamplesPerSecond; + } else for (i = 0; i < NumLevels; i++) { if (lumaSamples > levels[i].maxLumaSamples) @@ -145,8 +153,10 @@ continue; else if (param.sourceHeight > sqrt(levels[i].maxLumaSamples * 8.0f)) continue; - + else if (param.levelIdc && param.levelIdc != levels[i].levelIdc) + continue; uint32_t maxDpbSize = MaxDpbPicBuf; + if (lumaSamples <= (levels[i].maxLumaSamples >> 2)) maxDpbSize = X265_MIN(4 * MaxDpbPicBuf, 16); else if (lumaSamples <= (levels[i].maxLumaSamples >> 1)) @@ -188,7 +198,7 @@ CHECK_RANGE((uint32_t)param.rc.vbvBufferSize, levels[i].maxCpbSizeMain, levels[i].maxCpbSizeHigh)) { /* The bitrate or buffer size are out of range for Main tier, but in - * range for High tier. If the user requested High tier then give + * range for High tier. If the user allowed High tier then give * them High tier at this level. Otherwise allow the loop to * progress to the Main tier of the next level */ if (param.bHighTier) @@ -279,7 +289,7 @@ bool enforceLevel(x265_param& param, VPS& vps) { vps.numReorderPics = (param.bBPyramid && param.bframes > 1) ? 2 : !!param.bframes; - vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 2, (uint32_t)param.maxNumReferences) + vps.numReorderPics); + vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 2, (uint32_t)param.maxNumReferences) + 1); /* no level specified by user, just auto-detect from the configuration */ if (param.levelIdc <= 0) @@ -290,17 +300,14 @@ level++; if (levels[level].levelIdc != param.levelIdc) { - x265_log(¶m, X265_LOG_WARNING, "specified level %d does not exist\n", param.levelIdc); + x265_log(¶m, X265_LOG_ERROR, "specified level %d does not exist\n", param.levelIdc); return false; } LevelSpec& l = levels[level]; - bool highTier = !!param.bHighTier; - if (highTier && l.maxBitrateHigh == MAX_UINT) - { - highTier = false; - x265_log(¶m, X265_LOG_WARNING, "Level %s has no High tier, using Main tier\n", l.name); - } + + //highTier is allowed for this level and has not been explicitly disabled. This does not mean it is the final chosen tier + bool allowHighTier = l.maxBitrateHigh < MAX_UINT && param.bHighTier; uint32_t lumaSamples = param.sourceWidth * param.sourceHeight; uint32_t samplesPerSec = (uint32_t)(lumaSamples * ((double)param.fpsNum / param.fpsDenom)); @@ -313,47 +320,51 @@ ok = false; if (!ok) { - x265_log(¶m, X265_LOG_WARNING, "picture dimensions are out of range for specified level\n"); + x265_log(¶m, X265_LOG_ERROR, "picture dimensions are out of range for specified level\n"); return false; } else if (samplesPerSec > l.maxLumaSamplesPerSecond) { - x265_log(¶m, X265_LOG_WARNING, "frame rate is out of range for specified level\n"); + x265_log(¶m, X265_LOG_ERROR, "frame rate is out of range for specified level\n"); return false; } - if ((uint32_t)param.rc.vbvMaxBitrate > (highTier ? l.maxBitrateHigh : l.maxBitrateMain)) + /* Adjustments of Bitrate, VBV buffer size, refs will be triggered only if specified params do not fit + * within the max limits of that level (high tier if allowed, main otherwise) + */ + + if ((uint32_t)param.rc.vbvMaxBitrate > (allowHighTier ? l.maxBitrateHigh : l.maxBitrateMain)) { - param.rc.vbvMaxBitrate = highTier ? 
l.maxBitrateHigh : l.maxBitrateMain; - x265_log(¶m, X265_LOG_INFO, "lowering VBV max bitrate to %dKbps\n", param.rc.vbvMaxBitrate); + param.rc.vbvMaxBitrate = allowHighTier ? l.maxBitrateHigh : l.maxBitrateMain; + x265_log(¶m, X265_LOG_WARNING, "lowering VBV max bitrate to %dKbps\n", param.rc.vbvMaxBitrate); } - if ((uint32_t)param.rc.vbvBufferSize > (highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain)) + if ((uint32_t)param.rc.vbvBufferSize > (allowHighTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain)) { - param.rc.vbvBufferSize = highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain; - x265_log(¶m, X265_LOG_INFO, "lowering VBV buffer size to %dKb\n", param.rc.vbvBufferSize); + param.rc.vbvBufferSize = allowHighTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain; + x265_log(¶m, X265_LOG_WARNING, "lowering VBV buffer size to %dKb\n", param.rc.vbvBufferSize); } switch (param.rc.rateControlMode) { case X265_RC_ABR: - if ((uint32_t)param.rc.bitrate > (highTier ? l.maxBitrateHigh : l.maxBitrateMain)) + if ((uint32_t)param.rc.bitrate > (allowHighTier ? l.maxBitrateHigh : l.maxBitrateMain)) { - param.rc.bitrate = l.maxBitrateHigh; - x265_log(¶m, X265_LOG_INFO, "lowering target bitrate to High tier limit of %dKbps\n", param.rc.bitrate); + param.rc.bitrate = allowHighTier ? l.maxBitrateHigh : l.maxBitrateMain; + x265_log(¶m, X265_LOG_WARNING, "lowering target bitrate to High tier limit of %dKbps\n", param.rc.bitrate); } break; case X265_RC_CQP: - x265_log(¶m, X265_LOG_WARNING, "Constant QP is inconsistent with specifying a decoder level, no bitrate guarantee is possible.\n"); + x265_log(¶m, X265_LOG_ERROR, "Constant QP is inconsistent with specifying a decoder level, no bitrate guarantee is possible.\n"); return false; case X265_RC_CRF: if (!param.rc.vbvBufferSize || !param.rc.vbvMaxBitrate) { if (!param.rc.vbvMaxBitrate) - param.rc.vbvMaxBitrate = highTier ? l.maxBitrateHigh : l.maxBitrateMain; + param.rc.vbvMaxBitrate = allowHighTier ? l.maxBitrateHigh : l.maxBitrateMain; if (!param.rc.vbvBufferSize) - param.rc.vbvBufferSize = highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain; + param.rc.vbvBufferSize = allowHighTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain; x265_log(¶m, X265_LOG_WARNING, "Specifying a decoder level with constant rate factor rate-control requires\n"); x265_log(¶m, X265_LOG_WARNING, "enabling VBV with vbv-bufsize=%dkb vbv-maxrate=%dkbps. 
VBV outputs are non-deterministic!\n", param.rc.vbvBufferSize, param.rc.vbvMaxBitrate); @@ -368,27 +379,30 @@ /* The value of sps_max_dec_pic_buffering_minus1[ HighestTid ] + 1 shall be less than or equal to MaxDpbSize */ const uint32_t MaxDpbPicBuf = 6; uint32_t maxDpbSize = MaxDpbPicBuf; - if (lumaSamples <= (l.maxLumaSamples >> 2)) - maxDpbSize = X265_MIN(4 * MaxDpbPicBuf, 16); - else if (lumaSamples <= (l.maxLumaSamples >> 1)) - maxDpbSize = X265_MIN(2 * MaxDpbPicBuf, 16); - else if (lumaSamples <= ((3 * l.maxLumaSamples) >> 2)) - maxDpbSize = X265_MIN((4 * MaxDpbPicBuf) / 3, 16); + if (!param.uhdBluray) /* Do not change MaxDpbPicBuf for UHD-Bluray */ + { + if (lumaSamples <= (l.maxLumaSamples >> 2)) + maxDpbSize = X265_MIN(4 * MaxDpbPicBuf, 16); + else if (lumaSamples <= (l.maxLumaSamples >> 1)) + maxDpbSize = X265_MIN(2 * MaxDpbPicBuf, 16); + else if (lumaSamples <= ((3 * l.maxLumaSamples) >> 2)) + maxDpbSize = X265_MIN((4 * MaxDpbPicBuf) / 3, 16); + } int savedRefCount = param.maxNumReferences; while (vps.maxDecPicBuffering > maxDpbSize && param.maxNumReferences > 1) { param.maxNumReferences--; - vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 1, (uint32_t)param.maxNumReferences) + vps.numReorderPics); + vps.maxDecPicBuffering = X265_MIN(MAX_NUM_REF, X265_MAX(vps.numReorderPics + 1, (uint32_t)param.maxNumReferences) + 1); } if (param.maxNumReferences != savedRefCount) - x265_log(¶m, X265_LOG_INFO, "Lowering max references to %d to meet level requirement\n", param.maxNumReferences); + x265_log(¶m, X265_LOG_WARNING, "Lowering max references to %d to meet level requirement\n", param.maxNumReferences); /* For level 5 and higher levels, the value of CtbSizeY shall be equal to 32 or 64 */ if (param.levelIdc >= 50 && param.maxCUSize < 32) { param.maxCUSize = 32; - x265_log(¶m, X265_LOG_INFO, "Levels 5.0 and above require a maximum CTU size of at least 32, using --ctu 32\n"); + x265_log(¶m, X265_LOG_WARNING, "Levels 5.0 and above require a maximum CTU size of at least 32, using --ctu 32\n"); } /* The value of NumPocTotalCurr shall be less than or equal to 8 */ @@ -396,7 +410,7 @@ if (numPocTotalCurr > 8) { param.maxNumReferences = 8 - !!param.bframes; - x265_log(¶m, X265_LOG_INFO, "Lowering max references to %d to meet numPocTotalCurr requirement\n", param.maxNumReferences); + x265_log(¶m, X265_LOG_WARNING, "Lowering max references to %d to meet numPocTotalCurr requirement\n", param.maxNumReferences); } return true;
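The DPB-size rule used above comes from the HEVC level definitions: the further the picture size sits below the level's maxLumaSamples, the more decoded pictures may be buffered, capped at 16. As a standalone sketch (constants as in the diff; the scaling is skipped entirely when uhd-bd pins the DPB):

#include <algorithm>
#include <cstdint>

static uint32_t maxDpbSize(uint32_t lumaSamples, uint32_t maxLumaSamples)
{
    const uint32_t MaxDpbPicBuf = 6;
    if (lumaSamples <= (maxLumaSamples >> 2))
        return std::min(4 * MaxDpbPicBuf, 16u);        // quarter of the level's size
    if (lumaSamples <= (maxLumaSamples >> 1))
        return std::min(2 * MaxDpbPicBuf, 16u);        // half
    if (lumaSamples <= ((3 * maxLumaSamples) >> 2))
        return std::min((4 * MaxDpbPicBuf) / 3, 16u);  // three quarters
    return MaxDpbPicBuf;
}

maxNumReferences is then walked down until vps.maxDecPicBuffering fits under this bound.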
View file
x265_1.9.tar.gz/source/encoder/motion.cpp -> x265_2.0.tar.gz/source/encoder/motion.cpp
Changed
@@ -111,10 +111,8 @@ chromaSatd = NULL; } -void MotionEstimate::init(int method, int refine, int csp) +void MotionEstimate::init(int csp) { - searchMethod = method; - subpelRefine = refine; fencPUYuv.create(FENC_STRIDE, csp); } @@ -162,7 +160,7 @@ } /* Called by lookahead, luma only, no use of PicYuv */ -void MotionEstimate::setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight) +void MotionEstimate::setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int method, const int refine) { partEnum = partitionFromSizes(pwidth, pheight); X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n"); @@ -175,13 +173,17 @@ blockOffset = offset; absPartIdx = ctuAddr = -1; + /* Search params */ + searchMethod = method; + subpelRefine = refine; + /* copy PU block into cache */ primitives.pu[partEnum].copy_pp(fencPUYuv.m_buf[0], FENC_STRIDE, fencY + offset, stride); X265_CHECK(!bChromaSATD, "chroma distortion measurements impossible in this code path\n"); } /* Called by Search::predInterSearch() or --pme equivalent, chroma residual might be considered */ -void MotionEstimate::setSourcePU(const Yuv& srcFencYuv, int _ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight) +void MotionEstimate::setSourcePU(const Yuv& srcFencYuv, int _ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight, const int method, const int refine, bool bChroma) { partEnum = partitionFromSizes(pwidth, pheight); X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n"); @@ -192,9 +194,13 @@ chromaSatd = primitives.chroma[fencPUYuv.m_csp].pu[partEnum].satd; + /* Set search characteristics */ + searchMethod = method; + subpelRefine = refine; + /* Enable chroma residual cost if subpelRefine level is greater than 2 and chroma block size * is an even multiple of 4x4 pixels (indicated by non-null chromaSatd pointer) */ - bChromaSATD = subpelRefine > 2 && chromaSatd && (srcFencYuv.m_csp != X265_CSP_I400); + bChromaSATD = subpelRefine > 2 && chromaSatd && (srcFencYuv.m_csp != X265_CSP_I400 && bChroma); X265_CHECK(!(bChromaSATD && !workload[subpelRefine].hpel_satd), "Chroma SATD cannot be used with SAD hpel\n"); ctuAddr = _ctuAddr; @@ -1174,15 +1180,17 @@ int MotionEstimate::subpelCompare(ReferencePlanes *ref, const MV& qmv, pixelcmp_t cmp) { intptr_t refStride = ref->lumaStride; - pixel *fref = ref->fpelPlane[0] + blockOffset + (qmv.x >> 2) + (qmv.y >> 2) * refStride; + const pixel* fref = ref->fpelPlane[0] + blockOffset + (qmv.x >> 2) + (qmv.y >> 2) * refStride; int xFrac = qmv.x & 0x3; int yFrac = qmv.y & 0x3; int cost; - intptr_t lclStride = fencPUYuv.m_size; - X265_CHECK(lclStride == FENC_STRIDE, "fenc buffer is assumed to have FENC_STRIDE by sad_x3 and sad_x4\n"); + const intptr_t fencStride = FENC_STRIDE; + X265_CHECK(fencPUYuv.m_size == FENC_STRIDE, "fenc buffer is assumed to have FENC_STRIDE by sad_x3 and sad_x4\n"); + ALIGN_VAR_32(pixel, subpelbuf[MAX_CU_SIZE * MAX_CU_SIZE]); + if (!(yFrac | xFrac)) - cost = cmp(fencPUYuv.m_buf[0], lclStride, fref, refStride); + cost = cmp(fencPUYuv.m_buf[0], fencStride, fref, refStride); else { /* we are taking a short-cut here if the reference is weighted. To be @@ -1190,15 +1198,13 @@ * the final 16bit values prior to rounding and down shifting. Instead we * are simply interpolating the weighted full-pel pixels. 
Not 100% * accurate but good enough for fast qpel ME */ - ALIGN_VAR_32(pixel, subpelbuf[64 * 64]); if (!yFrac) - primitives.pu[partEnum].luma_hpp(fref, refStride, subpelbuf, lclStride, xFrac); + primitives.pu[partEnum].luma_hpp(fref, refStride, subpelbuf, blockwidth, xFrac); else if (!xFrac) - primitives.pu[partEnum].luma_vpp(fref, refStride, subpelbuf, lclStride, yFrac); + primitives.pu[partEnum].luma_vpp(fref, refStride, subpelbuf, blockwidth, yFrac); else - primitives.pu[partEnum].luma_hvpp(fref, refStride, subpelbuf, lclStride, xFrac, yFrac); - - cost = cmp(fencPUYuv.m_buf[0], lclStride, subpelbuf, lclStride); + primitives.pu[partEnum].luma_hvpp(fref, refStride, subpelbuf, blockwidth, xFrac, yFrac); + cost = cmp(fencPUYuv.m_buf[0], fencStride, subpelbuf, blockwidth); } if (bChromaSATD) @@ -1206,12 +1212,12 @@ int csp = fencPUYuv.m_csp; int hshift = fencPUYuv.m_hChromaShift; int vshift = fencPUYuv.m_vChromaShift; - int shiftHor = (2 + hshift); - int shiftVer = (2 + vshift); - lclStride = fencPUYuv.m_csize; + int mvx = qmv.x << (1 - hshift); + int mvy = qmv.y << (1 - vshift); + intptr_t fencStrideC = fencPUYuv.m_csize; intptr_t refStrideC = ref->reconPic->m_strideC; - intptr_t refOffset = (qmv.x >> shiftHor) + (qmv.y >> shiftVer) * refStrideC; + intptr_t refOffset = (mvx >> 3) + (mvy >> 3) * refStrideC; const pixel* refCb = ref->getCbAddr(ctuAddr, absPartIdx) + refOffset; const pixel* refCr = ref->getCrAddr(ctuAddr, absPartIdx) + refOffset; @@ -1219,48 +1225,46 @@ X265_CHECK((hshift == 0) || (hshift == 1), "hshift must be 0 or 1\n"); X265_CHECK((vshift == 0) || (vshift == 1), "vshift must be 0 or 1\n"); - xFrac = qmv.x & (hshift ? 7 : 3); - yFrac = qmv.y & (vshift ? 7 : 3); + xFrac = mvx & 7; + yFrac = mvy & 7; if (!(yFrac | xFrac)) { - cost += chromaSatd(fencPUYuv.m_buf[1], lclStride, refCb, refStrideC); - cost += chromaSatd(fencPUYuv.m_buf[2], lclStride, refCr, refStrideC); + cost += chromaSatd(fencPUYuv.m_buf[1], fencStrideC, refCb, refStrideC); + cost += chromaSatd(fencPUYuv.m_buf[2], fencStrideC, refCr, refStrideC); } else { - ALIGN_VAR_32(pixel, subpelbuf[64 * 64]); + int blockwidthC = blockwidth >> hshift; + if (!yFrac) { - primitives.chroma[csp].pu[partEnum].filter_hpp(refCb, refStrideC, subpelbuf, lclStride, xFrac << (1 - hshift)); - cost += chromaSatd(fencPUYuv.m_buf[1], lclStride, subpelbuf, lclStride); + primitives.chroma[csp].pu[partEnum].filter_hpp(refCb, refStrideC, subpelbuf, blockwidthC, xFrac); + cost += chromaSatd(fencPUYuv.m_buf[1], fencStrideC, subpelbuf, blockwidthC); - primitives.chroma[csp].pu[partEnum].filter_hpp(refCr, refStrideC, subpelbuf, lclStride, xFrac << (1 - hshift)); - cost += chromaSatd(fencPUYuv.m_buf[2], lclStride, subpelbuf, lclStride); + primitives.chroma[csp].pu[partEnum].filter_hpp(refCr, refStrideC, subpelbuf, blockwidthC, xFrac); + cost += chromaSatd(fencPUYuv.m_buf[2], fencStrideC, subpelbuf, blockwidthC); } else if (!xFrac) { - primitives.chroma[csp].pu[partEnum].filter_vpp(refCb, refStrideC, subpelbuf, lclStride, yFrac << (1 - vshift)); - cost += chromaSatd(fencPUYuv.m_buf[1], lclStride, subpelbuf, lclStride); + primitives.chroma[csp].pu[partEnum].filter_vpp(refCb, refStrideC, subpelbuf, blockwidthC, yFrac); + cost += chromaSatd(fencPUYuv.m_buf[1], fencStrideC, subpelbuf, blockwidthC); - primitives.chroma[csp].pu[partEnum].filter_vpp(refCr, refStrideC, subpelbuf, lclStride, yFrac << (1 - vshift)); - cost += chromaSatd(fencPUYuv.m_buf[2], lclStride, subpelbuf, lclStride); + primitives.chroma[csp].pu[partEnum].filter_vpp(refCr, refStrideC, 
subpelbuf, blockwidthC, yFrac); + cost += chromaSatd(fencPUYuv.m_buf[2], fencStrideC, subpelbuf, blockwidthC); } else { - ALIGN_VAR_32(int16_t, immed[64 * (64 + NTAPS_CHROMA)]); - - int extStride = blockwidth >> hshift; - int filterSize = NTAPS_CHROMA; - int halfFilterSize = (filterSize >> 1); + ALIGN_VAR_32(int16_t, immed[MAX_CU_SIZE * (MAX_CU_SIZE + NTAPS_LUMA - 1)]); + const int halfFilterSize = (NTAPS_CHROMA >> 1); - primitives.chroma[csp].pu[partEnum].filter_hps(refCb, refStrideC, immed, extStride, xFrac << (1 - hshift), 1); - primitives.chroma[csp].pu[partEnum].filter_vsp(immed + (halfFilterSize - 1) * extStride, extStride, subpelbuf, lclStride, yFrac << (1 - vshift)); - cost += chromaSatd(fencPUYuv.m_buf[1], lclStride, subpelbuf, lclStride); + primitives.chroma[csp].pu[partEnum].filter_hps(refCb, refStrideC, immed, blockwidthC, xFrac, 1); + primitives.chroma[csp].pu[partEnum].filter_vsp(immed + (halfFilterSize - 1) * blockwidthC, blockwidthC, subpelbuf, blockwidthC, yFrac); + cost += chromaSatd(fencPUYuv.m_buf[1], fencStrideC, subpelbuf, blockwidthC); - primitives.chroma[csp].pu[partEnum].filter_hps(refCr, refStrideC, immed, extStride, xFrac << (1 - hshift), 1); - primitives.chroma[csp].pu[partEnum].filter_vsp(immed + (halfFilterSize - 1) * extStride, extStride, subpelbuf, lclStride, yFrac << (1 - vshift)); - cost += chromaSatd(fencPUYuv.m_buf[2], lclStride, subpelbuf, lclStride); + primitives.chroma[csp].pu[partEnum].filter_hps(refCr, refStrideC, immed, blockwidthC, xFrac, 1); + primitives.chroma[csp].pu[partEnum].filter_vsp(immed + (halfFilterSize - 1) * blockwidthC, blockwidthC, subpelbuf, blockwidthC, yFrac); + cost += chromaSatd(fencPUYuv.m_buf[2], fencStrideC, subpelbuf, blockwidthC); } } }
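The reworked chroma path converts the quarter-pel luma MV into chroma-sample units once, up front: shifting by (1 - hshift) yields an eighth-pel chroma MV both for 4:2:0 (shift of 0, since chroma is half resolution) and 4:4:4 (value doubled). Integer offset and fractional phase then fall out of >> 3 and & 7. A hedged standalone helper showing the arithmetic:

#include <cstdint>

struct MV { int x, y; };

// Quarter-pel luma MV -> chroma integer offset plus eighth-pel phase.
static void chromaMvParts(MV qmv, int hshift, int vshift, intptr_t strideC,
                          intptr_t& refOffset, int& xFrac, int& yFrac)
{
    int mvx = qmv.x << (1 - hshift);   // eighth-pel in chroma samples
    int mvy = qmv.y << (1 - vshift);
    refOffset = (mvx >> 3) + (mvy >> 3) * strideC;
    xFrac = mvx & 7;
    yFrac = mvy & 7;
}

This replaces the earlier per-use (qmv.x >> shiftHor) and (qmv.x & (hshift ? 7 : 3)) pattern, so the interpolation calls can take the already-scaled fractions directly.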
View file
x265_1.9.tar.gz/source/encoder/motion.h -> x265_2.0.tar.gz/source/encoder/motion.h
Changed
@@ -70,12 +70,12 @@
     static void initScales();
     static int hpelIterationCount(int subme);
-    void init(int method, int refine, int csp);
+    void init(int csp);

     /* Methods called at slice setup */

-    void setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight);
-    void setSourcePU(const Yuv& srcFencYuv, int ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight);
+    void setSourcePU(pixel *fencY, intptr_t stride, intptr_t offset, int pwidth, int pheight, const int searchMethod, const int subpelRefine);
+    void setSourcePU(const Yuv& srcFencYuv, int ctuAddr, int cuPartIdx, int puPartIdx, int pwidth, int pheight, const int searchMethod, const int subpelRefine, bool bChroma);

     /* buf*() and motionEstimate() methods all use cached fenc pixels and thus
      * require setSourcePU() to be called prior. */
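The effect of this header change is that the search method and subpel refinement now travel with each setSourcePU() call instead of being fixed once at init(), so they can vary per PU. A toy stub showing the resulting call pattern; the class below is a stand-in written for this page under that assumption, not the real MotionEstimate:

    #include <cstdio>

    struct MotionEstimateSketch
    {
        int csp = 0, searchMethod = 0, subpelRefine = 0;

        void init(int c) { csp = c; }  // csp only; no fixed method/refine

        void setSourcePU(int pwidth, int pheight, int method, int refine)
        {
            searchMethod = method;     // chosen per PU now
            subpelRefine = refine;
            std::printf("PU %dx%d: method %d, refine %d\n", pwidth, pheight, method, refine);
        }
    };

    int main()
    {
        MotionEstimateSketch me;
        me.init(1);                    // e.g. a 4:2:0 encode
        me.setSourcePU(16, 16, 3, 2);  // a large PU with a thorough search
        me.setSourcePU(8, 8, 1, 1);    // a small PU may use lighter settings
        return 0;
    }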
View file
x265_1.9.tar.gz/source/encoder/ratecontrol.cpp -> x265_2.0.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -53,7 +53,7 @@
     {\
         bErr = 0;\
         p = strstr(opts, opt "=");\
-        char* q = strstr(opts, "no-"opt);\
+        char* q = strstr(opts, "no-" opt);\
         if (p && sscanf(p, opt "=%d" , &i) && param_val != i)\
             bErr = 1;\
         else if (!param_val && !q && !p)\
@@ -91,24 +91,6 @@
     return z + lut[x];
 }

-inline void reduceFraction(int* n, int* d)
-{
-    int a = *n;
-    int b = *d;
-    int c;
-    if (!a || !b)
-        return;
-    c = a % b;
-    while (c)
-    {
-        a = b;
-        b = c;
-        c = a % b;
-    }
-    *n /= b;
-    *d /= b;
-}
-
 inline char *strcatFilename(const char *input, const char *suffix)
 {
     char *output = X265_MALLOC(char, strlen(input) + strlen(suffix) + 1);
@@ -190,6 +172,8 @@
     m_numEntries = 0;
     m_isSceneTransition = false;
     m_lastPredictorReset = 0;
+    m_avgPFrameQp = 0;
+    m_isFirstMiniGop = false;
     if (m_param->rc.rateControlMode == X265_RC_CRF)
     {
         m_param->rc.qp = (int)m_param->rc.rfConstant;
@@ -212,7 +196,7 @@
         m_rateFactorMaxDecrement = m_param->rc.rfConstant - m_param->rc.rfConstantMin;
     }
     m_isAbr = m_param->rc.rateControlMode != X265_RC_CQP && !m_param->rc.bStatRead;
-    m_2pass = (m_param->rc.rateControlMode == X265_RC_ABR || m_param->rc.vbvMaxBitrate > 0) && m_param->rc.bStatRead;
+    m_2pass = m_param->rc.rateControlMode != X265_RC_CQP && m_param->rc.bStatRead;
     m_bitrate = m_param->rc.bitrate * 1000;
     m_frameDuration = (double)m_param->fpsDenom / m_param->fpsNum;
     m_qp = m_param->rc.qp;
@@ -225,8 +209,10 @@
     m_statFileOut = NULL;
     m_cutreeStatFileOut = m_cutreeStatFileIn = NULL;
     m_rce2Pass = NULL;
+    m_encOrder = NULL;
     m_lastBsliceSatdCost = 0;
     m_movingAvgSum = 0.0;
+    m_isNextGop = false;

     // vbv initialization
     m_param->rc.vbvBufferSize = x265_clip3(0, 2000000, m_param->rc.vbvBufferSize);
@@ -288,9 +274,13 @@
     m_ipOffset = 6.0 * X265_LOG2(m_param->rc.ipFactor);
     m_pbOffset = 6.0 * X265_LOG2(m_param->rc.pbFactor);

+    for (int i = 0; i < QP_MAX_MAX; i++)
+        m_qpToEncodedBits[i] = 0;
+
     /* Adjust the first frame in order to stabilize the quality level compared to the rest */
 #define ABR_INIT_QP_MIN (24)
-#define ABR_INIT_QP_MAX (40)
+#define ABR_INIT_QP_MAX (37)
+#define ABR_INIT_QP_GRAIN_MAX (33)
 #define ABR_SCENECUT_INIT_QP_MIN (12)
 #define CRF_INIT_QP (int)m_param->rc.rfConstant
     for (int i = 0; i < 3; i++)
@@ -361,6 +351,7 @@
         m_amortizeFraction = 0.85;
         m_amortizeFrames = m_param->totalFrames / 2;
     }
+
     for (int i = 0; i < s_slidingWindowFrames; i++)
     {
         m_satdCostWindow[i] = 0;
@@ -370,15 +361,22 @@
     m_isPatternPresent = false;
     m_numBframesInPattern = 0;

-    /* 720p videos seem to be a good cutoff for cplxrSum */
-    double tuneCplxFactor = (m_param->rc.cuTree && m_ncu > 3600) ? 2.5 : 1;
+    m_isGrainEnabled = false;
+    if(m_param->rc.bEnableGrain) // tune for grainy content OR equal p-b frame sizes
+        m_isGrainEnabled = true;
+    for (int i = 0; i < 3; i++)
+        m_lastQScaleFor[i] = x265_qp2qScale(m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN);
+    m_avgPFrameQp = 0 ;

+    /* 720p videos seem to be a good cutoff for cplxrSum */
+    double tuneCplxFactor = (m_ncu > 3600 && m_param->rc.cuTree) ? 2.5 : m_isGrainEnabled ? 1.9 : 1;
     /* estimated ratio that produces a reasonable QP for the first I-frame */
     m_cplxrSum = .01 * pow(7.0e5, m_qCompress) * pow(m_ncu, 0.5) * tuneCplxFactor;
     m_wantedBitsWindow = m_bitrate * m_frameDuration;
     m_accumPNorm = .01;
     m_accumPQp = (m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN) * m_accumPNorm;
+
     /* Frame Predictors used in vbv */
     initFramePredictors();
     if (!m_statFileOut && (m_param->rc.bStatWrite || m_param->rc.bStatRead))
@@ -401,11 +399,11 @@
             char *tmpFile = strcatFilename(fileName, ".cutree");
             if (!tmpFile)
                 return false;
-            m_cutreeStatFileIn = fopen(tmpFile, "rb");
+            m_cutreeStatFileIn = x265_fopen(tmpFile, "rb");
             X265_FREE(tmpFile);
             if (!m_cutreeStatFileIn)
             {
-                x265_log(m_param, X265_LOG_ERROR, "can't open stats file %s\n", tmpFile);
+                x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.cutree\n", fileName);
                 return false;
             }
         }
@@ -417,7 +415,7 @@
             return false;
         }
         {
-            int i, j;
+            int i, j, m;
             uint32_t k , l;
             bool bErr = false;
             char *opts = statsBuf;
@@ -439,6 +437,11 @@
                 x265_log(m_param, X265_LOG_ERROR, "fps specified in stats file not valid\n");
                 return false;
             }
+            if (((p = strstr(opts, " vbv-maxrate=")) == 0 || sscanf(p, " vbv-maxrate=%d", &m) != 1) && m_param->rc.rateControlMode == X265_RC_CRF)
+            {
+                x265_log(m_param, X265_LOG_ERROR, "Constant rate-factor is incompatible with 2pass without vbv-maxrate in the previous pass\n");
+                return false;
+            }
             if (k != m_param->fpsNum || l != m_param->fpsDenom)
             {
                 x265_log(m_param, X265_LOG_ERROR, "fps mismatch with 1st pass (%u/%u vs %u/%u)\n",
@@ -564,8 +567,10 @@
             p = next;
         }
         X265_FREE(statsBuf);
-        if (m_param->rc.rateControlMode == X265_RC_ABR || m_param->rc.vbvMaxBitrate > 0)
+        if (m_param->rc.rateControlMode != X265_RC_CQP)
         {
+            m_start = 0;
+            m_isQpModified = true;
             if (!initPass2())
                 return false;
         } /* else we're using constant quant, so no need to run the bitrate allocation */
@@ -579,11 +584,11 @@
         statFileTmpname = strcatFilename(fileName, ".temp");
         if (!statFileTmpname)
             return false;
-        m_statFileOut = fopen(statFileTmpname, "wb");
+        m_statFileOut = x265_fopen(statFileTmpname, "wb");
         X265_FREE(statFileTmpname);
         if (!m_statFileOut)
         {
-            x265_log(m_param, X265_LOG_ERROR, "can't open stats file %s\n", statFileTmpname);
+            x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.temp\n", fileName);
             return false;
         }
         p = x265_param2string(m_param);
@@ -595,11 +600,11 @@
             statFileTmpname = strcatFilename(fileName, ".cutree.temp");
             if (!statFileTmpname)
                 return false;
-            m_cutreeStatFileOut = fopen(statFileTmpname, "wb");
+            m_cutreeStatFileOut = x265_fopen(statFileTmpname, "wb");
             X265_FREE(statFileTmpname);
             if (!m_cutreeStatFileOut)
             {
-                x265_log(m_param, X265_LOG_ERROR, "can't open mbtree stats file %s\n", statFileTmpname);
+                x265_log_file(m_param, X265_LOG_ERROR, "can't open mbtree stats file %s.cutree.temp\n", fileName);
                 return false;
             }
         }
@@ -647,7 +652,7 @@
 #undef MAX_DURATION
 }

-bool RateControl::analyseABR2Pass(int startIndex, int endIndex, uint64_t allAvailableBits)
+bool RateControl::analyseABR2Pass(uint64_t allAvailableBits)
 {
     double rateFactor, stepMult;
     double qBlur = m_param->rc.qblur;
@@ -657,21 +662,21 @@
     double *qScale, *blurredQscale;
     double baseCplx = m_ncu * (m_param->bframes ? 120 : 80);
     double clippedDuration = CLIP_DURATION(m_frameDuration) / BASE_FRAME_DURATION;
-    int framesCount = endIndex - startIndex + 1;
     /* Blur complexities, to reduce local fluctuation of QP.
      * We don't blur the QPs directly, because then one very simple frame
      * could drag down the QP of a nearby complex frame and give it more
      * bits than intended. */
-    for (int i = startIndex; i <= endIndex; i++)
+    for (int i = 0; i < m_numEntries; i++)
     {
         double weightSum = 0;
         double cplxSum = 0;
         double weight = 1.0;
         double gaussianWeight;
         /* weighted average of cplx of future frames */
-        for (int j = 1; j < cplxBlur * 2 && j <= endIndex - i; j++)
+        for (int j = 1; j < cplxBlur * 2 && j < m_numEntries - i; j++)
         {
-            RateControlEntry *rcj = &m_rce2Pass[i + j];
+            int index = m_encOrder[i + j];
+            RateControlEntry *rcj = &m_rce2Pass[index];
             weight *= 1 - pow(rcj->iCuCount / m_ncu, 2);
             if (weight < 0.0001)
                 break;
@@ -683,7 +688,8 @@
         weight = 1.0;
         for (int j = 0; j <= cplxBlur * 2 && j <= i; j++)
         {
-            RateControlEntry *rcj = &m_rce2Pass[i - j];
+            int index = m_encOrder[i - j];
+            RateControlEntry *rcj = &m_rce2Pass[index];
             gaussianWeight = weight * exp(-j * j / 200.0);
             weightSum += gaussianWeight;
             cplxSum += gaussianWeight * (qScale2bits(rcj, 1) - rcj->miscBits) / clippedDuration;
@@ -691,12 +697,12 @@
             if (weight < .0001)
                 break;
         }
-        m_rce2Pass[i].blurredComplexity = cplxSum / weightSum;
+        m_rce2Pass[m_encOrder[i]].blurredComplexity = cplxSum / weightSum;
     }
-    CHECKED_MALLOC(qScale, double, framesCount);
+    CHECKED_MALLOC(qScale, double, m_numEntries);
     if (filterSize > 1)
     {
-        CHECKED_MALLOC(blurredQscale, double, framesCount);
+        CHECKED_MALLOC(blurredQscale, double, m_numEntries);
     }
     else
         blurredQscale = qScale;
@@ -708,9 +714,9 @@
      * approximation of scaling the 1st pass by the ratio of bitrates.
      * The search range is probably overkill, but speed doesn't matter here. */
     expectedBits = 1;
-    for (int i = startIndex; i <= endIndex; i++)
+    for (int i = 0; i < m_numEntries; i++)
     {
-        RateControlEntry* rce = &m_rce2Pass[i];
+        RateControlEntry* rce = &m_rce2Pass[m_encOrder[i]];
         double q = getQScale(rce, 1.0);
         expectedBits += qScale2bits(rce, q);
         m_lastQScaleFor[rce->sliceType] = q;
@@ -733,7 +739,7 @@
         /* find qscale */
         for (int i = 0; i < m_numEntries; i++)
         {
-            RateControlEntry *rce = &m_rce2Pass[i];
+            RateControlEntry *rce = &m_rce2Pass[m_encOrder[i]];
             qScale[i] = getQScale(rce, rateFactor);
             m_lastQScaleFor[rce->sliceType] = qScale[i];
         }
@@ -741,7 +747,7 @@
         /* fixed I/B qscale relative to P */
         for (int i = m_numEntries - 1; i >= 0; i--)
         {
-            qScale[i] = getDiffLimitedQScale(&m_rce2Pass[i], qScale[i]);
+            qScale[i] = getDiffLimitedQScale(&m_rce2Pass[m_encOrder[i]], qScale[i]);
             X265_CHECK(qScale[i] >= 0, "qScale became negative\n");
         }
@@ -760,7 +766,7 @@
                 double coeff = qBlur == 0 ? 1.0 : exp(-d * d / (qBlur * qBlur));
                 if (idx < 0 || idx >= m_numEntries)
                     continue;
-                if (m_rce2Pass[i].sliceType != m_rce2Pass[idx].sliceType)
+                if (m_rce2Pass[m_encOrder[i]].sliceType != m_rce2Pass[m_encOrder[idx]].sliceType)
                     continue;
                 q += qScale[idx] * coeff;
                 sum += coeff;
@@ -772,7 +778,7 @@
         /* find expected bits */
         for (int i = 0; i < m_numEntries; i++)
         {
-            RateControlEntry *rce = &m_rce2Pass[i];
+            RateControlEntry *rce = &m_rce2Pass[m_encOrder[i]];
             rce->newQScale = clipQscale(NULL, rce, blurredQscale[i]); // check if needed
             X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n");
             expectedBits += qScale2bits(rce, rce->newQScale);
@@ -786,9 +792,9 @@
     if (filterSize > 1)
         X265_FREE(blurredQscale);
     if (m_isVbv)
-        if (!vbv2Pass(allAvailableBits, endIndex, startIndex))
+        if (!vbv2Pass(allAvailableBits, m_numEntries - 1, 0))
             return false;
-    expectedBits = countExpectedBits(startIndex, endIndex);
+    expectedBits = countExpectedBits(0, m_numEntries - 1);
     if (fabs(expectedBits / allAvailableBits - 1.0) > 0.01)
     {
         double avgq = 0;
@@ -826,13 +832,12 @@
     uint64_t allConstBits = 0, allCodedBits = 0;
     uint64_t allAvailableBits = uint64_t(m_param->rc.bitrate * 1000. * m_numEntries * m_frameDuration);
     int startIndex, framesCount, endIndex;
-    int fps = (int)(m_fps + 0.5);
+    int fps = X265_MIN(m_param->keyframeMax, (int)(m_fps + 0.5));
     startIndex = endIndex = framesCount = 0;
-    bool isQpModified = true;
     int diffQp = 0;
     double targetBits = 0;
     double expectedBits = 0;
-    for (startIndex = 0, endIndex = 0; endIndex < m_numEntries; endIndex++)
+    for (startIndex = m_start, endIndex = m_start; endIndex < m_numEntries; endIndex++)
     {
         allConstBits += m_rce2Pass[endIndex].miscBits;
         allCodedBits += m_rce2Pass[endIndex].coeffBits + m_rce2Pass[endIndex].mvBits;
@@ -846,11 +851,16 @@
     {
         if (diffQp >= 1)
         {
-            if (!isQpModified && endIndex > fps)
+            if (!m_isQpModified && endIndex > fps)
             {
                 double factor = 2;
                 double step = 0;
-                for (int start = endIndex; start <= endIndex + fps - 1 && start < m_numEntries; start++)
+                if (endIndex + fps >= m_numEntries)
+                {
+                    m_start = endIndex - (endIndex % fps);
+                    return true;
+                }
+                for (int start = endIndex + 1; start <= endIndex + fps && start < m_numEntries; start++)
                 {
                     RateControlEntry *rce = &m_rce2Pass[start];
                     targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv));
@@ -858,12 +868,13 @@
                 }
                 if (expectedBits < 0.95 * targetBits)
                 {
-                    isQpModified = true;
+                    m_isQpModified = true;
+                    m_isGopReEncoded = true;
                     while (endIndex + fps < m_numEntries)
                     {
                         step = pow(2, factor / 6.0);
                         expectedBits = 0;
-                        for (int start = endIndex; start <= endIndex + fps - 1; start++)
+                        for (int start = endIndex + 1; start <= endIndex + fps; start++)
                         {
                             RateControlEntry *rce = &m_rce2Pass[start];
                             rce->newQScale = rce->qScale / step;
@@ -878,13 +889,13 @@
                    }

                    if (m_isVbv && endIndex + fps < m_numEntries)
-                        if (!vbv2Pass((uint64_t)targetBits, endIndex + fps - 1, endIndex))
+                        if (!vbv2Pass((uint64_t)targetBits, endIndex + fps, endIndex + 1))
                            return false;

                    targetBits = 0;
                    expectedBits = 0;

-                    for (int start = endIndex - fps; start <= endIndex - 1; start++)
+                    for (int start = endIndex - fps + 1; start <= endIndex; start++)
                    {
                        RateControlEntry *rce = &m_rce2Pass[start];
                        targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv));
@@ -893,7 +904,7 @@
                    {
                        step = pow(2, factor / 6.0);
                        expectedBits = 0;
-                        for (int start = endIndex - fps; start <= endIndex - 1; start++)
+                        for (int start = endIndex - fps + 1; start <= endIndex; start++)
                        {
                            RateControlEntry *rce = &m_rce2Pass[start];
                            rce->newQScale = rce->qScale * step;
@@ -907,10 +918,13 @@
                        break;
                    }
                    if (m_isVbv)
-                        if (!vbv2Pass((uint64_t)targetBits, endIndex - 1, endIndex - fps))
+                        if (!vbv2Pass((uint64_t)targetBits, endIndex, endIndex - fps + 1))
                            return false;
                    diffQp = 0;
+                    m_reencode = endIndex - fps + 1;
+                    endIndex = endIndex + fps;
                    startIndex = endIndex + 1;
+                    m_start = startIndex;
                    targetBits = expectedBits = 0;
                }
                else
@@ -918,7 +932,7 @@
                }
            }
        }
        else
-            isQpModified = false;
+            m_isQpModified = false;
    }
}
@@ -931,9 +945,12 @@
            (int)(allConstBits * m_fps / framesCount * 1000.));
        return false;
    }
-    if (!analyseABR2Pass(0, m_numEntries - 1, allAvailableBits))
+    if (!analyseABR2Pass(allAvailableBits))
        return false;
    }
+
+    m_start = X265_MAX(m_start, endIndex - fps);
+
    return true;
}
@@ -1049,12 +1066,12 @@
    }
    m_pred[0].coeff = m_pred[3].coeff = 0.75;
    m_pred[0].coeffMin = m_pred[3].coeffMin = 0.75 / 4;
-    if (m_param->rc.qCompress >= 0.8) // when tuned for grain
+    if (m_isGrainEnabled) // when tuned for grain
    {
        m_pred[1].coeffMin = 0.75 / 4;
        m_pred[1].coeff = 0.75;
-        m_pred[0].coeff = m_pred[3].coeff = 0.5;
-        m_pred[0].coeffMin = m_pred[3].coeffMin = 0.5 / 4;
+        m_pred[0].coeff = m_pred[3].coeff = 0.75;
+        m_pred[0].coeffMin = m_pred[3].coeffMin = 0.75 / 4;
    }
}
@@ -1088,11 +1105,15 @@
        copyRceData(rce, &m_rce2Pass[index]);
    }
    rce->isActive = true;
+    rce->scenecut = false;
    bool isRefFrameScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refFrameList[0][0]->m_lowres.bScenecut;
+    m_isFirstMiniGop = m_sliceType == I_SLICE ? true : m_isFirstMiniGop;
    if (curFrame->m_lowres.bScenecut)
    {
        m_isSceneTransition = true;
+        rce->scenecut = true;
        m_lastPredictorReset = rce->encodeOrder;
+
        initFramePredictors();
    }
    else if (m_sliceType != B_SLICE && !isRefFrameScenecut)
@@ -1197,6 +1218,7 @@
    double q = x265_qScale2qp(rateEstimateQscale(curFrame, rce));
    q = x265_clip3((double)QP_MIN, (double)QP_MAX_MAX, q);
    m_qp = int(q + 0.5);
+    q = m_isGrainEnabled ? m_qp : q;
    rce->qpaRc = curEncData.m_avgQpRc = curEncData.m_avgQpAq = q;
    /* copy value of lastRceq into thread local rce struct *to be used in RateControlEnd() */
    rce->qRceq = m_lastRceq;
@@ -1322,14 +1344,6 @@
        m_accumPNorm = mask * (1 + m_accumPNorm);
    }

-    x265_zone* zone = getZone();
-    if (zone)
-    {
-        if (zone->bForceQp)
-            q = x265_qp2qScale(zone->qp);
-        else
-            q /= zone->bitrateFactor;
-    }
    return q;
}

double RateControl::countExpectedBits(int startPos, int endPos)
@@ -1418,12 +1432,9 @@
        }
        while(type != sliceTypeActual);
    }
+    primitives.fix8Unpack(frame->m_lowres.qpCuTreeOffset, m_cuTreeStats.qpBuffer[m_cuTreeStats.qpBufPos], m_ncu);
    for (int i = 0; i < m_ncu; i++)
-    {
-        int16_t qpFix8 = m_cuTreeStats.qpBuffer[m_cuTreeStats.qpBufPos][i];
-        frame->m_lowres.qpCuTreeOffset[i] = (double)(qpFix8) / 256.0;
        frame->m_lowres.invQscaleFactor[i] = x265_exp2fix8(frame->m_lowres.qpCuTreeOffset[i]);
-    }
    m_cuTreeStats.qpBufPos--;
}
return true;
@@ -1436,8 +1447,6 @@
double RateControl::tuneAbrQScaleFromFeedback(double qScale)
{
    double abrBuffer = 2 * m_rateTolerance * m_bitrate;
-    if (m_currentSatd)
-    {
        /* use framesDone instead of POC as poc count is not serial with bframes enabled */
        double overflow = 1.0;
        double timeDone = (double)(m_framesDone - m_param->frameNumThreads + 1) * m_frameDuration;
@@ -1450,16 +1459,31 @@
        }
        if (wantedBits > 0 && encodedBits > 0 &&
            (!m_partialResidualFrames ||
-             m_param->rc.bStrictCbr))
+             m_param->rc.bStrictCbr || m_isGrainEnabled))
        {
            abrBuffer *= X265_MAX(1, sqrt(timeDone));
            overflow = x265_clip3(.5, 2.0, 1.0 + (encodedBits - wantedBits) / abrBuffer);
            qScale *= overflow;
        }
-    }
    return qScale;
}

+double RateControl::tuneQScaleForGrain(double rcOverflow)
+{
+    double qpstep = rcOverflow > 1.1 ? rcOverflow : m_lstep;
+    double qScaleAvg = x265_qp2qScale(m_avgPFrameQp);
+    double q = m_lastQScaleFor[P_SLICE];
+    int curQp = int (x265_qScale2qp(m_lastQScaleFor[P_SLICE]) + 0.5);
+    double curBitrate = m_qpToEncodedBits[curQp] * int(m_fps + 0.5);
+    int newQp = rcOverflow > 1.1 ? curQp + 2 : rcOverflow > 1 ? curQp + 1 : curQp - 1 ;
+    double projectedBitrate = int(m_fps + 0.5) * m_qpToEncodedBits[newQp];
+    if (curBitrate > 0 && projectedBitrate > 0)
+        q = abs(projectedBitrate - m_bitrate) < abs (curBitrate - m_bitrate) ? x265_qp2qScale(newQp) : m_lastQScaleFor[P_SLICE];
+    else
+        q = rcOverflow > 1 ? qScaleAvg * qpstep : rcOverflow < 1 ? qScaleAvg / qpstep : m_lastQScaleFor[P_SLICE];
+    return q;
+}
+
double RateControl::rateEstimateQscale(Frame* curFrame, RateControlEntry *rce)
{
    double q;
@@ -1525,6 +1549,7 @@
                q0 = q1;
            }
        }
+
        if (prevRefSlice->m_sliceType == B_SLICE && IS_REFERENCED(m_curSlice->m_refFrameList[0][0]))
            q0 -= m_pbOffset / 2;
        if (nextRefSlice->m_sliceType == B_SLICE && IS_REFERENCED(m_curSlice->m_refFrameList[1][0]))
@@ -1535,7 +1560,9 @@
            q = q1;
        else if (i1)
            q = q0;
-        else
+        else if(m_isGrainEnabled && !m_2pass)
+            q = q1;
+        else
            q = (q0 * dt1 + q1 * dt0) / (dt0 + dt1);

        if (IS_REFERENCED(curFrame))
@@ -1543,7 +1570,7 @@
        else
            q += m_pbOffset;

-        /* Set a min qp at scenechanges and transitions */
+        /* Set a min qp at scenechanges and transitions */
        if (m_isSceneTransition)
        {
            q = X265_MAX(ABR_SCENECUT_INIT_QP_MIN, q);
@@ -1553,11 +1580,28 @@
        double qScale = x265_qp2qScale(q);
        rce->qpNoVbv = q;
        double lmin = 0, lmax = 0;
+        if (m_isGrainEnabled && m_isFirstMiniGop)
+        {
+            lmin = m_lastQScaleFor[P_SLICE] / m_lstep;
+            lmax = m_lastQScaleFor[P_SLICE] * m_lstep;
+            double tunedQscale = tuneAbrQScaleFromFeedback(qScale);
+            double overflow = tunedQscale / qScale;
+            if (!m_isAbrReset)
+                qScale = x265_clip3(lmin, lmax, qScale);
+            m_avgPFrameQp = m_avgPFrameQp == 0 ? rce->qpNoVbv : m_avgPFrameQp;
+            if (overflow != 1)
+            {
+                qScale = tuneQScaleForGrain(overflow);
+                q = x265_qScale2qp(qScale);
+            }
+            rce->qpNoVbv = q;
+        }
        if (m_isVbv)
        {
            lmin = m_lastQScaleFor[P_SLICE] / m_lstep;
            lmax = m_lastQScaleFor[P_SLICE] * m_lstep;
-            if (m_isCbr)
+
+            if (m_isCbr && !m_isGrainEnabled)
            {
                qScale = tuneAbrQScaleFromFeedback(qScale);
                if (!m_isAbrReset)
@@ -1581,7 +1625,17 @@
            rce->frameSizePlanned = X265_MIN(rce->frameSizePlanned, rce->frameSizeMaximum);
            rce->frameSizeEstimated = rce->frameSizePlanned;
        }
+
        rce->newQScale = qScale;
+        if(rce->bLastMiniGopBFrame)
+        {
+            if (m_isFirstMiniGop && m_isGrainEnabled)
+            {
+                m_avgPFrameQp = (m_avgPFrameQp + rce->qpNoVbv) / 2;
+                m_lastQScaleFor[P_SLICE] = x265_qp2qScale(m_avgPFrameQp);
+            }
+            m_isFirstMiniGop = false;
+        }
        return qScale;
    }
    else
@@ -1608,6 +1662,14 @@
            }
            diff = m_predictedBits - (int64_t)rce->expectedBits;
            q = rce->newQScale;
+            x265_zone* zone = getZone();
+            if (zone)
+            {
+                if (zone->bForceQp)
+                    q = x265_qp2qScale(zone->qp);
+                else
+                    q /= zone->bitrateFactor;
+            }
            q /= x265_clip3(0.5, 2.0, (double)(abrBuffer - diff) / abrBuffer);
            if (m_expectedBitsSum > 0)
            {
@@ -1617,6 +1679,9 @@
                double w = x265_clip3(0.0, 1.0, curTime * 100);
                q *= pow((double)m_totalBits / m_expectedBitsSum, w);
            }
+            if (m_framesDone == 0 && m_param->rc.rateControlMode == X265_RC_ABR && m_isGrainEnabled)
+                q = X265_MIN(x265_qp2qScale(ABR_INIT_QP_GRAIN_MAX), q);
+
            rce->qpNoVbv = x265_qScale2qp(q);
            if (m_isVbv)
            {
@@ -1669,21 +1734,50 @@
        if (m_param->rc.rateControlMode == X265_RC_CRF)
        {
            q = getQScale(rce, m_rateFactorConstant);
+            x265_zone* zone = getZone();
+            if (zone)
+            {
+                if (zone->bForceQp)
+                    q = x265_qp2qScale(zone->qp);
+                else
+                    q /= zone->bitrateFactor;
+            }
        }
        else
        {
            if (!m_param->rc.bStatRead)
                checkAndResetABR(rce, false);
            double initialQScale = getQScale(rce, m_wantedBitsWindow / m_cplxrSum);
-            q = tuneAbrQScaleFromFeedback(initialQScale);
-            overflow = q / initialQScale;
+            x265_zone* zone = getZone();
+            if (zone)
+            {
+                if (zone->bForceQp)
+                    initialQScale = x265_qp2qScale(zone->qp);
+                else
+                    initialQScale /= zone->bitrateFactor;
+            }
+            double tunedQScale = tuneAbrQScaleFromFeedback(initialQScale);
+            overflow = tunedQScale / initialQScale;
+            q = !m_partialResidualFrames? tunedQScale : initialQScale;
+            bool isEncodeEnd = (m_param->totalFrames &&
+                m_framesDone > 0.75 * m_param->totalFrames) ? 1 : 0;
+            bool isEncodeBeg = m_framesDone < (int)(m_fps + 0.5);
+            if (m_isGrainEnabled)
+            {
+                if(m_sliceType!= I_SLICE && m_framesDone && !isEncodeEnd &&
+                    ((overflow < 1.05 && overflow > 0.95) || isEncodeBeg))
+                {
+                    q = tuneQScaleForGrain(overflow);
+                }
+            }
        }
-        if (m_sliceType == I_SLICE && m_param->keyframeMax > 1
-            && m_lastNonBPictType != I_SLICE && !m_isAbrReset)
+        if ((m_sliceType == I_SLICE && m_param->keyframeMax > 1
+            && m_lastNonBPictType != I_SLICE && !m_isAbrReset) || (m_isNextGop && !m_framesDone))
        {
            if (!m_param->rc.bStrictCbr)
                q = x265_qp2qScale(m_accumPQp / m_accumPNorm);
            q /= fabs(m_param->rc.ipFactor);
+            m_avgPFrameQp = 0;
        }
        else if (m_framesDone > 0)
        {
@@ -1691,7 +1785,7 @@
            {
                lqmin = m_lastQScaleFor[m_sliceType] / m_lstep;
                lqmax = m_lastQScaleFor[m_sliceType] * m_lstep;
-                if (!m_partialResidualFrames)
+                if (!m_partialResidualFrames || m_isGrainEnabled)
                {
                    if (overflow > 1.1 && m_framesDone > 3)
                        lqmax *= m_lstep;
@@ -1708,8 +1802,9 @@
        else if (m_framesDone == 0 && !m_isVbv && m_param->rc.rateControlMode == X265_RC_ABR)
        {
            /* for ABR alone, clip the first I frame qp */
-            lqmax = x265_qp2qScale(ABR_INIT_QP_MAX) * m_lstep;
-            q = X265_MIN(lqmax, q);
+            lqmax = (m_lstep * m_isGrainEnabled) ? x265_qp2qScale(ABR_INIT_QP_GRAIN_MAX) :
+                x265_qp2qScale(ABR_INIT_QP_MAX);
+            q = X265_MIN(lqmax, q);
        }
        q = x265_clip3(MIN_QPSCALE, MAX_MAX_QPSCALE, q);
        /* Set a min qp at scenechanges and transitions */
@@ -1720,6 +1815,11 @@
            m_lastQScaleFor[P_SLICE] = X265_MAX(minScenecutQscale, m_lastQScaleFor[P_SLICE]);
        }
        rce->qpNoVbv = x265_qScale2qp(q);
+        if(m_sliceType == P_SLICE)
+        {
+            m_avgPFrameQp = m_avgPFrameQp == 0 ? rce->qpNoVbv : m_avgPFrameQp;
+            m_avgPFrameQp = (m_avgPFrameQp + rce->qpNoVbv) / 2;
+        }
        q = clipQscale(curFrame, rce, q);
        /* clip qp to permissible range after vbv-lookahead estimation to avoid possible
         * mispredictions by initial frame size predictors, after each scenecut */
@@ -1806,7 +1906,7 @@
    double abrBuffer = 2 * m_rateTolerance * m_bitrate;

    // Check if current Slice is a scene cut that follows low detailed/blank frames
-    if (rce->lastSatd > 4 * rce->movingAvgSum)
+    if (rce->lastSatd > 4 * rce->movingAvgSum || rce->scenecut)
    {
        if (!m_isAbrReset && rce->movingAvgSum > 0
            && (m_isPatternPresent || !m_param->bframes))
@@ -1842,18 +1942,17 @@
    const HRDInfo* hrd = &vui->hrdParameters;
    int num = 90000;
    int denom = hrd->bitRateValue << (hrd->bitRateScale + BR_SHIFT);
-    reduceFraction(&num, &denom);
    int64_t cpbState = (int64_t)m_bufferFillFinal;
    int64_t cpbSize = (int64_t)hrd->cpbSizeValue << (hrd->cpbSizeScale + CPB_SHIFT);
    if (cpbState < 0 || cpbState > cpbSize)
    {
        x265_log(m_param, X265_LOG_WARNING, "CPB %s: %.0lf bits in a %.0lf-bit buffer\n",
-                 cpbState < 0 ? "underflow" : "overflow", (float)cpbState/denom, (float)cpbSize/denom);
+                 cpbState < 0 ? "underflow" : "overflow", (float)cpbState, (float)cpbSize);
    }

-    seiBP->m_initialCpbRemovalDelay = (uint32_t)(num * cpbState + denom) / denom;
-    seiBP->m_initialCpbRemovalDelayOffset = (uint32_t)((num * cpbSize + denom) / denom - seiBP->m_initialCpbRemovalDelay);
+    seiBP->m_initialCpbRemovalDelay = (uint32_t)(num * cpbState / denom);
+    seiBP->m_initialCpbRemovalDelayOffset = (uint32_t)(num * cpbSize / denom - seiBP->m_initialCpbRemovalDelay);
}

void RateControl::updateVbvPlan(Encoder* enc)
@@ -2084,8 +2183,6 @@
int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv)
{
-    if (m_param->rc.bStatRead && m_param->rc.rateControlMode == X265_RC_CRF)
-        return 0;
    FrameData& curEncData = *curFrame->m_encData;
    double qScaleVbv = x265_qp2qScale(qpVbv);
    uint64_t rowSatdCost = curEncData.m_rowStat[row].diagSatd;
@@ -2260,15 +2357,7 @@
        m_lastRceq = q;
        q /= rateFactor;
    }
-
-    x265_zone* zone = getZone();
-    if (zone)
-    {
-        if (zone->bForceQp)
-            q = x265_qp2qScale(zone->qp);
-        else
-            q /= zone->bitrateFactor;
-    }
+
    return q;
}
@@ -2336,21 +2425,25 @@
    {
        if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF))
        {
+            double avgQpRc = 0;
            /* determine avg QP decided by VBV rate control */
            for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
-                curEncData.m_avgQpRc += curEncData.m_rowStat[i].sumQpRc;
+                avgQpRc += curEncData.m_rowStat[i].sumQpRc;

-            curEncData.m_avgQpRc /= slice->m_sps->numCUsInFrame;
+            avgQpRc /= slice->m_sps->numCUsInFrame;
+            curEncData.m_avgQpRc = x265_clip3((double)QP_MIN, (double)QP_MAX_MAX, avgQpRc);
            rce->qpaRc = curEncData.m_avgQpRc;
        }

        if (m_param->rc.aqMode)
        {
+            double avgQpAq = 0;
            /* determine actual avg encoded QP, after AQ/cutree adjustments */
            for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++)
-                curEncData.m_avgQpAq += curEncData.m_rowStat[i].sumQpAq;
+                avgQpAq += curEncData.m_rowStat[i].sumQpAq;

-            curEncData.m_avgQpAq /= (slice->m_sps->numCUsInFrame * NUM_4x4_PARTITIONS);
+            avgQpAq /= (slice->m_sps->numCUsInFrame * NUM_4x4_PARTITIONS);
+            curEncData.m_avgQpAq = avgQpAq;
        }
        else
            curEncData.m_avgQpAq = curEncData.m_avgQpRc;
@@ -2367,13 +2460,13 @@
    bool is2passCrfChange = false;
    if (m_2pass)
    {
-        if (abs(curEncData.m_avgQpRc - rce->qpPrev) > 0.1)
+        if (fabs(curEncData.m_avgQpRc - rce->qpPrev) > 0.1)
        {
            qpRef = rce->qpPrev;
            is2passCrfChange = true;
        }
    }
-    if (is2passCrfChange || abs(qpRef - rce->qpNoVbv) > 0.5)
+    if (is2passCrfChange || fabs(qpRef - rce->qpNoVbv) > 0.5)
    {
        double crfFactor = rce->qRceq /x265_qp2qScale(qpRef);
        double baseCplx = m_ncu * (m_param->bframes ? 120 : 80);
@@ -2426,6 +2519,11 @@
            int pos = m_sliderPos - m_param->frameNumThreads;
            if (pos >= 0)
                m_encodedBitsWindow[pos % s_slidingWindowFrames] = actualBits;
+            if(rce->sliceType != I_SLICE)
+            {
+                int qp = int (rce->qpaRc + 0.5);
+                m_qpToEncodedBits[qp] = m_qpToEncodedBits[qp] == 0 ? actualBits : (m_qpToEncodedBits[qp] + actualBits) * 0.5;
+            }
        }

        if (m_2pass)
@@ -2493,8 +2591,7 @@
    if (m_param->rc.cuTree && IS_REFERENCED(curFrame) && !m_param->rc.bStatRead)
    {
        uint8_t sliceType = (uint8_t)rce->sliceType;
-        for (int i = 0; i < m_ncu; i++)
-            m_cuTreeStats.qpBuffer[0][i] = (uint16_t)(curFrame->m_lowres.qpCuTreeOffset[i] * 256.0);
+        primitives.fix8Pack(m_cuTreeStats.qpBuffer[0], curFrame->m_lowres.qpCuTreeOffset, m_ncu);
        if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1)
            goto writeFailure;
        if (fwrite(m_cuTreeStats.qpBuffer[0], sizeof(uint16_t), m_ncu, m_cutreeStatFileOut) < (size_t)m_ncu)
@@ -2542,13 +2639,12 @@
        int bError = 1;
        if (tmpFileName)
        {
-            unlink(fileName);
-            bError = rename(tmpFileName, fileName);
+            x265_unlink(fileName);
+            bError = x265_rename(tmpFileName, fileName);
        }
        if (bError)
        {
-            x265_log(m_param, X265_LOG_ERROR, "failed to rename output stats file to \"%s\"\n",
-                     fileName);
+            x265_log_file(m_param, X265_LOG_ERROR, "failed to rename output stats file to \"%s\"\n", fileName);
        }
        X265_FREE(tmpFileName);
    }
@@ -2561,13 +2657,12 @@
        int bError = 1;
        if (tmpFileName && newFileName)
        {
-            unlink(newFileName);
-            bError = rename(tmpFileName, newFileName);
+            x265_unlink(newFileName);
+            bError = x265_rename(tmpFileName, newFileName);
        }
        if (bError)
        {
-            x265_log(m_param, X265_LOG_ERROR, "failed to rename cutree output stats file to \"%s\"\n",
-                     newFileName);
+            x265_log_file(m_param, X265_LOG_ERROR, "failed to rename cutree output stats file to \"%s\"\n", newFileName);
        }
        X265_FREE(tmpFileName);
        X265_FREE(newFileName);
@@ -2577,6 +2672,7 @@
        fclose(m_cutreeStatFileIn);

    X265_FREE(m_rce2Pass);
+    X265_FREE(m_encOrder);
    for (int i = 0; i < 2; i++)
        X265_FREE(m_cuTreeStats.qpBuffer[i]);
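Two of the additions above carry most of the new rc-grain logic: the stats-update hunk near the end keeps a running per-QP average of encoded bits (m_qpToEncodedBits), and tuneQScaleForGrain() steps the QP by at most one or two and keeps whichever projected bitrate lands closer to the target. A self-contained sketch of that bookkeeping with made-up numbers; only the update and selection rules are taken from the diff, everything else here is illustrative:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        double qpToEncodedBits[52] = { 0 };
        double fps = 25.0, targetBitrate = 2000000.0; // bits per second

        // Update rule from the diff: first sample stored as-is,
        // later samples blended 50/50 (a simple exponential average).
        auto update = [&](int qp, double actualBits) {
            qpToEncodedBits[qp] = qpToEncodedBits[qp] == 0
                ? actualBits : (qpToEncodedBits[qp] + actualBits) * 0.5;
        };
        update(30, 90000.0);
        update(30, 84000.0);  // table[30] is now 87000
        update(31, 70000.0);

        // Selection rule from tuneQScaleForGrain(): step up on overflow,
        // down on underflow, keep the QP whose rate is nearer the target.
        int curQp = 30;
        double overflow = 1.04;  // mild overshoot reported by ABR feedback
        int newQp = overflow > 1.1 ? curQp + 2 : overflow > 1 ? curQp + 1 : curQp - 1;
        double curRate = qpToEncodedBits[curQp] * fps;
        double newRate = qpToEncodedBits[newQp] * fps;
        int chosen = std::fabs(newRate - targetBitrate) < std::fabs(curRate - targetBitrate)
            ? newQp : curQp;
        std::printf("QP %d -> %d? chose %d (%.0f vs %.0f bps)\n",
                    curQp, newQp, chosen, curRate, newRate);
        return 0;
    }

Restricting QP movement to small steps like this is what prevents the frame-to-frame quantiser oscillations that make film grain pulse.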
View file
x265_1.9.tar.gz/source/encoder/ratecontrol.h -> x265_2.0.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -107,6 +107,7 @@
     int miscBits;
     int coeffBits;
     bool keptAsRef;
+    bool scenecut;

     SEIPictureTiming *picTimingSEI;
     HRDTiming *hrdTiming;
@@ -126,8 +127,9 @@
     bool   m_isVbv;
     bool   m_isCbr;
     bool   m_singleFrameVbv;
-
+    bool   m_isGrainEnabled;
     bool   m_isAbrReset;
+    bool   m_isNextGop;
     int    m_lastAbrResetPoc;

     double m_rateTolerance;
@@ -141,7 +143,8 @@
     double m_vbvMaxRate;       /* in kbps */
     double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */
     double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */
-
+    double m_avgPFrameQp;
+    bool   m_isFirstMiniGop;
     Predictor m_pred[4];       /* Slice predictors to preidct bits for each Slice type - I,P,Bref and B */
     int64_t m_leadingNoBSatd;
     int    m_predType;         /* Type of slice predictors to be used - depends on the slice type */
@@ -178,7 +181,7 @@
     bool   m_isPatternPresent;
     bool   m_isSceneTransition;
     int    m_lastPredictorReset;
-
+    double m_qpToEncodedBits[QP_MAX_MAX + 1];
     /* a common variable on which rateControlStart, rateControlEnd and rateControUpdateStats waits to
      * sync the calls to these functions. For example
      * -F2:
@@ -202,7 +205,11 @@
     /* 2 pass */
     bool   m_2pass;
+    bool   m_isGopReEncoded;
+    bool   m_isQpModified;
     int    m_numEntries;
+    int    m_start;
+    int    m_reencode;
     FILE*  m_statFileOut;
     FILE*  m_cutreeStatFileOut;
     FILE*  m_cutreeStatFileIn;
@@ -235,6 +242,8 @@
     bool cuTreeReadFor2Pass(Frame* curFrame);
     void hrdFullness(SEIBufferingPeriod* sei);
     int writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce);
+    bool initPass2();
+
 protected:

     static const int s_slidingWindowFrames;
@@ -261,14 +270,14 @@
     double predictSize(Predictor *p, double q, double var);
     void checkAndResetABR(RateControlEntry* rce, bool isFrameDone);
     double predictRowsSizeSum(Frame* pic, RateControlEntry* rce, double qpm, int32_t& encodedBits);
-    bool initPass2();
-    bool analyseABR2Pass(int startPoc, int endPoc, uint64_t allAvailableBits);
+    bool analyseABR2Pass(uint64_t allAvailableBits);
     void initFramePredictors();
     double getDiffLimitedQScale(RateControlEntry *rce, double q);
     double countExpectedBits(int startPos, int framesCount);
     bool vbv2Pass(uint64_t allAvailableBits, int frameCount, int startPos);
     bool findUnderflow(double *fills, int *t0, int *t1, int over, int framesCount);
     bool fixUnderflow(int t0, int t1, double adjustment, double qscaleMin, double qscaleMax);
+    double tuneQScaleForGrain(double rcOverflow);
 };
 }
 #endif // ifndef X265_RATECONTROL_H
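A related simplification in the ratecontrol.cpp diff above is that hrdFullness() now derives the 90 kHz initial CPB removal delay directly with 64-bit products, which is why the deleted reduceFraction() helper is no longer needed. A worked example with invented bitrate and buffer values; the real denominator is bitRateValue << (bitRateScale + BR_SHIFT) from the HRD parameters:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const int64_t num = 90000;   // 90 kHz HRD clock
        int64_t denom    = 5000000;  // stand-in for bitRateValue << (bitRateScale + BR_SHIFT)
        int64_t cpbState = 3000000;  // current coded picture buffer fullness, bits
        int64_t cpbSize  = 6000000;  // buffer size, bits

        // Same expressions as the new hrdFullness(): the 64-bit product
        // num * cpbState cannot overflow for realistic values, so the
        // fraction no longer has to be reduced first.
        uint32_t delay  = (uint32_t)(num * cpbState / denom);
        uint32_t offset = (uint32_t)(num * cpbSize / denom - delay);
        std::printf("initial removal delay %u, offset %u (90 kHz ticks)\n", delay, offset);
        return 0;
    }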
View file
x265_1.9.tar.gz/source/encoder/reference.cpp -> x265_2.0.tar.gz/source/encoder/reference.cpp
Changed
@@ -68,7 +68,7 @@
     intptr_t stride = reconPic->m_stride;
     int cuHeight = g_maxCUSize;

-    for (int c = 0; c < (p.internalCsp != X265_CSP_I400 ? numInterpPlanes : 1); c++)
+    for (int c = 0; c < (p.internalCsp != X265_CSP_I400 && recPic->m_picCsp != X265_CSP_I400 ? numInterpPlanes : 1); c++)
     {
         if (c == 1)
         {
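This one-line change makes the interpolation loop consult the actual picture's colour space as well as the encoder's configured one, so a monochrome frame fed into a colour encode no longer has its absent chroma planes processed. A toy version of the plane-count guard; the constants match the ordering in x265.h, but the scenario and values are illustrative:

    #include <cstdio>

    int main()
    {
        const int X265_CSP_I400 = 0, X265_CSP_I420 = 1;
        int internalCsp = X265_CSP_I420; // encoder configured for 4:2:0
        int picCsp = X265_CSP_I400;      // but this input frame is grey-only
        int numInterpPlanes = 3;

        // The old code tested internalCsp alone and would touch 3 planes here.
        int planes = (internalCsp != X265_CSP_I400 && picCsp != X265_CSP_I400)
            ? numInterpPlanes : 1;
        std::printf("interpolating %d plane(s)\n", planes);
        return 0;
    }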
View file
x265_1.9.tar.gz/source/encoder/sao.cpp -> x265_2.0.tar.gz/source/encoder/sao.cpp
Changed
@@ -53,7 +53,7 @@
     return r;
 }

-inline int64_t estSaoDist(int32_t count, int offset, int32_t offsetOrg)
+inline int64_t estSaoDist(int32_t count, int32_t offset, int32_t offsetOrg)
 {
     return (count * offset - offsetOrg * 2) * offset;
 }
@@ -76,8 +76,6 @@
     m_countPreDblk = NULL;
     m_offsetOrgPreDblk = NULL;
     m_refDepth = 0;
-    m_lumaLambda = 0;
-    m_chromaLambda = 0;
     m_param = NULL;
     m_clipTable = NULL;
     m_clipTableBase = NULL;
@@ -120,8 +118,11 @@
     if (initCommon)
     {
-        CHECKED_MALLOC(m_countPreDblk, PerPlane, numCtu);
-        CHECKED_MALLOC(m_offsetOrgPreDblk, PerPlane, numCtu);
+        if (m_param->bSaoNonDeblocked)
+        {
+            CHECKED_MALLOC(m_countPreDblk, PerPlane, numCtu);
+            CHECKED_MALLOC(m_offsetOrgPreDblk, PerPlane, numCtu);
+        }
         CHECKED_MALLOC(m_depthSaoRate, double, 2 * SAO_DEPTHRATE_SIZE);

         m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + 0] = 0;
@@ -137,17 +138,16 @@
         m_clipTable = &(m_clipTableBase[rangeExt]);

         // Share with fast clip lookup table
-        if (initCommon)
-        {
-            for (int i = 0; i < rangeExt; i++)
-                m_clipTableBase[i] = 0;

-            for (int i = 0; i < maxY; i++)
-                m_clipTable[i] = (pixel)i;
+        for (int i = 0; i < rangeExt; i++)
+            m_clipTableBase[i] = 0;
+
+        for (int i = 0; i < maxY; i++)
+            m_clipTable[i] = (pixel)i;
+
+        for (int i = maxY; i < maxY + rangeExt; i++)
+            m_clipTable[i] = maxY;

-            for (int i = maxY; i < maxY + rangeExt; i++)
-                m_clipTable[i] = maxY;
-        }
     }
     else
     {
@@ -204,8 +204,11 @@
     if (destoryCommon)
     {
-        X265_FREE_ZERO(m_countPreDblk);
-        X265_FREE_ZERO(m_offsetOrgPreDblk);
+        if (m_param->bSaoNonDeblocked)
+        {
+            X265_FREE_ZERO(m_countPreDblk);
+            X265_FREE_ZERO(m_offsetOrgPreDblk);
+        }
         X265_FREE_ZERO(m_depthSaoRate);
         X265_FREE_ZERO(m_clipTableBase);
     }
@@ -221,17 +224,10 @@
         saoParam->ctuParam[i] = new SaoCtuParam[m_numCuInHeight * m_numCuInWidth];
     }
 }

-void SAO::startSlice(Frame* frame, Entropy& initState, int qp)
+void SAO::startSlice(Frame* frame, Entropy& initState)
 {
-    Slice* slice = frame->m_encData->m_slice;
-    int qpCb = qp;
-    if (m_param->internalCsp == X265_CSP_I420)
-        qpCb = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice->m_pps->chromaQpOffset[0]]);
-    else
-        qpCb = X265_MIN(qp + slice->m_pps->chromaQpOffset[0], QP_MAX_SPEC);
-    m_lumaLambda = x265_lambda2_tab[qp];
-    m_chromaLambda = x265_lambda2_tab[qpCb]; // Use Cb QP for SAO chroma
     m_frame = frame;
+    Slice* slice = m_frame->m_encData->m_slice;

     switch (slice->m_sliceType)
     {
@@ -259,7 +255,7 @@
     }
     saoParam->bSaoFlag[0] = true;
-    saoParam->bSaoFlag[1] = m_param->internalCsp != X265_CSP_I400;
+    saoParam->bSaoFlag[1] = m_param->internalCsp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400;

     m_numNoSao[0] = 0; // Luma
     m_numNoSao[1] = 0; // Chroma
@@ -275,9 +271,8 @@
 }

 // CTU-based SAO process without slice granularity
-void SAO::processSaoCu(int addr, int typeIdx, int plane)
+void SAO::applyPixelOffsets(int addr, int typeIdx, int plane)
 {
-    int x, y;
     PicYuv* reconPic = m_frame->m_reconPic;
     pixel* rec = reconPic->getPlaneAddr(plane, addr);
     intptr_t stride = plane ? reconPic->m_strideC : reconPic->m_stride;
@@ -302,20 +297,13 @@
     ctuWidth  = rpelx - lpelx;
     ctuHeight = bpely - tpely;

-    int startX;
-    int startY;
-    int endX;
-    int endY;
-    pixel* tmpL;
-    pixel* tmpU;
-
     int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1, signLeft1[2];
     int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1;

     memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */

-    tmpL = m_tmpL1[plane];
-    tmpU = &(m_tmpU[plane][lpelx]);
+    pixel* tmpL = m_tmpL1[plane];
+    pixel* tmpU = &(m_tmpU[plane][lpelx]);

     int8_t* offsetEo = m_offsetEo[plane];

@@ -324,14 +312,14 @@
     case SAO_EO_0: // dir: -
     {
         pixel firstPxl = 0, lastPxl = 0, row1FirstPxl = 0, row1LastPxl = 0;
-        startX = !lpelx;
-        endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
+        int startX = !lpelx;
+        int endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
         if (ctuWidth & 15)
         {
-            for (y = 0; y < ctuHeight; y++)
+            for (int y = 0; y < ctuHeight; y++, rec += stride)
             {
                 int signLeft = signOf(rec[startX] - tmpL[y]);
-                for (x = startX; x < endX; x++)
+                for (int x = startX; x < endX; x++)
                 {
                     int signRight = signOf(rec[x] - rec[x + 1]);
                     int edgeType = signRight + signLeft + 2;
@@ -339,13 +327,11 @@
                     rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]];
                 }
-
-                rec += stride;
             }
         }
         else
         {
-            for (y = 0; y < ctuHeight; y += 2)
+            for (int y = 0; y < ctuHeight; y += 2, rec += 2 * stride)
             {
                 signLeft1[0] = signOf(rec[startX] - tmpL[y]);
                 signLeft1[1] = signOf(rec[stride + startX] - tmpL[y + 1]);
@@ -375,27 +361,25 @@
                     rec[ctuWidth - 1] = lastPxl;
                     rec[stride + ctuWidth - 1] = row1LastPxl;
                 }
-
-                rec += 2 * stride;
             }
         }
         break;
     }
     case SAO_EO_1: // dir: |
     {
-        startY = !tpely;
-        endY   = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight;
+        int startY = !tpely;
+        int endY   = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight;
         if (!tpely)
             rec += stride;

         if (ctuWidth & 15)
         {
-            for (x = 0; x < ctuWidth; x++)
+            for (int x = 0; x < ctuWidth; x++)
                 upBuff1[x] = signOf(rec[x] - tmpU[x]);

-            for (y = startY; y < endY; y++)
+            for (int y = startY; y < endY; y++, rec += stride)
             {
-                for (x = 0; x < ctuWidth; x++)
+                for (int x = 0; x < ctuWidth; x++)
                 {
                     int8_t signDown = signOf(rec[x] - rec[x + stride]);
                     int edgeType = signDown + upBuff1[x] + 2;
@@ -403,8 +387,6 @@
                     rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]];
                 }
-
-                rec += stride;
             }
         }
         else
@@ -412,11 +394,9 @@
             primitives.sign(upBuff1, rec, tmpU, ctuWidth);

             int diff = (endY - startY) % 2;
-            for (y = startY; y < endY - diff; y += 2)
-            {
+            for (int y = startY; y < endY - diff; y += 2, rec += 2 * stride)
                 primitives.saoCuOrgE1_2Rows(rec, upBuff1, offsetEo, stride, ctuWidth);
-                rec += 2 * stride;
-            }
+
             if (diff & 1)
                 primitives.saoCuOrgE1(rec, upBuff1, offsetEo, stride, ctuWidth);
         }
@@ -425,11 +405,11 @@
     }
     case SAO_EO_2: // dir: 135
     {
-        startX = !lpelx;
-        endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
+        int startX = !lpelx;
+        int endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;

-        startY = !tpely;
-        endY   = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight;
+        int startY = !tpely;
+        int endY   = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight;

         if (!tpely)
             rec += stride;
@@ -454,16 +434,16 @@
         }
         else
         {
-            for (x = startX; x < endX; x++)
+            for (int x = startX; x < endX; x++)
                 upBuff1[x] = signOf(rec[x] - tmpU[x - 1]);
         }

         if (ctuWidth & 15)
         {
-            for (y = startY; y < endY; y++)
+            for (int y = startY; y < endY; y++, rec += stride)
             {
                 upBufft[startX] = signOf(rec[stride + startX] - tmpL[y]);
-                for (x = startX; x < endX; x++)
+                for (int x = startX; x < endX; x++)
                 {
                     int8_t signDown = signOf(rec[x] - rec[x + stride + 1]);
                     int edgeType = signDown + upBuff1[x] + 2;
@@ -472,13 +452,11 @@
                 }

                 std::swap(upBuff1, upBufft);
-
-                rec += stride;
             }
         }
         else
         {
-            for (y = startY; y < endY; y++)
+            for (int y = startY; y < endY; y++, rec += stride)
             {
                 int8_t iSignDown2 = signOf(rec[stride + startX] - tmpL[y]);

@@ -487,30 +465,29 @@
                 upBufft[startX] = iSignDown2;

                 std::swap(upBuff1, upBufft);
-                rec += stride;
             }
         }
         break;
     }
     case SAO_EO_3: // dir: 45
     {
-        startX = !lpelx;
-        endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
+        int startX = !lpelx;
+        int endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;

-        startY = !tpely;
-        endY   = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight;
+        int startY = !tpely;
+        int endY   = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight;

         if (!tpely)
             rec += stride;

         if (ctuWidth & 15)
         {
-            for (x = startX - 1; x < endX; x++)
+            for (int x = startX - 1; x < endX; x++)
                 upBuff1[x] = signOf(rec[x] - tmpU[x + 1]);

-            for (y = startY; y < endY; y++)
+            for (int y = startY; y < endY; y++, rec += stride)
             {
-                x = startX;
+                int x = startX;
                 int8_t signDown = signOf(rec[x] - tmpL[y + 1]);
                 int edgeType = signDown + upBuff1[x] + 2;
                 upBuff1[x - 1] = -signDown;
@@ -525,8 +502,6 @@
                 }

                 upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]);
-
-                rec += stride;
             }
         }
         else
@@ -545,9 +520,9 @@
             if (rpelx == picWidth)
                 upBuff1[ctuWidth - 1] = lastSign;

-            for (y = startY; y < endY; y++)
+            for (int y = startY; y < endY; y++, rec += stride)
             {
-                x = startX;
+                int x = startX;
                 int8_t signDown = signOf(rec[x] - tmpL[y + 1]);
                 int edgeType = signDown + upBuff1[x] + 2;
                 upBuff1[x - 1] = -signDown;
@@ -556,8 +531,6 @@
                 primitives.saoCuOrgE3[endX > 16](rec, upBuff1, offsetEo, stride - 1, startX, endX);

                 upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]);
-
-                rec += stride;
             }
         }

@@ -571,24 +544,14 @@
     {
 #define SAO_BO_BITS 5
         const int boShift = X265_DEPTH - SAO_BO_BITS;
-        for (y = 0; y < ctuHeight; y++)
-        {
-            for (x = 0; x < ctuWidth; x++)
-            {
-                int val = rec[x] + offsetBo[rec[x] >> boShift];
-                if (val < 0)
-                    val = 0;
-                else if (val > ((1 << X265_DEPTH) - 1))
-                    val = ((1 << X265_DEPTH) - 1);
-                rec[x] = (pixel)val;
-            }
-            rec += stride;
-        }
+
+        for (int y = 0; y < ctuHeight; y++, rec += stride)
+            for (int x = 0; x < ctuWidth; x++)
+                rec[x] = x265_clip(rec[x] + offsetBo[rec[x] >> boShift]);
     }
     else
-    {
         primitives.saoCuOrgB0(rec, offsetBo, ctuWidth, ctuHeight, stride);
-    }
+
     break;
 }
 default: break;
@@ -596,7 +559,7 @@
 }

 /* Process SAO unit */
-void SAO::processSaoUnitCuLuma(SaoCtuParam* ctuParam, int idxY, int idxX)
+void SAO::generateLumaOffsets(SaoCtuParam* ctuParam, int idxY, int idxX)
 {
     PicYuv* reconPic = m_frame->m_reconPic;
     intptr_t stride = reconPic->m_stride;
@@ -637,7 +600,7 @@
             memset(m_offsetBo[0], 0, sizeof(m_offsetBo[0]));

             for (int i = 0; i < SAO_NUM_OFFSET; i++)
-                m_offsetBo[0][((ctuParam[addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[addr].offset[i] << SAO_BIT_INC);
+                m_offsetBo[0][((ctuParam[addr].bandPos + i) & (MAX_NUM_SAO_CLASS - 1))] = (int8_t)(ctuParam[addr].offset[i] << SAO_BIT_INC);
         }
         else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
         {
@@ -650,13 +613,13 @@
                 m_offsetEo[0][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
             }
         }
-        processSaoCu(addr, typeIdx, 0);
+        applyPixelOffsets(addr, typeIdx, 0);
     }
     std::swap(m_tmpL1[0], m_tmpL2[0]);
 }

 /* Process SAO unit (Chroma only) */
-void SAO::processSaoUnitCuChroma(SaoCtuParam* ctuParam[3], int idxY, int idxX)
+void SAO::generateChromaOffsets(SaoCtuParam* ctuParam[3], int idxY, int idxX)
 {
     PicYuv* reconPic = m_frame->m_reconPic;
     intptr_t stride = reconPic->m_strideC;
@@ -712,7 +675,7 @@
             memset(m_offsetBo[1], 0, sizeof(m_offsetBo[0]));

             for (int i = 0; i < SAO_NUM_OFFSET; i++)
-                m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << SAO_BIT_INC);
+                m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & (MAX_NUM_SAO_CLASS - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << SAO_BIT_INC);
         }
         else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
         {
@@ -725,7 +688,7 @@
                 m_offsetEo[1][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
             }
         }
-        processSaoCu(addr, typeIdxCb, 1);
+        applyPixelOffsets(addr, typeIdxCb, 1);
     }

     // Process V
@@ -738,7 +701,7 @@
             memset(m_offsetBo[2], 0, sizeof(m_offsetBo[0]));

             for (int i = 0; i < SAO_NUM_OFFSET; i++)
-                m_offsetBo[2][((ctuParam[2][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[2][addr].offset[i] << SAO_BIT_INC);
+                m_offsetBo[2][((ctuParam[2][addr].bandPos + i) & (MAX_NUM_SAO_CLASS - 1))] = (int8_t)(ctuParam[2][addr].offset[i] << SAO_BIT_INC);
         }
         else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3)
         {
@@ -751,25 +714,15 @@
                 m_offsetEo[2][edgeType] = (int8_t)offset[s_eoTable[edgeType]];
             }
         }
-        processSaoCu(addr, typeIdxCb, 2);
+        applyPixelOffsets(addr, typeIdxCb, 2);
     }

     std::swap(m_tmpL1[1], m_tmpL2[1]);
     std::swap(m_tmpL1[2], m_tmpL2[2]);
 }

-void SAO::copySaoUnit(SaoCtuParam* saoUnitDst, const SaoCtuParam* saoUnitSrc)
-{
-    saoUnitDst->mergeMode = saoUnitSrc->mergeMode;
-    saoUnitDst->typeIdx   = saoUnitSrc->typeIdx;
-    saoUnitDst->bandPos   = saoUnitSrc->bandPos;
-
-    for (int i = 0; i < SAO_NUM_OFFSET; i++)
-        saoUnitDst->offset[i] = saoUnitSrc->offset[i];
-}
-
 /* Calculate SAO statistics for current CTU without non-crossing slice */
-void SAO::calcSaoStatsCu(int addr, int plane)
+void SAO::calcSaoStatsCTU(int addr, int plane)
 {
     const PicYuv* reconPic = m_frame->m_reconPic;
     const CUData* cu = m_frame->m_encData->getPicCTU(addr);
@@ -982,7 +935,7 @@
         memset(m_offsetOrgPreDblk[addr], 0, sizeof(PerPlane));

     int plane_offset = 0;
-    for (int plane = 0; plane < (frame->m_param->internalCsp != X265_CSP_I400 ? NUM_PLANE : 1); plane++)
+    for (int plane = 0; plane < (frame->m_param->internalCsp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400? NUM_PLANE : 1); plane++)
     {
         if (plane == 1)
         {
@@ -1017,7 +970,7 @@
             {
                 for (x = (y < startY ? startX : 0); x < ctuWidth; x++)
                 {
-                    int classIdx = 1 + (rec[x] >> boShift);
+                    int classIdx = rec[x] >> boShift;
                     stats[classIdx] += (fenc[x] - rec[x]);
                     count[classIdx]++;
                 }
             }
@@ -1233,137 +1186,31 @@
     m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[1] / ((double)numctus);
 }

-void SAO::rdoSaoUnitRow(SAOParam* saoParam, int idxY)
+void SAO::rdoSaoUnitCu(SAOParam* saoParam, int rowBaseAddr, int idxX, int addr)
 {
-    SaoCtuParam mergeSaoParam[NUM_MERGE_MODE][2];
-    double mergeDist[NUM_MERGE_MODE];
-    bool allowMerge[2]; // left, up
-    allowMerge[1] = (idxY > 0);
-
-    for (int idxX = 0; idxX < m_numCuInWidth; idxX++)
-    {
-        int addr = idxX + idxY * m_numCuInWidth;
-        int addrUp = idxY ? addr - m_numCuInWidth : -1;
-        int addrLeft = idxX ? addr - 1 : -1;
-        allowMerge[0] = (idxX > 0);
-
-        m_entropyCoder.load(m_rdContexts.cur);
-        if (allowMerge[0])
-            m_entropyCoder.codeSaoMerge(0);
-        if (allowMerge[1])
-            m_entropyCoder.codeSaoMerge(0);
-        m_entropyCoder.store(m_rdContexts.temp);
-
-        // reset stats Y, Cb, Cr
-        X265_CHECK(sizeof(PerPlane) == (sizeof(int32_t) * (NUM_PLANE * MAX_NUM_SAO_TYPE * MAX_NUM_SAO_CLASS)), "Found Padding space in struct PerPlane");
-
-        // TODO: Confirm the address space is continuous
-        if (m_param->bSaoNonDeblocked)
-        {
-            memcpy(m_count, m_countPreDblk[addr], sizeof(m_count));
-            memcpy(m_offsetOrg, m_offsetOrgPreDblk[addr], sizeof(m_offsetOrg));
-        }
-        else
-        {
-            memset(m_count, 0, sizeof(m_count));
-            memset(m_offsetOrg, 0, sizeof(m_offsetOrg));
-        }
-
-        saoParam->ctuParam[0][addr].reset();
-        saoParam->ctuParam[1][addr].reset();
-        saoParam->ctuParam[2][addr].reset();
-
-        if (saoParam->bSaoFlag[0])
-            calcSaoStatsCu(addr, 0);
-
-        if (saoParam->bSaoFlag[1])
-        {
-            calcSaoStatsCu(addr, 1);
-            calcSaoStatsCu(addr, 2);
-        }
-
-        saoComponentParamDist(saoParam, addr, addrUp, addrLeft, &mergeSaoParam[0][0], mergeDist);
-        if (m_chromaFormat != X265_CSP_I400)
-            sao2ChromaParamDist(saoParam, addr, addrUp, addrLeft, mergeSaoParam, mergeDist);
-
-        if (saoParam->bSaoFlag[0] || saoParam->bSaoFlag[1])
-        {
-            // Cost of new SAO_params
-            m_entropyCoder.load(m_rdContexts.cur);
-            m_entropyCoder.resetBits();
-            if (allowMerge[0])
-                m_entropyCoder.codeSaoMerge(0);
-            if (allowMerge[1])
-                m_entropyCoder.codeSaoMerge(0);
-            for (int plane = 0; plane < 3; plane++)
-            {
-                if (saoParam->bSaoFlag[plane > 0])
-                    m_entropyCoder.codeSaoOffset(saoParam->ctuParam[plane][addr], plane);
-            }
-
-            uint32_t rate = m_entropyCoder.getNumberOfWrittenBits();
-            double bestCost = mergeDist[0] + (double)rate;
-            m_entropyCoder.store(m_rdContexts.temp);
+    Slice* slice = m_frame->m_encData->m_slice;
+//    int qp = slice->m_sliceQp;
+    const CUData* cu = m_frame->m_encData->getPicCTU(addr);
+    int qp = cu->m_qp[0];

-            // Cost of Merge
-            for (int mergeIdx = 0; mergeIdx < 2; ++mergeIdx)
-            {
-                if (!allowMerge[mergeIdx])
-                    continue;
-
-                m_entropyCoder.load(m_rdContexts.cur);
-                m_entropyCoder.resetBits();
-                if (allowMerge[0])
-                    m_entropyCoder.codeSaoMerge(1 - mergeIdx);
-                if (allowMerge[1] && (mergeIdx == 1))
-                    m_entropyCoder.codeSaoMerge(1);
-
-                rate = m_entropyCoder.getNumberOfWrittenBits();
-                double mergeCost = mergeDist[mergeIdx + 1] + (double)rate;
-                if (mergeCost < bestCost)
-                {
-                    SaoMergeMode mergeMode = mergeIdx ? SAO_MERGE_UP : SAO_MERGE_LEFT;
-                    bestCost = mergeCost;
-                    m_entropyCoder.store(m_rdContexts.temp);
-                    for (int plane = 0; plane < 3; plane++)
-                    {
-                        mergeSaoParam[plane][mergeIdx].mergeMode = mergeMode;
-                        if (saoParam->bSaoFlag[plane > 0])
-                            copySaoUnit(&saoParam->ctuParam[plane][addr], &mergeSaoParam[plane][mergeIdx]);
-                    }
-                }
-            }
+    int64_t lambda[2] = { 0 };

-            if (saoParam->ctuParam[0][addr].typeIdx < 0)
-                m_numNoSao[0]++;
+    int qpCb = qp;
+    if (m_param->internalCsp == X265_CSP_I420)
+        qpCb = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice->m_pps->chromaQpOffset[0]]);
+    else
+        qpCb = X265_MIN(qp + slice->m_pps->chromaQpOffset[0], QP_MAX_SPEC);

-            if (m_chromaFormat != X265_CSP_I400 && saoParam->ctuParam[1][addr].typeIdx < 0)
-                m_numNoSao[1]++;
+    lambda[0] = (int64_t)floor(256.0 * x265_lambda2_tab[qp]);
+    lambda[1] = (int64_t)floor(256.0 * x265_lambda2_tab[qpCb]); // Use Cb QP for SAO chroma

-            m_entropyCoder.load(m_rdContexts.temp);
-            m_entropyCoder.store(m_rdContexts.cur);
-        }
-    }
-}
+    const bool allowMerge[2] = {(idxX != 0), (rowBaseAddr != 0)}; // left, up

-void SAO::rdoSaoUnitCu(SAOParam* saoParam, int rowBaseAddr, int idxX, int addr)
-{
-    SaoCtuParam mergeSaoParam[NUM_MERGE_MODE][2];
-    double mergeDist[NUM_MERGE_MODE];
-    const bool allowMerge[2] = {(idxX != 0), (rowBaseAddr != 0)}; // left, up
-    const int addrUp = rowBaseAddr ? addr - m_numCuInWidth : -1;
-    const int addrLeft = idxX ? addr - 1 : -1;
+    const int addrMerge[2] = {(idxX ? addr - 1 : -1), (rowBaseAddr ? addr - m_numCuInWidth : -1)};// left, up

-    bool chroma = m_param->internalCsp != X265_CSP_I400;
+    bool chroma = m_param->internalCsp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400;
     int planes = chroma ? 3 : 1;

-    m_entropyCoder.load(m_rdContexts.cur);
-    if (allowMerge[0])
-        m_entropyCoder.codeSaoMerge(0);
-    if (allowMerge[1])
-        m_entropyCoder.codeSaoMerge(0);
-    m_entropyCoder.store(m_rdContexts.temp);
-
     // reset stats Y, Cb, Cr
     X265_CHECK(sizeof(PerPlane) == (sizeof(int32_t) * (NUM_PLANE * MAX_NUM_SAO_TYPE * MAX_NUM_SAO_CLASS)), "Found Padding space in struct PerPlane");

@@ -1383,43 +1230,59 @@
         saoParam->ctuParam[i][addr].reset();

     if (saoParam->bSaoFlag[0])
-        calcSaoStatsCu(addr, 0);
+        calcSaoStatsCTU(addr, 0);

     if (saoParam->bSaoFlag[1])
     {
-        calcSaoStatsCu(addr, 1);
-        calcSaoStatsCu(addr, 2);
+        calcSaoStatsCTU(addr, 1);
+        calcSaoStatsCTU(addr, 2);
     }

-    saoComponentParamDist(saoParam, addr, addrUp, addrLeft, &mergeSaoParam[0][0], mergeDist);
+    saoStatsInitialOffset(planes);
+
+    // SAO distortion calculation
+    m_entropyCoder.load(m_rdContexts.cur);
+    m_entropyCoder.resetBits();
+    if (allowMerge[0])
+        m_entropyCoder.codeSaoMerge(0);
+    if (allowMerge[1])
+        m_entropyCoder.codeSaoMerge(0);
+    m_entropyCoder.store(m_rdContexts.temp);
+
+    // Estimate distortion and cost of new SAO params
+    int64_t bestCost = 0;
+    int64_t rateDist = 0;
+    // Estimate distortion and cost of new SAO params
+    saoLumaComponentParamDist(saoParam, addr, rateDist, lambda, bestCost);
     if (chroma)
-        sao2ChromaParamDist(saoParam, addr, addrUp, addrLeft, mergeSaoParam, mergeDist);
+        saoChromaComponentParamDist(saoParam, addr, rateDist, lambda, bestCost);

     if (saoParam->bSaoFlag[0] || saoParam->bSaoFlag[1])
     {
-        // Cost of new SAO_params
-        m_entropyCoder.load(m_rdContexts.cur);
-        m_entropyCoder.resetBits();
-        if (allowMerge[0])
-            m_entropyCoder.codeSaoMerge(0);
-        if (allowMerge[1])
-            m_entropyCoder.codeSaoMerge(0);
-        for (int plane = 0; plane < planes; plane++)
-        {
-            if (saoParam->bSaoFlag[plane > 0])
-                m_entropyCoder.codeSaoOffset(saoParam->ctuParam[plane][addr], plane);
-        }
-
-        uint32_t rate = m_entropyCoder.getNumberOfWrittenBits();
-        double bestCost = mergeDist[0] + (double)rate;
-        m_entropyCoder.store(m_rdContexts.temp);
-
-        // Cost of Merge
+        // Cost of merge left or Up
         for (int mergeIdx = 0; mergeIdx < 2; ++mergeIdx)
         {
             if (!allowMerge[mergeIdx])
                 continue;

+            int64_t mergeDist = 0;
+            for (int plane = 0; plane < planes; plane++)
+            {
+                int64_t estDist = 0;
+                SaoCtuParam* mergeSrcParam = &(saoParam->ctuParam[plane][addrMerge[mergeIdx]]);
+                int typeIdx = mergeSrcParam->typeIdx;
+                if (typeIdx >= 0)
+                {
+                    int bandPos = (typeIdx == SAO_BO) ? mergeSrcParam->bandPos : 1;
+                    for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++)
+                    {
+                        int mergeOffset = mergeSrcParam->offset[classIdx];
+                        estDist += estSaoDist(m_count[plane][typeIdx][classIdx + bandPos], mergeOffset, m_offsetOrg[plane][typeIdx][classIdx + bandPos]);
+                    }
+                }
+                mergeDist += (estDist << 8) / lambda[!!plane];
+            }
+
             m_entropyCoder.load(m_rdContexts.cur);
             m_entropyCoder.resetBits();
             if (allowMerge[0])
@@ -1427,8 +1290,8 @@
             if (allowMerge[1] && (mergeIdx == 1))
                 m_entropyCoder.codeSaoMerge(1);

-            rate = m_entropyCoder.getNumberOfWrittenBits();
-            double mergeCost = mergeDist[mergeIdx + 1] + (double)rate;
+            uint32_t estRate = m_entropyCoder.getNumberOfWrittenBits();
+            int64_t mergeCost = mergeDist + estRate;
             if (mergeCost < bestCost)
             {
                 SaoMergeMode mergeMode = mergeIdx ? SAO_MERGE_UP : SAO_MERGE_LEFT;
@@ -1436,9 +1299,17 @@
                 m_entropyCoder.store(m_rdContexts.temp);
                 for (int plane = 0; plane < planes; plane++)
                 {
-                    mergeSaoParam[plane][mergeIdx].mergeMode = mergeMode;
                     if (saoParam->bSaoFlag[plane > 0])
-                        copySaoUnit(&saoParam->ctuParam[plane][addr], &mergeSaoParam[plane][mergeIdx]);
+                    {
+                        SaoCtuParam* dstCtuParam = &saoParam->ctuParam[plane][addr];
+                        SaoCtuParam* mergeSrcParam = &(saoParam->ctuParam[plane][addrMerge[mergeIdx]]);
+                        dstCtuParam->mergeMode = mergeMode;
+                        dstCtuParam->typeIdx = mergeSrcParam->typeIdx;
+                        dstCtuParam->bandPos = mergeSrcParam->bandPos;
+
+                        for (int i = 0; i < SAO_NUM_OFFSET; i++)
+                            dstCtuParam->offset[i] = mergeSrcParam->offset[i];
+                    }
                 }
             }
         }
@@ -1452,309 +1323,371 @@
     }
 }

-/** rate distortion optimization of SAO unit */
-inline int64_t SAO::estSaoTypeDist(int plane, int typeIdx, double lambda, int32_t* currentDistortionTableBo, double* currentRdCostTableBo)
+
+// Rounds the division of initial offsets by the number of samples in
+// each of the statistics table entries.
+void SAO::saoStatsInitialOffset(int planes)
 {
-    int64_t estDist = 0;
+    memset(m_offset, 0, sizeof(m_offset));

-    for (int classIdx = 1; classIdx < ((typeIdx < SAO_BO) ? SAO_EO_LEN + 1 : SAO_NUM_BO_CLASSES + 1); classIdx++)
+    // EO
+    for (int plane = 0; plane < planes; plane++)
     {
-        int32_t count = m_count[plane][typeIdx][classIdx];
-        int32_t& offsetOrg = m_offsetOrg[plane][typeIdx][classIdx];
-        int32_t& offsetOut = m_offset[plane][typeIdx][classIdx];
-
-        if (typeIdx == SAO_BO)
+        for (int typeIdx = 0; typeIdx < MAX_NUM_SAO_TYPE - 1; typeIdx++)
         {
-            currentDistortionTableBo[classIdx - 1] = 0;
-            currentRdCostTableBo[classIdx - 1] = lambda;
-        }
-        if (count)
-        {
-            int offset = roundIBDI(offsetOrg << (X265_DEPTH - 8), count);
-            offset = x265_clip3(-OFFSET_THRESH + 1, OFFSET_THRESH - 1, offset);
-            if (typeIdx < SAO_BO)
+            for (int classIdx = 1; classIdx < SAO_NUM_OFFSET + 1; classIdx++)
             {
-                if (classIdx < 3)
-                    offset = X265_MAX(offset, 0);
-                else
-                    offset = X265_MIN(offset, 0);
+                int32_t& count = m_count[plane][typeIdx][classIdx];
+                int32_t& offsetOrg = m_offsetOrg[plane][typeIdx][classIdx];
+                int32_t& offsetOut = m_offset[plane][typeIdx][classIdx];
+
+                if (count)
+                {
+                    offsetOut = roundIBDI(offsetOrg, count << SAO_BIT_INC);
+                    offsetOut = x265_clip3(-OFFSET_THRESH + 1, OFFSET_THRESH - 1, offsetOut);
+
+                    if (classIdx < 3)
+                        offsetOut = X265_MAX(offsetOut, 0);
+                    else
+                        offsetOut = X265_MIN(offsetOut, 0);
+                }
             }
-            offsetOut = estIterOffset(typeIdx, classIdx, lambda, offset, count, offsetOrg, currentDistortionTableBo, currentRdCostTableBo);
         }
-        else
+    }
+
+    // BO
+    for (int plane = 0; plane < planes; plane++)
+    {
+        for (int classIdx = 0; classIdx < MAX_NUM_SAO_CLASS; classIdx++)
         {
-            offsetOrg = 0;
-            offsetOut = 0;
+            int32_t& count = m_count[plane][SAO_BO][classIdx];
+            int32_t& offsetOrg = m_offsetOrg[plane][SAO_BO][classIdx];
+            int32_t& offsetOut = m_offset[plane][SAO_BO][classIdx];
+
+            if (count)
+            {
+                offsetOut = roundIBDI(offsetOrg, count << SAO_BIT_INC);
+                offsetOut = x265_clip3(-OFFSET_THRESH + 1, OFFSET_THRESH - 1, offsetOut);
+            }
         }
-        if (typeIdx != SAO_BO)
-            estDist += estSaoDist(count, (int)offsetOut << SAO_BIT_INC, offsetOrg);
     }
+}

-    return estDist;
+inline int64_t SAO::calcSaoRdoCost(int64_t distortion, uint32_t bits, int64_t lambda)
+{
+#if X265_DEPTH < 10
+    X265_CHECK(bits <= (INT64_MAX - 128) / lambda,
+               "calcRdCost wrap detected dist: " X265_LL ", bits %u, lambda: " X265_LL "\n",
+               distortion, bits, lambda);
+#else
+    X265_CHECK(bits <= (INT64_MAX - 128) / lambda,
+               "calcRdCost wrap detected dist: " X265_LL ", bits %u, lambda: " X265_LL "\n",
+               distortion, bits, lambda);
+#endif
+    return distortion + ((bits * lambda + 128) >> 8);
 }

-inline int SAO::estIterOffset(int typeIdx, int classIdx, double lambda, int offset, int32_t count, int32_t offsetOrg, int32_t* currentDistortionTableBo, double* currentRdCostTableBo)
+void SAO::estIterOffset(int typeIdx, int64_t lambda, int32_t count, int32_t offsetOrg, int32_t& offset, int32_t& distClasses, int64_t& costClasses)
 {
-    int offsetOut = 0;
+    int bestOffset = 0;
+    distClasses = 0;

-    // Assuming sending quantized value 0 results in zero offset and sending the value zero needs 1 bit. entropy coder can be used to measure the exact rate here.
-    double tempMinCost = lambda;
+    // Assuming sending quantized value 0 results in zero offset and sending the value zero needs 1 bit.
+    // entropy coder can be used to measure the exact rate here.
+    int64_t bestCost = calcSaoRdoCost(0, 1, lambda);
     while (offset != 0)
     {
         // Calculate the bits required for signalling the offset
-        int tempRate = (typeIdx == SAO_BO) ? (abs(offset) + 2) : (abs(offset) + 1);
+        uint32_t rate = (typeIdx == SAO_BO) ? (abs(offset) + 2) : (abs(offset) + 1);
         if (abs(offset) == OFFSET_THRESH - 1)
-            tempRate--;
+            rate--;

         // Do the dequntization before distorion calculation
-        int tempOffset = offset << SAO_BIT_INC;
-        int64_t tempDist = estSaoDist(count, tempOffset, offsetOrg);
-        double tempCost = ((double)tempDist + lambda * (double)tempRate);
-        if (tempCost < tempMinCost)
+        int64_t dist = estSaoDist(count, offset << SAO_BIT_INC, offsetOrg);
+        int64_t cost = calcSaoRdoCost(dist, rate, lambda);
+        if (cost < bestCost)
         {
-            tempMinCost = tempCost;
-            offsetOut = offset;
-            if (typeIdx == SAO_BO)
-            {
-                currentDistortionTableBo[classIdx - 1] = (int)tempDist;
-                currentRdCostTableBo[classIdx - 1] = tempCost;
-            }
+            bestCost = cost;
+            bestOffset = offset;
+            distClasses = (int)dist;
         }
         offset = (offset > 0) ? (offset - 1) : (offset + 1);
     }

-    return offsetOut;
+    costClasses = bestCost;
+    offset = bestOffset;
 }

-void SAO::saoComponentParamDist(SAOParam* saoParam, int addr, int addrUp, int addrLeft, SaoCtuParam* mergeSaoParam, double* mergeDist)
+void SAO::saoLumaComponentParamDist(SAOParam* saoParam, int32_t addr, int64_t& rateDist, int64_t* lambda, int64_t &bestCost)
 {
     int64_t bestDist = 0;
+    int bestTypeIdx = -1;

     SaoCtuParam* lclCtuParam = &saoParam->ctuParam[0][addr];

-    double bestRDCostTableBo = MAX_DOUBLE;
-    int    bestClassTableBo = 0;
-    int    currentDistortionTableBo[MAX_NUM_SAO_CLASS];
-    double currentRdCostTableBo[MAX_NUM_SAO_CLASS];
+    int32_t distClasses[MAX_NUM_SAO_CLASS];
+    int64_t costClasses[MAX_NUM_SAO_CLASS];
+
+    // RDO SAO_NA
     m_entropyCoder.load(m_rdContexts.temp);
     m_entropyCoder.resetBits();
-    m_entropyCoder.codeSaoOffset(*lclCtuParam, 0);
-    double dCostPartBest = m_entropyCoder.getNumberOfWrittenBits() * m_lumaLambda;
+    m_entropyCoder.codeSaoType(0);

-    for (int typeIdx = 0; typeIdx < MAX_NUM_SAO_TYPE; typeIdx++)
-    {
-        int64_t estDist = estSaoTypeDist(0, typeIdx, m_lumaLambda, currentDistortionTableBo, currentRdCostTableBo);
+    int64_t costPartBest = calcSaoRdoCost(0, m_entropyCoder.getNumberOfWrittenBits(), lambda[0]);

-        if (typeIdx == SAO_BO)
+    //EO distortion calculation
+    for (int typeIdx = 0; typeIdx < MAX_NUM_SAO_TYPE - 1; typeIdx++)
+    {
+        int64_t estDist = 0;
+        for (int classIdx = 1; classIdx < SAO_NUM_OFFSET + 1; classIdx++)
         {
-            // Estimate Best Position
-            for (int i = 0; i < SAO_NUM_BO_CLASSES - SAO_BO_LEN + 1; i++)
-            {
-                double currentRDCost = 0.0;
-                for (int j = i; j < i + SAO_BO_LEN; j++)
-                    currentRDCost += currentRdCostTableBo[j];
+            int32_t& count = m_count[0][typeIdx][classIdx];
+            int32_t& offsetOrg = m_offsetOrg[0][typeIdx][classIdx];
+            int32_t& offsetOut = m_offset[0][typeIdx][classIdx];

-                if (currentRDCost < bestRDCostTableBo)
-                {
-                    bestRDCostTableBo = currentRDCost;
-                    bestClassTableBo = i;
-                }
-            }
+            estIterOffset(typeIdx, lambda[0], count, offsetOrg, offsetOut, distClasses[classIdx], costClasses[classIdx]);

-            // Re code all Offsets
-            // Code Center
-            estDist = 0;
-            for (int classIdx = bestClassTableBo; classIdx < bestClassTableBo + SAO_BO_LEN; classIdx++)
-                estDist += currentDistortionTableBo[classIdx];
+            //Calculate distortion
+            estDist += distClasses[classIdx];
         }

-        SaoCtuParam ctuParamRdo;
-        ctuParamRdo.mergeMode = SAO_MERGE_NONE;
-        ctuParamRdo.typeIdx = typeIdx;
-        ctuParamRdo.bandPos = (typeIdx == SAO_BO) ? bestClassTableBo : 0;
-        for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++)
-            ctuParamRdo.offset[classIdx] = (int)m_offset[0][typeIdx][classIdx + ctuParamRdo.bandPos + 1];

         m_entropyCoder.load(m_rdContexts.temp);
         m_entropyCoder.resetBits();
-        m_entropyCoder.codeSaoOffset(ctuParamRdo, 0);
+        m_entropyCoder.codeSaoOffsetEO(m_offset[0][typeIdx] + 1, typeIdx, 0);

-        uint32_t estRate = m_entropyCoder.getNumberOfWrittenBits();
-        double cost = (double)estDist + m_lumaLambda * (double)estRate;
+        int64_t cost = calcSaoRdoCost(estDist, m_entropyCoder.getNumberOfWrittenBits(), lambda[0]);

-        if (cost < dCostPartBest)
+        if (cost < costPartBest)
         {
-            dCostPartBest = cost;
-            copySaoUnit(lclCtuParam, &ctuParamRdo);
+            costPartBest = cost;
             bestDist = estDist;
+            bestTypeIdx = typeIdx;
         }
     }

-    mergeDist[0] = ((double)bestDist / m_lumaLambda);
-    m_entropyCoder.load(m_rdContexts.temp);
-    m_entropyCoder.codeSaoOffset(*lclCtuParam, 0);
-    m_entropyCoder.store(m_rdContexts.temp);
+    if (bestTypeIdx != -1)
+    {
+        lclCtuParam->mergeMode = SAO_MERGE_NONE;
+        lclCtuParam->typeIdx = bestTypeIdx;
+        lclCtuParam->bandPos = 0;
+        for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++)
+            lclCtuParam->offset[classIdx] = m_offset[0][bestTypeIdx][classIdx + 1];
+    }

-    // merge left or merge up
+    //BO RDO
+    int64_t estDist = 0;
+    for (int classIdx = 0; classIdx < MAX_NUM_SAO_CLASS; classIdx++)
-    for (int mergeIdx = 0; mergeIdx < 2; mergeIdx++)
     {
-        SaoCtuParam* mergeSrcParam = NULL;
-        if (addrLeft >= 0 && mergeIdx == 0)
-            mergeSrcParam = &(saoParam->ctuParam[0][addrLeft]);
-        else if (addrUp >= 0 && mergeIdx == 1)
-            mergeSrcParam = &(saoParam->ctuParam[0][addrUp]);
-        if (mergeSrcParam)
-        {
-            int64_t estDist = 0;
-            int typeIdx = mergeSrcParam->typeIdx;
-            if (typeIdx >= 0)
-            {
-                int bandPos = (typeIdx == SAO_BO) ? mergeSrcParam->bandPos : 0;
-                for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++)
-                {
-                    int mergeOffset = mergeSrcParam->offset[classIdx];
-                    estDist += estSaoDist(m_count[0][typeIdx][classIdx + bandPos + 1], mergeOffset, m_offsetOrg[0][typeIdx][classIdx + bandPos + 1]);
-                }
-            }
+        int32_t& count = m_count[0][SAO_BO][classIdx];
+        int32_t& offsetOrg = m_offsetOrg[0][SAO_BO][classIdx];
+        int32_t& offsetOut = m_offset[0][SAO_BO][classIdx];

-            copySaoUnit(&mergeSaoParam[mergeIdx], mergeSrcParam);
-            mergeSaoParam[mergeIdx].mergeMode = mergeIdx ?
SAO_MERGE_UP : SAO_MERGE_LEFT; + estIterOffset(SAO_BO, lambda[0], count, offsetOrg, offsetOut, distClasses[classIdx], costClasses[classIdx]); + } + + // Estimate Best Position + int64_t bestRDCostBO = MAX_INT64; + int32_t bestClassBO = 0; + + for (int i = 0; i < MAX_NUM_SAO_CLASS - SAO_NUM_OFFSET + 1; i++) + { + int64_t currentRDCost = 0; + for (int j = i; j < i + SAO_NUM_OFFSET; j++) + currentRDCost += costClasses[j]; - mergeDist[mergeIdx + 1] = ((double)estDist / m_lumaLambda); + if (currentRDCost < bestRDCostBO) + { + bestRDCostBO = currentRDCost; + bestClassBO = i; } } + + estDist = 0; + for (int classIdx = bestClassBO; classIdx < bestClassBO + SAO_NUM_OFFSET; classIdx++) + estDist += distClasses[classIdx]; + + m_entropyCoder.load(m_rdContexts.temp); + m_entropyCoder.resetBits(); + m_entropyCoder.codeSaoOffsetBO(m_offset[0][SAO_BO] + bestClassBO, bestClassBO, 0); + + int64_t cost = calcSaoRdoCost(estDist, m_entropyCoder.getNumberOfWrittenBits(), lambda[0]); + + if (cost < costPartBest) + { + costPartBest = cost; + bestDist = estDist; + + lclCtuParam->mergeMode = SAO_MERGE_NONE; + lclCtuParam->typeIdx = SAO_BO; + lclCtuParam->bandPos = bestClassBO; + for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++) + lclCtuParam->offset[classIdx] = m_offset[0][SAO_BO][classIdx + bestClassBO]; + } + + rateDist = (bestDist << 8) / lambda[0]; + m_entropyCoder.load(m_rdContexts.temp); + m_entropyCoder.codeSaoOffset(*lclCtuParam, 0); + m_entropyCoder.store(m_rdContexts.temp); + + if (m_param->internalCsp == X265_CSP_I400) + { + bestCost = rateDist + m_entropyCoder.getNumberOfWrittenBits(); + } } -void SAO::sao2ChromaParamDist(SAOParam* saoParam, int addr, int addrUp, int addrLeft, SaoCtuParam mergeSaoParam[][2], double* mergeDist) +void SAO::saoChromaComponentParamDist(SAOParam* saoParam, int32_t addr, int64_t& rateDist, int64_t* lambda, int64_t &bestCost) { int64_t bestDist = 0; + int bestTypeIdx = -1; SaoCtuParam* lclCtuParam[2] = { &saoParam->ctuParam[1][addr], &saoParam->ctuParam[2][addr] }; - double currentRdCostTableBo[MAX_NUM_SAO_CLASS]; - int bestClassTableBo[2] = { 0, 0 }; - int currentDistortionTableBo[MAX_NUM_SAO_CLASS]; + int64_t costClasses[MAX_NUM_SAO_CLASS]; + int32_t distClasses[MAX_NUM_SAO_CLASS]; + int32_t bestClassBO[2] = { 0, 0 }; m_entropyCoder.load(m_rdContexts.temp); m_entropyCoder.resetBits(); - m_entropyCoder.codeSaoOffset(*lclCtuParam[0], 1); - m_entropyCoder.codeSaoOffset(*lclCtuParam[1], 2); + m_entropyCoder.codeSaoType(0); - double costPartBest = m_entropyCoder.getNumberOfWrittenBits() * m_chromaLambda; + uint32_t bits = m_entropyCoder.getNumberOfWrittenBits(); + int64_t costPartBest = calcSaoRdoCost(0, bits, lambda[1]); - for (int typeIdx = 0; typeIdx < MAX_NUM_SAO_TYPE; typeIdx++) + //EO RDO + for (int typeIdx = 0; typeIdx < MAX_NUM_SAO_TYPE - 1; typeIdx++) { - int64_t estDist[2]; - if (typeIdx == SAO_BO) + int64_t estDist[2] = {0, 0}; + for (int compIdx = 1; compIdx < 3; compIdx++) { - // Estimate Best Position - for (int compIdx = 0; compIdx < 2; compIdx++) + for (int classIdx = 1; classIdx < SAO_NUM_OFFSET + 1; classIdx++) { - double bestRDCostTableBo = MAX_DOUBLE; - estDist[compIdx] = estSaoTypeDist(compIdx + 1, typeIdx, m_chromaLambda, currentDistortionTableBo, currentRdCostTableBo); - for (int i = 0; i < SAO_NUM_BO_CLASSES - SAO_BO_LEN + 1; i++) - { - double currentRDCost = 0.0; - for (int j = i; j < i + SAO_BO_LEN; j++) - currentRDCost += currentRdCostTableBo[j]; + int32_t& count = m_count[compIdx][typeIdx][classIdx]; + int32_t& offsetOrg = 
m_offsetOrg[compIdx][typeIdx][classIdx]; + int32_t& offsetOut = m_offset[compIdx][typeIdx][classIdx]; - if (currentRDCost < bestRDCostTableBo) - { - bestRDCostTableBo = currentRDCost; - bestClassTableBo[compIdx] = i; - } - } + estIterOffset(typeIdx, lambda[1], count, offsetOrg, offsetOut, distClasses[classIdx], costClasses[classIdx]); - // Re code all Offsets - // Code Center - estDist[compIdx] = 0; - for (int classIdx = bestClassTableBo[compIdx]; classIdx < bestClassTableBo[compIdx] + SAO_BO_LEN; classIdx++) - estDist[compIdx] += currentDistortionTableBo[classIdx]; + estDist[compIdx - 1] += distClasses[classIdx]; } } - else - { - estDist[0] = estSaoTypeDist(1, typeIdx, m_chromaLambda, currentDistortionTableBo, currentRdCostTableBo); - estDist[1] = estSaoTypeDist(2, typeIdx, m_chromaLambda, currentDistortionTableBo, currentRdCostTableBo); - } m_entropyCoder.load(m_rdContexts.temp); m_entropyCoder.resetBits(); - SaoCtuParam ctuParamRdo[2]; for (int compIdx = 0; compIdx < 2; compIdx++) - { - ctuParamRdo[compIdx].mergeMode = SAO_MERGE_NONE; - ctuParamRdo[compIdx].typeIdx = typeIdx; - ctuParamRdo[compIdx].bandPos = (typeIdx == SAO_BO) ? bestClassTableBo[compIdx] : 0; - for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++) - ctuParamRdo[compIdx].offset[classIdx] = (int)m_offset[compIdx + 1][typeIdx][classIdx + ctuParamRdo[compIdx].bandPos + 1]; - - m_entropyCoder.codeSaoOffset(ctuParamRdo[compIdx], compIdx + 1); - } + m_entropyCoder.codeSaoOffsetEO(m_offset[compIdx + 1][typeIdx] + 1, typeIdx, compIdx + 1); uint32_t estRate = m_entropyCoder.getNumberOfWrittenBits(); - double cost = (double)(estDist[0] + estDist[1]) + m_chromaLambda * (double)estRate; + int64_t cost = calcSaoRdoCost((estDist[0] + estDist[1]), estRate, lambda[1]); if (cost < costPartBest) { costPartBest = cost; - copySaoUnit(lclCtuParam[0], &ctuParamRdo[0]); - copySaoUnit(lclCtuParam[1], &ctuParamRdo[1]); bestDist = (estDist[0] + estDist[1]); + bestTypeIdx = typeIdx; } } - mergeDist[0] += ((double)bestDist / m_chromaLambda); - m_entropyCoder.load(m_rdContexts.temp); - m_entropyCoder.codeSaoOffset(*lclCtuParam[0], 1); - m_entropyCoder.codeSaoOffset(*lclCtuParam[1], 2); - m_entropyCoder.store(m_rdContexts.temp); - - // merge left or merge up - for (int mergeIdx = 0; mergeIdx < 2; mergeIdx++) + if (bestTypeIdx != -1) { for (int compIdx = 0; compIdx < 2; compIdx++) { - int plane = compIdx + 1; - SaoCtuParam* mergeSrcParam = NULL; - if (addrLeft >= 0 && mergeIdx == 0) - mergeSrcParam = &(saoParam->ctuParam[plane][addrLeft]); - else if (addrUp >= 0 && mergeIdx == 1) - mergeSrcParam = &(saoParam->ctuParam[plane][addrUp]); - if (mergeSrcParam) - { - int64_t estDist = 0; - int typeIdx = mergeSrcParam->typeIdx; - if (typeIdx >= 0) - { - int bandPos = (typeIdx == SAO_BO) ? 
mergeSrcParam->bandPos : 0; - for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++) - { - int mergeOffset = mergeSrcParam->offset[classIdx]; - estDist += estSaoDist(m_count[plane][typeIdx][classIdx + bandPos + 1], mergeOffset, m_offsetOrg[plane][typeIdx][classIdx + bandPos + 1]); - } - } + lclCtuParam[compIdx]->mergeMode = SAO_MERGE_NONE; + lclCtuParam[compIdx]->typeIdx = bestTypeIdx; + lclCtuParam[compIdx]->bandPos = 0; + for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++) + lclCtuParam[compIdx]->offset[classIdx] = m_offset[compIdx + 1][bestTypeIdx][classIdx + 1]; + } + } + + // BO RDO + int64_t estDist[2]; + + // Estimate Best Position + for (int compIdx = 1; compIdx < 3; compIdx++) + { + int64_t bestRDCostBO = MAX_INT64; + + for (int classIdx = 0; classIdx < MAX_NUM_SAO_CLASS; classIdx++) + { + int32_t& count = m_count[compIdx][SAO_BO][classIdx]; + int32_t& offsetOrg = m_offsetOrg[compIdx][SAO_BO][classIdx]; + int32_t& offsetOut = m_offset[compIdx][SAO_BO][classIdx]; + + estIterOffset(SAO_BO, lambda[1], count, offsetOrg, offsetOut, distClasses[classIdx], costClasses[classIdx]); + } + + for (int i = 0; i < MAX_NUM_SAO_CLASS - SAO_NUM_OFFSET + 1; i++) + { + int64_t currentRDCost = 0; + for (int j = i; j < i + SAO_NUM_OFFSET; j++) + currentRDCost += costClasses[j]; - copySaoUnit(&mergeSaoParam[plane][mergeIdx], mergeSrcParam); - mergeSaoParam[plane][mergeIdx].mergeMode = mergeIdx ? SAO_MERGE_UP : SAO_MERGE_LEFT; - mergeDist[mergeIdx + 1] += ((double)estDist / m_chromaLambda); + if (currentRDCost < bestRDCostBO) + { + bestRDCostBO = currentRDCost; + bestClassBO[compIdx - 1] = i; } } + + estDist[compIdx - 1] = 0; + for (int classIdx = bestClassBO[compIdx - 1]; classIdx < bestClassBO[compIdx - 1] + SAO_NUM_OFFSET; classIdx++) + estDist[compIdx - 1] += distClasses[classIdx]; + } + + m_entropyCoder.load(m_rdContexts.temp); + m_entropyCoder.resetBits(); + + for (int compIdx = 0; compIdx < 2; compIdx++) + m_entropyCoder.codeSaoOffsetBO(m_offset[compIdx + 1][SAO_BO] + bestClassBO[compIdx], bestClassBO[compIdx], compIdx + 1); + + uint32_t estRate = m_entropyCoder.getNumberOfWrittenBits(); + int64_t cost = calcSaoRdoCost((estDist[0] + estDist[1]), estRate, lambda[1]); + + if (cost < costPartBest) + { + costPartBest = cost; + bestDist = (estDist[0] + estDist[1]); + + for (int compIdx = 0; compIdx < 2; compIdx++) + { + lclCtuParam[compIdx]->mergeMode = SAO_MERGE_NONE; + lclCtuParam[compIdx]->typeIdx = SAO_BO; + lclCtuParam[compIdx]->bandPos = bestClassBO[compIdx]; + for (int classIdx = 0; classIdx < SAO_NUM_OFFSET; classIdx++) + lclCtuParam[compIdx]->offset[classIdx] = m_offset[compIdx + 1][SAO_BO][classIdx + bestClassBO[compIdx]]; + } + } + + rateDist += (bestDist << 8) / lambda[1]; + m_entropyCoder.load(m_rdContexts.temp); + + if (saoParam->bSaoFlag[1]) + { + m_entropyCoder.codeSaoOffset(*lclCtuParam[0], 1); + m_entropyCoder.codeSaoOffset(*lclCtuParam[1], 2); + m_entropyCoder.store(m_rdContexts.temp); + + uint32_t rate = m_entropyCoder.getNumberOfWrittenBits(); + bestCost = rateDist + rate; + } + else + { + uint32_t rate = m_entropyCoder.getNumberOfWrittenBits(); + bestCost = rateDist + rate; } } // NOTE: must put in namespace X265_NS since we need class SAO void saoCuStatsBO_c(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) { - int x, y; const int boShift = X265_DEPTH - SAO_BO_BITS; - for (y = 0; y < endY; y++) + for (int y = 0; y < endY; y++) { - for (x = 0; x < endX; x++) + for (int x = 0; x < endX; x++) { - int 
classIdx = 1 + (rec[x] >> boShift); + int classIdx = rec[x] >> boShift; stats[classIdx] += diff[x]; count[classIdx]++; } @@ -1766,7 +1699,6 @@ void saoCuStatsE0_c(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) { - int x, y; int32_t tmp_stats[SAO::NUM_EDGETYPE]; int32_t tmp_count[SAO::NUM_EDGETYPE]; @@ -1775,10 +1707,10 @@ memset(tmp_stats, 0, sizeof(tmp_stats)); memset(tmp_count, 0, sizeof(tmp_count)); - for (y = 0; y < endY; y++) + for (int y = 0; y < endY; y++) { int signLeft = signOf(rec[0] - rec[-1]); - for (x = 0; x < endX; x++) + for (int x = 0; x < endX; x++) { int signRight = signOf2(rec[x], rec[x + 1]); X265_CHECK(signRight == signOf(rec[x] - rec[x + 1]), "signDown check failure\n"); @@ -1794,7 +1726,7 @@ rec += stride; } - for (x = 0; x < SAO::NUM_EDGETYPE; x++) + for (int x = 0; x < SAO::NUM_EDGETYPE; x++) { stats[SAO::s_eoTable[x]] += tmp_stats[x]; count[SAO::s_eoTable[x]] += tmp_count[x]; @@ -1806,7 +1738,6 @@ X265_CHECK(endX <= MAX_CU_SIZE, "endX check failure\n"); X265_CHECK(endY <= MAX_CU_SIZE, "endY check failure\n"); - int x, y; int32_t tmp_stats[SAO::NUM_EDGETYPE]; int32_t tmp_count[SAO::NUM_EDGETYPE]; @@ -1814,9 +1745,9 @@ memset(tmp_count, 0, sizeof(tmp_count)); X265_CHECK(endX * endY <= (4096 - 16), "Assembly of saoE1 may overflow with this block size\n"); - for (y = 0; y < endY; y++) + for (int y = 0; y < endY; y++) { - for (x = 0; x < endX; x++) + for (int x = 0; x < endX; x++) { int signDown = signOf2(rec[x], rec[x + stride]); X265_CHECK(signDown == signOf(rec[x] - rec[x + stride]), "signDown check failure\n"); @@ -1831,7 +1762,7 @@ rec += stride; } - for (x = 0; x < SAO::NUM_EDGETYPE; x++) + for (int x = 0; x < SAO::NUM_EDGETYPE; x++) { stats[SAO::s_eoTable[x]] += tmp_stats[x]; count[SAO::s_eoTable[x]] += tmp_count[x]; @@ -1843,17 +1774,16 @@ X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); - int x, y; int32_t tmp_stats[SAO::NUM_EDGETYPE]; int32_t tmp_count[SAO::NUM_EDGETYPE]; memset(tmp_stats, 0, sizeof(tmp_stats)); memset(tmp_count, 0, sizeof(tmp_count)); - for (y = 0; y < endY; y++) + for (int y = 0; y < endY; y++) { upBufft[0] = signOf(rec[stride] - rec[-1]); - for (x = 0; x < endX; x++) + for (int x = 0; x < endX; x++) { int signDown = signOf2(rec[x], rec[x + stride + 1]); X265_CHECK(signDown == signOf(rec[x] - rec[x + stride + 1]), "signDown check failure\n"); @@ -1869,7 +1799,7 @@ diff += MAX_CU_SIZE; } - for (x = 0; x < SAO::NUM_EDGETYPE; x++) + for (int x = 0; x < SAO::NUM_EDGETYPE; x++) { stats[SAO::s_eoTable[x]] += tmp_stats[x]; count[SAO::s_eoTable[x]] += tmp_count[x]; @@ -1881,16 +1811,15 @@ X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); - int x, y; int32_t tmp_stats[SAO::NUM_EDGETYPE]; int32_t tmp_count[SAO::NUM_EDGETYPE]; memset(tmp_stats, 0, sizeof(tmp_stats)); memset(tmp_count, 0, sizeof(tmp_count)); - for (y = 0; y < endY; y++) + for (int y = 0; y < endY; y++) { - for (x = 0; x < endX; x++) + for (int x = 0; x < endX; x++) { int signDown = signOf2(rec[x], rec[x + stride - 1]); X265_CHECK(signDown == signOf(rec[x] - rec[x + stride - 1]), "signDown check failure\n"); @@ -1908,7 +1837,7 @@ diff += MAX_CU_SIZE; } - for (x = 0; x < SAO::NUM_EDGETYPE; x++) + for (int x = 0; x < SAO::NUM_EDGETYPE; x++) { stats[SAO::s_eoTable[x]] += tmp_stats[x]; count[SAO::s_eoTable[x]] += tmp_count[x];
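A note on the arithmetic above: the refactored SAO RDO removes all double-precision lambda math. calcSaoRdoCost() keeps lambda pre-scaled by 256 (Q8 fixed point) and returns distortion + ((bits * lambda + 128) >> 8), while rate-denominated distortions are formed as (dist << 8) / lambda. A minimal stand-alone sketch of that arithmetic (the lambda value is an assumed example):

    #include <cstdint>
    #include <cstdio>

    // Q8 fixed-point RD cost: lambda arrives pre-scaled by 256, so the rate
    // term is rounded (+128) and shifted back down by 8 bits.
    static int64_t rdCostQ8(int64_t distortion, uint32_t bits, int64_t lambdaQ8)
    {
        return distortion + ((bits * lambdaQ8 + 128) >> 8);
    }

    int main()
    {
        double lambda = 6.75;                       // assumed example lambda
        int64_t lambdaQ8 = (int64_t)(lambda * 256); // 1728 in Q8
        int64_t dist = 1000;
        uint32_t bits = 37;

        // Old form: dist + lambda * bits; the new form rounds to the nearest unit.
        printf("double: %.2f  fixed-point: %lld\n",
               dist + lambda * bits, (long long)rdCostQ8(dist, bits, lambdaQ8));
        return 0;
    }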
View file
x265_1.9.tar.gz/source/encoder/sao.h -> x265_2.0.tar.gz/source/encoder/sao.h
Changed
@@ -33,13 +33,6 @@ namespace X265_NS { // private namespace -enum SAOTypeLen -{ - SAO_EO_LEN = 4, - SAO_BO_LEN = 4, - SAO_NUM_BO_CLASSES = 32 -}; - enum SAOType { SAO_EO_0 = 0, @@ -56,12 +49,11 @@ enum { SAO_MAX_DEPTH = 4 }; enum { SAO_BO_BITS = 5 }; - enum { MAX_NUM_SAO_CLASS = 33 }; + enum { MAX_NUM_SAO_CLASS = 32 }; enum { SAO_BIT_INC = 0 }; /* in HM12.0, it wrote as X265_MAX(X265_DEPTH - 10, 0) */ enum { OFFSET_THRESH = 1 << X265_MIN(X265_DEPTH - 5, 5) }; enum { NUM_EDGETYPE = 5 }; enum { NUM_PLANE = 3 }; - enum { NUM_MERGE_MODE = 3 }; enum { SAO_DEPTHRATE_SIZE = 4 }; static const uint32_t s_eoTable[NUM_EDGETYPE]; @@ -81,7 +73,7 @@ PerPlane* m_offsetOrgPreDblk; double* m_depthSaoRate; - int8_t m_offsetBo[NUM_PLANE][SAO_NUM_BO_CLASSES]; + int8_t m_offsetBo[NUM_PLANE][MAX_NUM_SAO_CLASS]; int8_t m_offsetEo[NUM_PLANE][NUM_EDGETYPE]; int m_chromaFormat; @@ -114,10 +106,6 @@ int m_refDepth; int m_numNoSao[2]; - double m_lumaLambda; - double m_chromaLambda; - /* TODO: No doubles for distortion */ - SAO(); bool create(x265_param* param, int initCommon); @@ -126,31 +114,27 @@ void allocSaoParam(SAOParam* saoParam) const; - void startSlice(Frame* pic, Entropy& initState, int qp); + void startSlice(Frame* pic, Entropy& initState); void resetStats(); - void resetSaoUnit(SaoCtuParam* saoUnit); // CTU-based SAO process without slice granularity - void processSaoCu(int addr, int typeIdx, int plane); + void applyPixelOffsets(int addr, int typeIdx, int plane); void processSaoUnitRow(SaoCtuParam* ctuParam, int idxY, int plane); - void processSaoUnitCuLuma(SaoCtuParam* ctuParam, int idxY, int idxX); - void processSaoUnitCuChroma(SaoCtuParam* ctuParam[3], int idxY, int idxX); + void generateLumaOffsets(SaoCtuParam* ctuParam, int idxY, int idxX); + void generateChromaOffsets(SaoCtuParam* ctuParam[3], int idxY, int idxX); - void copySaoUnit(SaoCtuParam* saoUnitDst, const SaoCtuParam* saoUnitSrc); - - void calcSaoStatsCu(int addr, int plane); + void calcSaoStatsCTU(int addr, int plane); void calcSaoStatsCu_BeforeDblk(Frame* pic, int idxX, int idxY); - void saoComponentParamDist(SAOParam* saoParam, int addr, int addrUp, int addrLeft, SaoCtuParam mergeSaoParam[2], double* mergeDist); - void sao2ChromaParamDist(SAOParam* saoParam, int addr, int addrUp, int addrLeft, SaoCtuParam mergeSaoParam[][2], double* mergeDist); - - inline int estIterOffset(int typeIdx, int classIdx, double lambda, int offset, int32_t count, int32_t offsetOrg, - int32_t* currentDistortionTableBo, double* currentRdCostTableBo); - inline int64_t estSaoTypeDist(int plane, int typeIdx, double lambda, int32_t* currentDistortionTableBo, double* currentRdCostTableBo); + void saoLumaComponentParamDist(SAOParam* saoParam, int addr, int64_t& rateDist, int64_t* lambda, int64_t& bestCost); + void saoChromaComponentParamDist(SAOParam* saoParam, int addr, int64_t& rateDist, int64_t* lambda, int64_t& bestCost); + void estIterOffset(int typeIdx, int64_t lambda, int32_t count, int32_t offsetOrg, int32_t& offset, int32_t& distClasses, int64_t& costClasses); void rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus); - void rdoSaoUnitRow(SAOParam* saoParam, int idxY); void rdoSaoUnitCu(SAOParam* saoParam, int rowBaseAddr, int idxX, int addr); + int64_t calcSaoRdoCost(int64_t distortion, uint32_t bits, int64_t lambda); + + void saoStatsInitialOffset(int planes); friend class FrameFilter; };
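This header change matches the statistics change above: band-offset classes are now indexed from 0 (classIdx = rec[x] >> boShift in saoCuStatsBO_c), so the tables shrink from 33 entries to MAX_NUM_SAO_CLASS = 32 and the old +1 bias disappears. A toy illustration of the band classification, assuming 8-bit pixels:

    #include <cstdio>
    #include <initializer_list>

    int main()
    {
        const int depth = 8;                // assumed bit depth (X265_DEPTH)
        const int boBits = 5;               // SAO_BO_BITS: 2^5 = 32 bands
        const int boShift = depth - boBits; // = 3 for 8-bit pixels

        // Every pixel value maps to a class in [0, 31], so 32 table entries
        // suffice without any index offset.
        for (int rec : { 0, 7, 8, 255 })
            printf("pixel %3d -> band class %d\n", rec, rec >> boShift);
        return 0;
    }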
View file
x265_1.9.tar.gz/source/encoder/search.cpp -> x265_2.0.tar.gz/source/encoder/search.cpp
Changed
@@ -73,14 +73,13 @@ { uint32_t maxLog2CUSize = g_log2Size[param.maxCUSize]; m_param = ¶m; - m_bEnableRDOQ = !!param.rdoqLevel; m_bFrameParallel = param.frameNumThreads > 1; m_numLayers = g_log2Size[param.maxCUSize] - 2; m_rdCost.setPsyRdScale(param.psyRd); - m_me.init(param.searchMethod, param.subpelRefine, param.internalCsp); + m_me.init(param.internalCsp); - bool ok = m_quant.init(param.rdoqLevel, param.psyRdoq, scalingList, m_entropyCoder); + bool ok = m_quant.init(param.psyRdoq, scalingList, m_entropyCoder); if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize) ok &= m_quant.allocNoiseReduction(param); @@ -223,9 +222,10 @@ if (!(log2TrSize - m_hChromaShift < 2)) { - if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1)) + uint32_t parentIdx = absPartIdx & (0xFF << (log2TrSize + 1 - LOG2_UNIT_SIZE) * 2); + if (!tuDepth || cu.getCbf(parentIdx, TEXT_CHROMA_U, tuDepth - 1)) m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_U, tuDepth, !subdiv); - if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1)) + if (!tuDepth || cu.getCbf(parentIdx, TEXT_CHROMA_V, tuDepth - 1)) m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_V, tuDepth, !subdiv); } @@ -296,6 +296,7 @@ uint32_t sizeIdx = log2TrSize - 2; bool mightNotSplit = log2TrSize <= depthRange[1]; bool mightSplit = (log2TrSize > depthRange[0]) && (bAllowSplit || !mightNotSplit); + bool bEnableRDOQ = !!m_param->rdoqLevel; /* If maximum RD penalty, force spits at TU size 32x32 if SPS allows TUs of 16x16 */ if (m_param->rdPenalty == 2 && m_slice->m_sliceType != I_SLICE && log2TrSize == 5 && depthRange[0] <= 4) @@ -336,7 +337,7 @@ coeff_t* coeffY = m_rqt[qtLayer].coeffRQT[0] + coeffOffsetY; // store original entropy coding status - if (m_bEnableRDOQ) + if (bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true); primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); @@ -434,8 +435,7 @@ cbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); } - for (uint32_t offs = 0; offs < 4 * qNumParts; offs++) - cu.m_cbf[0][absPartIdx + offs] |= (cbf << tuDepth); + cu.m_cbf[0][absPartIdx] |= (cbf << tuDepth); if (mightNotSplit && log2TrSize != depthRange[0]) { @@ -487,6 +487,7 @@ uint32_t fullDepth = cuGeom.depth + tuDepth; uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth; uint32_t tuSize = 1 << log2TrSize; + bool bEnableRDOQ = !!m_param->rdoqLevel; X265_CHECK(tuSize <= MAX_TS_SIZE, "transform skip is only possible at 4x4 TUs\n"); @@ -525,7 +526,7 @@ // store original entropy coding status m_entropyCoder.store(m_rqt[fullDepth].rqtRoot); - if (m_bEnableRDOQ) + if (bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true); int checkTransformSkip = 1; @@ -717,8 +718,7 @@ residualTransformQuantIntra(mode, cuGeom, qPartIdx, tuDepth + 1, depthRange); cbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); } - for (uint32_t offs = 0; offs < 4 * qNumParts; offs++) - cu.m_cbf[0][absPartIdx + offs] |= (cbf << tuDepth); + cu.m_cbf[0][absPartIdx] |= (cbf << tuDepth); } } @@ -782,6 +782,7 @@ { CUData& cu = mode.cu; uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth; + bool bEnableRDOQ = !!m_param->rdoqLevel; if (tuDepth < cu.m_tuDepth[absPartIdx]) { @@ -793,11 +794,9 @@ splitCbfU |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); splitCbfV |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); } - for (uint32_t offs = 0; offs < 4 * qNumParts; offs++) - { - cu.m_cbf[1][absPartIdx + offs] |= (splitCbfU << tuDepth); - cu.m_cbf[2][absPartIdx + 
offs] |= (splitCbfV << tuDepth); - } + cu.m_cbf[1][absPartIdx] |= (splitCbfU << tuDepth); + cu.m_cbf[2][absPartIdx] |= (splitCbfV << tuDepth); + return; } @@ -812,7 +811,7 @@ tuDepthC--; } - if (m_bEnableRDOQ) + if (bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSizeC, false); bool checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && log2TrSizeC <= MAX_LOG2_TS_SIZE && !cu.m_tqBypass[0]; @@ -1091,11 +1090,8 @@ splitCbfU |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); splitCbfV |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); } - for (uint32_t offs = 0; offs < 4 * qNumParts; offs++) - { - cu.m_cbf[1][absPartIdx + offs] |= (splitCbfU << tuDepth); - cu.m_cbf[2][absPartIdx + offs] |= (splitCbfV << tuDepth); - } + cu.m_cbf[1][absPartIdx] |= (splitCbfU << tuDepth); + cu.m_cbf[2][absPartIdx] |= (splitCbfV << tuDepth); return; } @@ -1629,8 +1625,7 @@ for (uint32_t qIdx = 0, qPartIdx = 0; qIdx < 4; ++qIdx, qPartIdx += qNumParts) combCbfY |= cu.getCbf(qPartIdx, TEXT_LUMA, 1); - for (uint32_t offs = 0; offs < 4 * qNumParts; offs++) - cu.m_cbf[0][offs] |= combCbfY; + cu.m_cbf[0][0] |= combCbfY; } // TODO: remove this @@ -1732,6 +1727,12 @@ else cu.getAllowedChromaDir(absPartIdxC, modeList); + if (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400) + { + for (uint32_t l = 1; l < NUM_CHROMA_MODE; l++) + modeList[l] = modeList[0]; + maxMode = 1; + } // check chroma modes for (uint32_t mode = minMode; mode < maxMode; mode++) { @@ -1816,11 +1817,8 @@ combCbfV |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, 1); } - for (uint32_t offs = 0; offs < 4 * qNumParts; offs++) - { - cu.m_cbf[1][offs] |= combCbfU; - cu.m_cbf[2][offs] |= combCbfV; - } + cu.m_cbf[1][0] |= combCbfU; + cu.m_cbf[2][0] |= combCbfV; } /* TODO: remove this */ @@ -1974,7 +1972,8 @@ slave.m_frame = m_frame; slave.m_param = m_param; slave.setLambdaFromQP(pme.mode.cu, m_rdCost.m_qp); - slave.m_me.setSourcePU(*pme.mode.fencYuv, pme.pu.ctuAddr, pme.pu.cuAbsPartIdx, pme.pu.puAbsPartIdx, pme.pu.width, pme.pu.height); + bool bChroma = slave.m_frame->m_fencPic->m_picCsp != X265_CSP_I400; + slave.m_me.setSourcePU(*pme.mode.fencYuv, pme.pu.ctuAddr, pme.pu.cuAbsPartIdx, pme.pu.puAbsPartIdx, pme.pu.width, pme.pu.height, m_param->searchMethod, m_param->subpelRefine, bChroma); } /* Perform ME, repeat until no more work is available */ @@ -2015,9 +2014,12 @@ int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - MV lmv = getLowresMV(interMode.cu, pu, list, ref); - if (lmv.notZero()) - mvc[numMvc++] = lmv; + if (!m_param->analysisMode) /* Prevents load/save outputs from diverging if lowresMV is not available */ + { + MV lmv = getLowresMV(interMode.cu, pu, list, ref); + if (lmv.notZero()) + mvc[numMvc++] = lmv; + } setSearchRange(interMode.cu, mvp, m_param->searchRange, mvmin, mvmax); @@ -2074,7 +2076,7 @@ MotionData* bestME = interMode.bestME[puIdx]; PredictionUnit pu(cu, cuGeom, puIdx); - m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height); + m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height, m_param->searchMethod, m_param->subpelRefine, bChromaMC); /* find best cost merge candidate. note: 2Nx2N merge and bidir are handled as separate modes */ uint32_t mrgCost = numPart == 1 ? 
MAX_UINT : mergeEstimation(cu, cuGeom, pu, puIdx, merge); @@ -2104,10 +2106,7 @@ const MV* amvp = interMode.amvpCand[list][ref]; int mvpIdx = selectMVP(cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - MV lmv = bestME[list].mv; - if (lmv.notZero()) - mvc[numMvc++] = lmv; - + setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); @@ -2128,8 +2127,8 @@ bestME[list].bits = bits; bestME[list].mvCost = mvCost; } - } - bDoUnidir = false; + bDoUnidir = false; + } } else if (m_param->bDistributeMotionEstimation) { @@ -2199,9 +2198,12 @@ int mvpIdx = selectMVP(cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - MV lmv = getLowresMV(cu, pu, list, ref); - if (lmv.notZero()) - mvc[numMvc++] = lmv; + if (!m_param->analysisMode) /* Prevents load/save outputs from diverging when lowresMV is not available */ + { + MV lmv = getLowresMV(cu, pu, list, ref); + if (lmv.notZero()) + mvc[numMvc++] = lmv; + } setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); @@ -2534,7 +2536,7 @@ interMode.lumaDistortion = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); interMode.distortion = interMode.lumaDistortion; // Chroma - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { interMode.chromaDistortion = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); interMode.chromaDistortion += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); @@ -2575,7 +2577,7 @@ uint32_t log2CUSize = cuGeom.log2CUSize; int sizeIdx = log2CUSize - 2; - resiYuv->subtract(*fencYuv, *predYuv, log2CUSize); + resiYuv->subtract(*fencYuv, *predYuv, log2CUSize, m_frame->m_fencPic->m_picCsp); uint32_t tuDepthRange[2]; cu.getInterTUQtDepthRange(tuDepthRange, 0); @@ -2589,7 +2591,7 @@ if (!tqBypass) { sse_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { cbf0Dist += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], predYuv->m_csize, predYuv->m_buf[1], predYuv->m_csize)); cbf0Dist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], predYuv->m_csize, predYuv->m_buf[2], predYuv->m_csize)); @@ -2660,14 +2662,14 @@ m_entropyCoder.store(interMode.contexts); if (cu.getQtRootCbf(0)) - reconYuv->addClip(*predYuv, *resiYuv, log2CUSize); + reconYuv->addClip(*predYuv, *resiYuv, log2CUSize, m_frame->m_fencPic->m_picCsp); else reconYuv->copyFromYuv(*predYuv); // update with clipped distortion and cost (qp estimation loop uses unclipped values) sse_t bestLumaDist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); interMode.distortion = bestLumaDist; - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { sse_t bestChromaDist = m_rdCost.scaleChromaDist(1, 
primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); bestChromaDist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); @@ -2699,7 +2701,7 @@ { // code full block uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; - uint32_t codeChroma = (m_csp != X265_CSP_I400) ? 1 : 0; + uint32_t codeChroma = (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) ? 1 : 0; uint32_t tuDepthC = tuDepth; if (log2TrSizeC < 2) @@ -2807,20 +2809,17 @@ { residualTransformQuantInter(mode, cuGeom, qPartIdx, tuDepth + 1, depthRange); ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); } } - for (uint32_t i = 0; i < 4 * qNumParts; ++i) + cu.m_cbf[0][absPartIdx] |= ycbf << tuDepth; + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { - cu.m_cbf[0][absPartIdx + i] |= ycbf << tuDepth; - if (m_csp != X265_CSP_I400) - { - cu.m_cbf[1][absPartIdx + i] |= ucbf << tuDepth; - cu.m_cbf[2][absPartIdx + i] |= vcbf << tuDepth; - } + cu.m_cbf[1][absPartIdx] |= ucbf << tuDepth; + cu.m_cbf[2][absPartIdx] |= vcbf << tuDepth; } } } @@ -2840,6 +2839,7 @@ CUData& cu = mode.cu; uint32_t depth = cuGeom.depth + tuDepth; uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth; + bool bEnableRDOQ = !!m_param->rdoqLevel; bool bCheckSplit = log2TrSize > depthRange[0]; bool bCheckFull = log2TrSize <= depthRange[1]; @@ -2851,7 +2851,7 @@ X265_CHECK(bCheckFull || bCheckSplit, "check-full or check-split must be set\n"); uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; - uint32_t codeChroma = (m_csp != X265_CSP_I400) ? 1 : 0; + uint32_t codeChroma = (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) ? 1 : 0; uint32_t tuDepthC = tuDepth; if (log2TrSizeC < 2) { @@ -2897,7 +2897,7 @@ cu.setTUDepthSubParts(tuDepth, absPartIdx, depth); cu.setTransformSkipSubParts(0, TEXT_LUMA, absPartIdx, depth); - if (m_bEnableRDOQ) + if (bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true); const pixel* fenc = fencYuv->getLumaAddr(absPartIdx); @@ -3011,7 +3011,7 @@ cu.setTransformSkipPartRange(0, (TextType)chromaId, absPartIdxC, tuIterator.absPartIdxStep); - if (m_bEnableRDOQ && (chromaId != TEXT_CHROMA_V)) + if (bEnableRDOQ && (chromaId != TEXT_CHROMA_V)) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSizeC, false); fenc = fencYuv->getChromaAddr(chromaId, absPartIdxC); @@ -3102,6 +3102,19 @@ } } + if (m_frame->m_fencPic->m_picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400) + { + for (uint32_t chromaId = TEXT_CHROMA_U; chromaId <= TEXT_CHROMA_V; chromaId++) + { + TURecurse tuIterator(splitIntoSubTUs ? 
VERTICAL_SPLIT : DONT_SPLIT, absPartIdxStep, absPartIdx); + do + { + uint32_t absPartIdxC = tuIterator.absPartIdxTURelCU; + cu.setCbfPartRange(0, (TextType)chromaId, absPartIdxC, tuIterator.absPartIdxStep); + } + while(tuIterator.isNextSection()); + } + } if (checkTransformSkipY) { sse_t nonZeroDistY = 0; @@ -3112,7 +3125,7 @@ cu.setTransformSkipSubParts(1, TEXT_LUMA, absPartIdx, depth); - if (m_bEnableRDOQ) + if (bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true); fenc = fencYuv->getLumaAddr(absPartIdx); @@ -3180,7 +3193,7 @@ cu.setTransformSkipPartRange(1, (TextType)chromaId, absPartIdxC, tuIterator.absPartIdxStep); - if (m_bEnableRDOQ && (chromaId != TEXT_CHROMA_V)) + if (bEnableRDOQ && (chromaId != TEXT_CHROMA_V)) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSizeC, false); fenc = fencYuv->getChromaAddr(chromaId, absPartIdxC); @@ -3311,20 +3324,17 @@ { estimateResidualQT(mode, cuGeom, qPartIdx, tuDepth + 1, resiYuv, splitCost, depthRange); ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); } } - for (uint32_t i = 0; i < 4 * qNumParts; ++i) + cu.m_cbf[0][absPartIdx] |= ycbf << tuDepth; + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { - cu.m_cbf[0][absPartIdx + i] |= ycbf << tuDepth; - if (m_csp != X265_CSP_I400) - { - cu.m_cbf[1][absPartIdx + i] |= ucbf << tuDepth; - cu.m_cbf[2][absPartIdx + i] |= vcbf << tuDepth; - } + cu.m_cbf[1][absPartIdx] |= ucbf << tuDepth; + cu.m_cbf[2][absPartIdx] |= vcbf << tuDepth; } // Here we were encoding cbfs and coefficients for splitted blocks. Since I have collected coefficient bits @@ -3413,25 +3423,21 @@ const bool bSubdiv = tuDepth < cu.m_tuDepth[absPartIdx]; uint32_t log2TrSize = cu.m_log2CUSize[0] - tuDepth; - if (m_csp != X265_CSP_I400) + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) { if (!(log2TrSize - m_hChromaShift < 2)) { - if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1)) + uint32_t parentIdx = absPartIdx & (0xFF << (log2TrSize + 1 - LOG2_UNIT_SIZE) * 2); + if (!tuDepth || cu.getCbf(parentIdx, TEXT_CHROMA_U, tuDepth - 1)) m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_U, tuDepth, !bSubdiv); - if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1)) + if (!tuDepth || cu.getCbf(parentIdx, TEXT_CHROMA_V, tuDepth - 1)) m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_V, tuDepth, !bSubdiv); } - else - { - X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1), "chroma CBF not matching\n"); - X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1), "chroma CBF not matching\n"); - } } if (!bSubdiv) { - m_entropyCoder.codeQtCbfLuma(cu, absPartIdx, tuDepth); + m_entropyCoder.codeQtCbfLuma(cu.getCbf(absPartIdx, TEXT_LUMA, tuDepth), tuDepth); } else { @@ -3456,7 +3462,7 @@ const uint32_t qtLayer = log2TrSize - 2; uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; - uint32_t codeChroma = (m_csp != X265_CSP_I400) ? 1 : 0; + uint32_t codeChroma = (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) ? 1 : 0; uint32_t tuDepthC = tuDepth; if (log2TrSizeC < 2) {
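A recurring pattern in the search.cpp diff above: CBF flags are now written only at a TU's root absPartIdx rather than replicated across every sub-part, and readers recover the parent TU's index with absPartIdx & (0xFF << (log2TrSize + 1 - LOG2_UNIT_SIZE) * 2). A sketch of that mask over z-order indices of 4x4 units (the example indices are assumptions for illustration):

    #include <cstdint>
    #include <cstdio>

    // absPartIdx enumerates 4x4 units (LOG2_UNIT_SIZE == 2) in z-order, so a
    // TU of size 2^(log2TrSize + 1) starts at an index whose low
    // (log2TrSize + 1 - 2) * 2 bits are zero; 0xFF covers the 256 units of a
    // 64x64 CTU.
    static uint32_t parentIdx(uint32_t absPartIdx, uint32_t log2TrSize)
    {
        const uint32_t LOG2_UNIT_SIZE = 2;
        return absPartIdx & (0xFF << (log2TrSize + 1 - LOG2_UNIT_SIZE) * 2);
    }

    int main()
    {
        // An 8x8 TU (log2TrSize = 3) at part 12 sits in the 16x16 parent that
        // starts at part 0; a TU at part 20 sits in the parent at part 16.
        printf("%u %u\n", parentIdx(12, 3), parentIdx(20, 3));
        return 0;
    }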
View file
x265_1.9.tar.gz/source/encoder/search.h -> x265_2.0.tar.gz/source/encoder/search.h
Changed
@@ -272,7 +272,6 @@ pixel* m_tsRecon; /* transform skip reconstructed pixels 32x32 */ bool m_bFrameParallel; - bool m_bEnableRDOQ; uint32_t m_numLayers; uint32_t m_refLagPixels;
View file
x265_1.9.tar.gz/source/encoder/slicetype.cpp -> x265_2.0.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -83,7 +83,7 @@ uint32_t var; var = acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[0] + blockOffsetLuma, stride, 0, csp); - if (csp != X265_CSP_I400) + if (csp != X265_CSP_I400 && curFrame->m_fencPic->m_picCsp != X265_CSP_I400) { var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, csp); var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, csp); @@ -456,10 +456,13 @@ COPY4_IF_LT(minscore, s, minscale, curScale, minoff, curOffset, found, 1); /* Use a smaller denominator if possible */ - while (mindenom > 0 && !(minscale & 1)) + if (mindenom > 0 && !(minscale & 1)) { - mindenom--; - minscale >>= 1; + unsigned long idx; + CTZ(idx, minscale); + int shift = X265_MIN((int)idx, mindenom); + mindenom -= shift; + minscale >>= shift; } if (!found || (minscale == 1 << mindenom && minoff == 0) || (float)minscore / origscore > 0.998f) @@ -2081,7 +2084,7 @@ const intptr_t pelOffset = cuSize * cuX + cuSize * cuY * fenc->lumaStride; if (bBidir || bDoSearch[0] || bDoSearch[1]) - tld.me.setSourcePU(fenc->lowresPlane[0], fenc->lumaStride, pelOffset, cuSize, cuSize); + tld.me.setSourcePU(fenc->lowresPlane[0], fenc->lumaStride, pelOffset, cuSize, cuSize, X265_HEX_SEARCH, 1); /* A small, arbitrary bias to avoid VBV problems caused by zero-residual lookahead blocks. */ int lowresPenalty = 4;
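The weighted-prediction scale reduction above replaces an iterative halving loop with one count-trailing-zeros step; the identical change appears in weightPrediction.cpp below. A sketch of the equivalence, substituting GCC/Clang's __builtin_ctz for x265's CTZ macro (treating the macro as a trailing-zero count is an assumption):

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        int mindenom = 7, minscale = 40; // assumed example weight state

        // Old form: peel one factor of two per loop iteration.
        int d1 = mindenom, s1 = minscale;
        while (d1 > 0 && !(s1 & 1)) { d1--; s1 >>= 1; }

        // New form: count the trailing zeros once, clamp to the denominator.
        int d2 = mindenom, s2 = minscale;
        if (d2 > 0 && !(s2 & 1))
        {
            int shift = std::min(__builtin_ctz((unsigned)s2), d2);
            d2 -= shift;
            s2 >>= shift;
        }

        printf("old: %d/%d  new: %d/%d\n", s1, d1, s2, d2); // same result
        return 0;
    }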
View file
x265_1.9.tar.gz/source/encoder/slicetype.h -> x265_2.0.tar.gz/source/encoder/slicetype.h
Changed
@@ -60,8 +60,8 @@ LookaheadTLD() { + me.init(X265_CSP_I400); me.setQP(X265_LOOKAHEAD_QP); - me.init(X265_HEX_SEARCH, 1, X265_CSP_I400); for (int i = 0; i < 4; i++) wbuffer[i] = NULL; widthInCU = heightInCU = ncu = paddedLines = 0;
View file
x265_1.9.tar.gz/source/encoder/weightPrediction.cpp -> x265_2.0.tar.gz/source/encoder/weightPrediction.cpp
Changed
@@ -31,6 +31,7 @@ #include "slice.h" #include "mv.h" #include "bitstream.h" +#include "threading.h" using namespace X265_NS; namespace { @@ -132,25 +133,25 @@ intptr_t fpeloffset = (mv.y >> 2) * stride + (mv.x >> 2); pixel *temp = src + pixoff + fpeloffset; - int xFrac = mv.x & 0x7; - int yFrac = mv.y & 0x7; - if ((yFrac | xFrac) == 0) + int xFrac = mv.x & 7; + int yFrac = mv.y & 7; + if (!(yFrac | xFrac)) { primitives.chroma[csp].pu[LUMA_16x16].copy_pp(mcout + pixoff, stride, temp, stride); } - else if (yFrac == 0) + else if (!yFrac) { primitives.chroma[csp].pu[LUMA_16x16].filter_hpp(temp, stride, mcout + pixoff, stride, xFrac); } - else if (xFrac == 0) + else if (!xFrac) { primitives.chroma[csp].pu[LUMA_16x16].filter_vpp(temp, stride, mcout + pixoff, stride, yFrac); } else { - ALIGN_VAR_16(int16_t, imm[16 * (16 + NTAPS_CHROMA)]); - primitives.chroma[csp].pu[LUMA_16x16].filter_hps(temp, stride, imm, bw, xFrac, 1); - primitives.chroma[csp].pu[LUMA_16x16].filter_vsp(imm + ((NTAPS_CHROMA >> 1) - 1) * bw, bw, mcout + pixoff, stride, yFrac); + ALIGN_VAR_16(int16_t, immed[16 * (16 + NTAPS_CHROMA - 1)]); + primitives.chroma[csp].pu[LUMA_16x16].filter_hps(temp, stride, immed, bw, xFrac, 1); + primitives.chroma[csp].pu[LUMA_16x16].filter_vsp(immed + ((NTAPS_CHROMA >> 1) - 1) * bw, bw, mcout + pixoff, stride, yFrac); } } else @@ -232,7 +233,7 @@ cache.numPredDir = slice.isInterP() ? 1 : 2; cache.lowresWidthInCU = fenc.width >> 3; cache.lowresHeightInCU = fenc.lines >> 3; - cache.csp = fencPic->m_picCsp; + cache.csp = param.internalCsp; cache.hshift = CHROMA_H_SHIFT(cache.csp); cache.vshift = CHROMA_V_SHIFT(cache.csp); @@ -329,7 +330,7 @@ { /* reference chroma planes must be extended prior to being * used as motion compensation sources */ - if (!refFrame->m_bChromaExtended && param.internalCsp != X265_CSP_I400) + if (!refFrame->m_bChromaExtended && param.internalCsp != X265_CSP_I400 && frame.m_fencPic->m_picCsp != X265_CSP_I400) { refFrame->m_bChromaExtended = true; PicYuv *refPic = refFrame->m_fencPic; @@ -456,10 +457,13 @@ /* Use a smaller luma denominator if possible */ if (!(plane || list)) { - while (mindenom > 0 && !(minscale & 1)) + if (mindenom > 0 && !(minscale & 1)) { - mindenom--; - minscale >>= 1; + unsigned long idx; + CTZ(idx, minscale); + int shift = X265_MIN((int)idx, mindenom); + mindenom -= shift; + minscale >>= shift; } }
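Besides sharing the CTZ change, this diff right-sizes the intermediate buffer for separable motion compensation: filtering a 16-row block horizontally and then vertically with an N-tap filter needs 16 + N - 1 intermediate rows, which the renamed immed array now allocates exactly. The sizing worked through (tap count as used by the chroma filters here):

    #include <cstdio>

    int main()
    {
        const int bw = 16, bh = 16;   // LUMA_16x16 block, as in the code above
        const int NTAPS_CHROMA = 4;   // 4-tap chroma interpolation filter

        // H-then-V filtering needs (N/2 - 1) rows above and N/2 rows below
        // the block in addition to the bh block rows themselves.
        int rows = bh + NTAPS_CHROMA - 1;
        printf("intermediate buffer: %d x %d int16 samples\n", bw, rows);
        return 0;
    }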
View file
x265_1.9.tar.gz/source/input/y4m.cpp -> x265_2.0.tar.gz/source/input/y4m.cpp
Changed
@@ -417,6 +417,8 @@ { int pixelbytes = depth > 8 ? 2 : 1; pic.bitDepth = depth; + pic.framesize = framesize; + pic.height = height; pic.colorSpace = colorSpace; pic.stride[0] = width * pixelbytes; pic.stride[1] = pic.stride[0] >> x265_cli_csps[colorSpace].width[1];
View file
x265_1.9.tar.gz/source/input/yuv.cpp -> x265_2.0.tar.gz/source/input/yuv.cpp
Changed
@@ -225,6 +225,8 @@ uint32_t pixelbytes = depth > 8 ? 2 : 1; pic.colorSpace = colorSpace; pic.bitDepth = depth; + pic.framesize = framesize; + pic.height = height; pic.stride[0] = width * pixelbytes; pic.stride[1] = pic.stride[0] >> x265_cli_csps[colorSpace].width[1]; pic.stride[2] = pic.stride[0] >> x265_cli_csps[colorSpace].width[2];
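As with y4m.cpp above, the YUV reader now records framesize and height in the x265_picture it returns. A sketch of the plane arithmetic behind those fields, assuming 8-bit 4:2:0 input (the real code derives the chroma shifts from the x265_cli_csps table):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        int width = 1920, height = 1080; // assumed clip geometry
        int pixelbytes = 1;              // 8-bit input

        int strideY = width * pixelbytes;
        int strideC = strideY >> 1;      // 4:2:0 chroma is half width
        uint64_t framesize = (uint64_t)strideY * height              // Y
                           + 2 * (uint64_t)strideC * (height >> 1);  // U + V

        printf("strides %d/%d, framesize %llu bytes\n",
               strideY, strideC, (unsigned long long)framesize);
        return 0;
    }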
View file
x265_1.9.tar.gz/source/output/raw.cpp -> x265_2.0.tar.gz/source/output/raw.cpp
Changed
@@ -32,11 +32,11 @@ b_fail = false; if (!strcmp(fname, "-")) { - ofs = &cout; + ofs = stdout; return; } - ofs = new ofstream(fname, ios::binary | ios::out); - if (ofs->fail()) + ofs = x265_fopen(fname, "wb"); + if (!ofs || ferror(ofs)) b_fail = true; } @@ -51,7 +51,7 @@ for (uint32_t i = 0; i < nalcount; i++) { - ofs->write((const char*)nal->payload, nal->sizeBytes); + fwrite((const void*)nal->payload, 1, nal->sizeBytes, ofs); bytes += nal->sizeBytes; nal++; } @@ -65,7 +65,7 @@ for (uint32_t i = 0; i < nalcount; i++) { - ofs->write((const char*)nal->payload, nal->sizeBytes); + fwrite((const void*)nal->payload, 1, nal->sizeBytes, ofs); bytes += nal->sizeBytes; nal++; } @@ -75,6 +75,6 @@ void RAWOutput::closeFile(int64_t, int64_t) { - if (ofs != &cout) - delete ofs; + if (ofs != stdout) + fclose(ofs); }
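The raw writer drops std::ofstream for C stdio so that "-" can map directly to stdout; the file case goes through x265_fopen (treated here as a portable fopen wrapper, which is an assumption). The open-or-stdout pattern, sketched with plain fopen:

    #include <cstdio>
    #include <cstring>

    // Open the named file in binary mode, or use stdout for "-"; only a real
    // file is closed at the end, mirroring RAWOutput above.
    int main(int argc, char** argv)
    {
        const char* fname = (argc > 1) ? argv[1] : "-";
        FILE* ofs = strcmp(fname, "-") ? fopen(fname, "wb") : stdout;
        if (!ofs)
            return 1;

        const unsigned char payload[] = { 0, 0, 0, 1 }; // dummy NAL start code
        fwrite(payload, 1, sizeof(payload), ofs);

        if (ofs != stdout)
            fclose(ofs);
        return 0;
    }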
View file
x265_1.9.tar.gz/source/output/raw.h -> x265_2.0.tar.gz/source/output/raw.h
Changed
@@ -35,7 +35,7 @@ { protected: - std::ostream* ofs; + FILE* ofs; bool b_fail;
View file
x265_1.9.tar.gz/source/test/CMakeLists.txt -> x265_2.0.tar.gz/source/test/CMakeLists.txt
Changed
@@ -1,4 +1,12 @@ # vim: syntax=cmake + +check_symbol_exists(__rdtsc "intrin.h" HAVE_RDTSC) +if(HAVE_RDTSC) + add_definitions(-DHAVE_RDTSC=1) +endif() + +# add X86 assembly files +if(X86) enable_language(ASM_YASM) if(MSVC_IDE) @@ -11,11 +19,23 @@ else() set(YASM_SRC checkasm-a.asm) endif() +endif(X86) -check_symbol_exists(__rdtsc "intrin.h" HAVE_RDTSC) -if(HAVE_RDTSC) - add_definitions(-DHAVE_RDTSC=1) -endif() +# add ARM assembly files +if(ARM OR CROSS_COMPILE_ARM) + enable_language(ASM) + set(YASM_SRC checkasm-arm.S) + add_custom_command( + OUTPUT checkasm-arm.obj + COMMAND ${CMAKE_CXX_COMPILER} + ARGS ${YASM_FLAGS} ${CMAKE_CURRENT_SOURCE_DIR}/checkasm-arm.S -o checkasm-arm.obj + DEPENDS checkasm-arm.S) +endif(ARM OR CROSS_COMPILE_ARM) + +# add PowerPC assembly files +if(POWER) + set(YASM_SRC) +endif(POWER) add_executable(TestBench ${YASM_SRC} testbench.cpp testharness.h @@ -23,6 +43,7 @@ mbdstharness.cpp mbdstharness.h ipfilterharness.cpp ipfilterharness.h intrapredharness.cpp intrapredharness.h) + target_link_libraries(TestBench x265-static ${PLATFORM_LIBS}) if(LINKER_OPTIONS) if(EXTRA_LIB)
View file
x265_2.0.tar.gz/source/test/checkasm-arm.S
Added
@@ -0,0 +1,133 @@ +/**************************************************************************** + * checkasm-arm.S: assembly check tool + ***************************************************************************** + * Copyright (C) 2016 x265 project + * + * Authors: Martin Storsjo <martin@martin.st> + * Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "../common/arm/asm.S" + +.section .rodata +.align 4 +register_init: +.quad 0x21f86d66c8ca00ce +.quad 0x75b6ba21077c48ad +.quad 0xed56bb2dcb3c7736 +.quad 0x8bda43d3fd1a7e06 +.quad 0xb64a9c9e5d318408 +.quad 0xdf9a54b303f1d3a3 +.quad 0x4a75479abd64e097 +.quad 0x249214109d5d1c88 + +error_message: +.asciz "failed to preserve register" + +.text + +@ max number of args used by any x265 asm function. +#define MAX_ARGS 15 + +#define ARG_STACK 4*(MAX_ARGS - 2) + +.macro clobbercheck variant +.equ pushed, 4*10 +function x265_checkasm_call_\variant + push {r4-r11, lr} +.ifc \variant, neon + vpush {q4-q7} +.equ pushed, pushed + 16*4 +.endif + + movrel r12, register_init +.ifc \variant, neon + vldm r12, {q4-q7} +.endif + ldm r12, {r4-r11} + + push {r1} + + sub sp, sp, #ARG_STACK +.equ pos, 0 +.rept MAX_ARGS-2 + ldr r12, [sp, #ARG_STACK + pushed + 8 + pos] + str r12, [sp, #pos] +.equ pos, pos + 4 +.endr + + mov r12, r0 + mov r0, r2 + mov r1, r3 + ldrd r2, r3, [sp, #ARG_STACK + pushed] + blx r12 + add sp, sp, #ARG_STACK + pop {r2} + + push {r0, r1} + movrel r12, register_init +.ifc \variant, neon + vldm r12, {q0-q3} + veor q0, q0, q4 + veor q1, q1, q5 + veor q2, q2, q6 + veor q3, q3, q7 + vorr q0, q0, q1 + vorr q0, q0, q2 + vorr q0, q0, q3 + vorr d0, d0, d1 + vrev64.32 d1, d0 + vorr d0, d0, d1 + vmov.32 r3, d0[0] +.else + mov r3, #0 +.endif + +.macro check_reg reg1, reg2 + ldrd r0, r1, [r12], #8 + eor r0, r0, \reg1 + eor r1, r1, \reg2 + orr r3, r3, r0 + orr r3, r3, r1 +.endm + check_reg r4, r5 + check_reg r6, r7 + check_reg r8, r9 + check_reg r10, r11 +.purgem check_reg + + cmp r3, #0 + beq 0f + + mov r12, #0 + str r12, [r2] + movrel r0, error_message + bl puts +0: + pop {r0, r1} +.ifc \variant, neon + vpop {q4-q7} +.endif + pop {r4-r11, pc} +endfunc +.endm + +clobbercheck neon +clobbercheck noneon
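The wrapper above seeds the callee-saved registers (r4-r11, plus q4-q7 in the neon variant) from register_init, forwards the call, then XORs each register against its seed and ORs the differences together; any nonzero bit means the function under test clobbered a register it should have preserved. A scalar C++ sketch of that verification step, reusing the first four seed constants:

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const uint64_t seed[4] = { 0x21f86d66c8ca00ceULL, 0x75b6ba21077c48adULL,
                                   0xed56bb2dcb3c7736ULL, 0x8bda43d3fd1a7e06ULL };
        uint64_t regs[4] = { seed[0], seed[1], seed[2], seed[3] };

        regs[2] ^= 1; // simulate a callee that clobbered one register

        uint64_t diff = 0;
        for (int i = 0; i < 4; i++)
            diff |= regs[i] ^ seed[i]; // same eor/orr reduction as the asm

        if (diff)
            puts("failed to preserve register");
        return 0;
    }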
View file
x265_1.9.tar.gz/source/test/pixelharness.cpp -> x265_2.0.tar.gz/source/test/pixelharness.cpp
Changed
@@ -43,6 +43,7 @@
         ushort_test_buff[0][i] = rand() % ((1 << 16) - 1);
         uchar_test_buff[0][i] = rand() % ((1 << 8) - 1);
         residual_test_buff[0][i] = (rand() % (2 * RMAX + 1)) - RMAX - 1;// For sse_ss only
+        double_test_buff[0][i] = (double)(short_test_buff[0][i]) / 256.0;
 
         pixel_test_buff[1][i] = PIXEL_MIN;
         short_test_buff[1][i] = SMIN;
@@ -52,6 +53,7 @@
         ushort_test_buff[1][i] = PIXEL_MIN;
         uchar_test_buff[1][i] = PIXEL_MIN;
         residual_test_buff[1][i] = RMIN;
+        double_test_buff[1][i] = (double)(short_test_buff[1][i]) / 256.0;
 
         pixel_test_buff[2][i] = PIXEL_MAX;
         short_test_buff[2][i] = SMAX;
@@ -61,6 +63,7 @@
         ushort_test_buff[2][i] = ((1 << 16) - 1);
         uchar_test_buff[2][i] = 255;
         residual_test_buff[2][i] = RMAX;
+        double_test_buff[2][i] = (double)(short_test_buff[2][i]) / 256.0;
 
         pbuf1[i] = rand() & PIXEL_MAX;
         pbuf2[i] = rand() & PIXEL_MAX;
@@ -858,9 +861,8 @@
         int width = (rand() % 4) + 1; // range[1-4]
         float cres = ref(sum0, sum1, width);
         float vres = checked_float(opt, sum0, sum1, width);
 
-        if (fabs(vres - cres) > 0.0001)
+        if (fabs(vres - cres) > 0.001)
             return false;
 
-        reportfail();
     }
@@ -1398,6 +1400,60 @@
     return true;
 }
 
+bool PixelHarness::check_cutree_fix8_pack(cutree_fix8_pack ref, cutree_fix8_pack opt)
+{
+    ALIGN_VAR_32(uint16_t, ref_dest[64 * 64]);
+    ALIGN_VAR_32(uint16_t, opt_dest[64 * 64]);
+
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
+
+    int j = 0;
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        int count = 256 + i;
+        int index = i % TEST_CASES;
+        checked(opt, opt_dest, double_test_buff[index] + j, count);
+        ref(ref_dest, double_test_buff[index] + j, count);
+
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(uint16_t)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_cutree_fix8_unpack(cutree_fix8_unpack ref, cutree_fix8_unpack opt)
+{
+    ALIGN_VAR_32(double, ref_dest[64 * 64]);
+    ALIGN_VAR_32(double, opt_dest[64 * 64]);
+
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
+
+    int j = 0;
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        int count = 256 + i;
+        int index = i % TEST_CASES;
+        checked(opt, opt_dest, ushort_test_buff[index] + j, count);
+        ref(ref_dest, ushort_test_buff[index] + j, count);
+
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(double)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
 bool PixelHarness::check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt)
 {
     int j = 0, index1, index2, optres, refres;
@@ -1819,34 +1875,6 @@
     return true;
 }
 
-bool PixelHarness::check_planeClipAndMax(planeClipAndMax_t ref, planeClipAndMax_t opt)
-{
-    for (int i = 0; i < ITERS; i++)
-    {
-        intptr_t rand_stride = rand() % STRIDE;
-        int rand_width = (rand() % (STRIDE * 2)) + 1;
-        const int rand_height = (rand() % MAX_HEIGHT) + 1;
-        const pixel rand_min = rand() % 32;
-        const pixel rand_max = PIXEL_MAX - (rand() % 32);
-        uint64_t ref_sum, opt_sum;
-
-        // video width must be more than or equal to 32
-        if (rand_width < 32)
-            rand_width = 32;
-
-        // stride must be more than or equal to width
-        if (rand_stride < rand_width)
-            rand_stride = rand_width;
-
-        pixel ref_max = ref(pbuf1, rand_stride, rand_width, rand_height, &ref_sum, rand_min, rand_max);
-        pixel opt_max = (pixel)checked(opt, pbuf1, rand_stride, rand_width, rand_height, &opt_sum, rand_min, rand_max);
-
-        if (ref_max != opt_max)
-            return false;
-    }
-    return true;
-}
-
 bool PixelHarness::check_pelFilterLumaStrong_H(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt)
 {
     intptr_t srcStep = 1, offset = 64;
@@ -1913,6 +1941,68 @@
     return true;
 }
 
+bool PixelHarness::check_pelFilterChroma_H(pelFilterChroma_t ref, pelFilterChroma_t opt)
+{
+    intptr_t srcStep = 1, offset = 64;
+    int32_t maskP, maskQ, tc;
+    int j = 0;
+
+    pixel pixel_test_buff1[TEST_CASES][BUFFSIZE];
+    for (int i = 0; i < TEST_CASES; i++)
+        memcpy(pixel_test_buff1[i], pixel_test_buff[i], sizeof(pixel)* BUFFSIZE);
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        tc = rand() % PIXEL_MAX;
+        maskP = (rand() % PIXEL_MAX) - 1;
+        maskQ = (rand() % PIXEL_MAX) - 1;
+
+        int index = rand() % 3;
+
+        ref(pixel_test_buff[index] + 4 * offset + j, srcStep, offset, tc, maskP, maskQ);
+        checked(opt, pixel_test_buff1[index] + 4 * offset + j, srcStep, offset, tc, maskP, maskQ);
+
+        if (memcmp(pixel_test_buff[index], pixel_test_buff1[index], sizeof(pixel)* BUFFSIZE))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_pelFilterChroma_V(pelFilterChroma_t ref, pelFilterChroma_t opt)
+{
+    intptr_t srcStep = 64, offset = 1;
+    int32_t maskP, maskQ, tc;
+    int j = 0;
+
+    pixel pixel_test_buff1[TEST_CASES][BUFFSIZE];
+    for (int i = 0; i < TEST_CASES; i++)
+        memcpy(pixel_test_buff1[i], pixel_test_buff[i], sizeof(pixel)* BUFFSIZE);
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        tc = rand() % PIXEL_MAX;
+        maskP = (rand() % PIXEL_MAX) - 1;
+        maskQ = (rand() % PIXEL_MAX) - 1;
+
+        int index = rand() % 3;
+
+        ref(pixel_test_buff[index] + 4 + j, srcStep, offset, tc, maskP, maskQ);
+        checked(opt, pixel_test_buff1[index] + 4 + j, srcStep, offset, tc, maskP, maskQ);
+
+        if (memcmp(pixel_test_buff[index], pixel_test_buff1[index], sizeof(pixel)* BUFFSIZE))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
 bool PixelHarness::testPU(int part, const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
     if (opt.pu[part].satd)
@@ -2498,6 +2588,24 @@
         }
     }
 
+    if (opt.fix8Pack)
+    {
+        if (!check_cutree_fix8_pack(ref.fix8Pack, opt.fix8Pack))
+        {
+            printf("cuTreeFix8Pack failed\n");
+            return false;
+        }
+    }
+
+    if (opt.fix8Unpack)
+    {
+        if (!check_cutree_fix8_unpack(ref.fix8Unpack, opt.fix8Unpack))
+        {
+            printf("cuTreeFix8Unpack failed\n");
+            return false;
+        }
+    }
+
     if (opt.scanPosLast)
     {
         if (!check_scanPosLast(ref.scanPosLast, opt.scanPosLast))
@@ -2544,15 +2652,6 @@
         }
     }
 
-    if (opt.planeClipAndMax)
-    {
-        if (!check_planeClipAndMax(ref.planeClipAndMax, opt.planeClipAndMax))
-        {
-            printf("planeClipAndMax failed!\n");
-            return false;
-        }
-    }
-
     if (opt.pelFilterLumaStrong[0])
    {
         if (!check_pelFilterLumaStrong_V(ref.pelFilterLumaStrong[0], opt.pelFilterLumaStrong[0]))
@@ -2571,6 +2670,24 @@
         }
     }
 
+    if (opt.pelFilterChroma[0])
+    {
+        if (!check_pelFilterChroma_V(ref.pelFilterChroma[0], opt.pelFilterChroma[0]))
+        {
+            printf("pelFilterChroma Vertical failed!\n");
+            return false;
+        }
+    }
+
+    if (opt.pelFilterChroma[1])
+    {
+        if (!check_pelFilterChroma_H(ref.pelFilterChroma[1], opt.pelFilterChroma[1]))
+        {
+            printf("pelFilterChroma Horizontal failed!\n");
+            return false;
+        }
+    }
+
     return true;
 }
@@ -2988,6 +3105,18 @@
         REPORT_SPEEDUP(opt.propagateCost, ref.propagateCost, ibuf1, ushort_test_buff[0], int_test_buff[0], ushort_test_buff[0], int_test_buff[0], double_test_buff[0], 80);
     }
 
+    if (opt.fix8Pack)
+    {
+        HEADER0("cuTreeFix8Pack");
+        REPORT_SPEEDUP(opt.fix8Pack, ref.fix8Pack, ushort_test_buff[0], double_test_buff[0], 390);
+    }
+
+    if (opt.fix8Unpack)
+    {
+        HEADER0("cuTreeFix8Unpack");
+        REPORT_SPEEDUP(opt.fix8Unpack, ref.fix8Unpack, double_test_buff[0], ushort_test_buff[0], 390);
+    }
+
     if (opt.scanPosLast)
     {
         HEADER0("scanPosLast");
@@ -3048,13 +3177,6 @@
         REPORT_SPEEDUP(opt.costC1C2Flag, ref.costC1C2Flag, abscoefBuf, C1FLAG_NUMBER, (uint8_t*)psbuf1, 1);
     }
 
-    if (opt.planeClipAndMax)
-    {
-        HEADER0("planeClipAndMax");
-        uint64_t dummy;
-        REPORT_SPEEDUP(opt.planeClipAndMax, ref.planeClipAndMax, pbuf1, 128, 63, 62, &dummy, 1, PIXEL_MAX - 1);
-    }
-
     if (opt.pelFilterLumaStrong[0])
     {
         int32_t tcP = (rand() % PIXEL_MAX) - 1;
@@ -3070,4 +3192,22 @@
         HEADER0("pelFilterLumaStrong_Horizontal");
         REPORT_SPEEDUP(opt.pelFilterLumaStrong[1], ref.pelFilterLumaStrong[1], pbuf1, 1, STRIDE, tcP, tcQ);
     }
+
+    if (opt.pelFilterChroma[0])
+    {
+        int32_t tc = (rand() % PIXEL_MAX);
+        int32_t maskP = (rand() % PIXEL_MAX) - 1;
+        int32_t maskQ = (rand() % PIXEL_MAX) - 1;
+        HEADER0("pelFilterChroma_Vertical");
+        REPORT_SPEEDUP(opt.pelFilterChroma[0], ref.pelFilterChroma[0], pbuf1, STRIDE, 1, tc, maskP, maskQ);
+    }
+
+    if (opt.pelFilterChroma[1])
+    {
+        int32_t tc = (rand() % PIXEL_MAX);
+        int32_t maskP = (rand() % PIXEL_MAX) - 1;
+        int32_t maskQ = (rand() % PIXEL_MAX) - 1;
+        HEADER0("pelFilterChroma_Horizontal");
+        REPORT_SPEEDUP(opt.pelFilterChroma[1], ref.pelFilterChroma[1], pbuf1, 1, STRIDE, tc, maskP, maskQ);
+    }
 }
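The new checks above follow the harness's standard pattern: poison two output buffers identically, run the C reference and the optimized primitive on the same randomized inputs, then memcmp the whole buffer so stray out-of-bounds writes fail the test along with wrong results. Below is a minimal self-contained sketch of that pattern; the fix8_pack_t type and pack_c reference are illustrative stand-ins, not the real x265 primitive signatures.

// Minimal sketch of the ref-vs-opt check pattern used by PixelHarness.
// fix8_pack_t and pack_c are hypothetical stand-ins for the real primitives.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef void (*fix8_pack_t)(uint16_t *dst, const double *src, int count);

// Reference implementation: convert doubles to unsigned Q8.8 fixed point.
static void pack_c(uint16_t *dst, const double *src, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = (uint16_t)(src[i] * 256.0);
}

static bool check_pack(fix8_pack_t ref, fix8_pack_t opt)
{
    uint16_t refDst[64 * 64], optDst[64 * 64];
    double src[64 * 64];

    // Poison both outputs identically: bytes a correct primitive leaves
    // untouched still compare equal, while stray writes show up in memcmp.
    memset(refDst, 0xCD, sizeof(refDst));
    memset(optDst, 0xCD, sizeof(optDst));
    for (int i = 0; i < 64 * 64; i++)
        src[i] = (rand() % 512) / 256.0;   // non-negative values < 2.0

    for (int count = 256; count < 272; count++)
    {
        ref(refDst, src, count);
        opt(optDst, src, count);
        if (memcmp(refDst, optDst, sizeof(refDst)))
            return false;                  // any mismatch fails the primitive
    }
    return true;
}

int main()
{
    // With no assembly version at hand, the C function checks against itself.
    printf("fix8Pack %s\n", check_pack(pack_c, pack_c) ? "ok" : "failed");
    return 0;
}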
View file
x265_1.9.tar.gz/source/test/pixelharness.h -> x265_2.0.tar.gz/source/test/pixelharness.h
Changed
@@ -113,6 +113,8 @@
     bool check_planecopy_sp(planecopy_sp_t ref, planecopy_sp_t opt);
     bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);
     bool check_cutree_propagate_cost(cutree_propagate_cost ref, cutree_propagate_cost opt);
+    bool check_cutree_fix8_pack(cutree_fix8_pack ref, cutree_fix8_pack opt);
+    bool check_cutree_fix8_unpack(cutree_fix8_unpack ref, cutree_fix8_unpack opt);
     bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
     bool check_calSign(sign_t ref, sign_t opt);
     bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
@@ -120,9 +122,10 @@
     bool check_costCoeffNxN(costCoeffNxN_t ref, costCoeffNxN_t opt);
     bool check_costCoeffRemain(costCoeffRemain_t ref, costCoeffRemain_t opt);
     bool check_costC1C2Flag(costC1C2Flag_t ref, costC1C2Flag_t opt);
-    bool check_planeClipAndMax(planeClipAndMax_t ref, planeClipAndMax_t opt);
     bool check_pelFilterLumaStrong_V(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt);
     bool check_pelFilterLumaStrong_H(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt);
+    bool check_pelFilterChroma_V(pelFilterChroma_t ref, pelFilterChroma_t opt);
+    bool check_pelFilterChroma_H(pelFilterChroma_t ref, pelFilterChroma_t opt);
 
 public:
View file
x265_1.9.tar.gz/source/test/rate-control-tests.txt -> x265_2.0.tar.gz/source/test/rate-control-tests.txt
Changed
@@ -25,6 +25,11 @@
 
 # multi-pass rate control tests
 
+sita_1920x1080_30.yuv, --preset ultrafast --crf 20 --no-cutree --no-scenecut --keyint 50 --no-open-gop --pass 1 --vbv-bufsize 7000 --vbv-maxrate 5000, --preset ultrafast --crf 20 --no-cutree --no-scenecut --keyint 50 --no-open-gop --pass 2 --vbv-bufsize 7000 --vbv-maxrate 5000
+sita_1920x1080_30.yuv, --preset medium --crf 20 --no-cutree --no-scenecut --keyint 50 --no-open-gop --pass 1 --vbv-bufsize 7000 --vbv-maxrate 5000, --preset medium --crf 20 --no-cutree --no-scenecut --keyint 50 --no-open-gop --pass 2 --vbv-bufsize 7000 --vbv-maxrate 5000
+sintel_trailer_2k_480p24.y4m, --preset medium --crf 18 --no-cutree --no-scenecut --no-open-gop --keyint 50 --vbv-bufsize 1200 --vbv-maxrate 1000 --pass 1, --preset medium --crf 18 --no-cutree --no-scenecut --no-open-gop --keyint 50 --vbv-bufsize 1200 --vbv-maxrate 1000 --pass 2
+sintel_trailer_2k_480p24.y4m, --preset veryslow --crf 18 --no-cutree --no-scenecut --no-open-gop --keyint 50 --vbv-bufsize 1200 --vbv-maxrate 1000 --pass 1, --preset veryslow --crf 18 --no-cutree --no-scenecut --no-open-gop --keyint 50 --vbv-bufsize 1200 --vbv-maxrate 1000 --pass 2
+ten_teaser_3840x2160_50_10bit.yuv, --preset medium --crf 25 --no-cutree --no-open-gop --no-scenecut --keyint 50 --vbv-maxrate 10000 --vbv-bufsize 12000 --pass 1, --preset medium --crf 25 --no-cutree --no-open-gop --no-scenecut --keyint 50 --vbv-maxrate 10000 --vbv-bufsize 12000 --pass 2
 big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1 -f 5000,--preset slow --bitrate 200 --pass 2 -f 5000
 big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass -f 5000 ,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4 -f 5000
 112_1920x1080_25.yuv,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1000 --strict-cbr --pass 1 -F4,--preset fast --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --pass 2 -F4
View file
x265_1.9.tar.gz/source/test/regression-tests.txt -> x265_2.0.tar.gz/source/test/regression-tests.txt
Changed
@@ -67,6 +67,7 @@
 News-4k.y4m,--preset ultrafast --no-cutree --analysis-mode=save --bitrate 15000,--preset ultrafast --no-cutree --analysis-mode=load --bitrate 15000
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
 News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16
+News-4k.y4m,--preset veryslow --no-rskip
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
View file
x265_1.9.tar.gz/source/test/testbench.cpp -> x265_2.0.tar.gz/source/test/testbench.cpp
Changed
@@ -169,6 +169,9 @@
     { "XOP", X265_CPU_XOP },
     { "AVX2", X265_CPU_AVX2 },
     { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 },
+    { "ARMv6", X265_CPU_ARMV6 },
+    { "NEON", X265_CPU_NEON },
+    { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
     { "", 0 },
 };
 
@@ -182,6 +185,7 @@
         else
             continue;
 
+#if X265_ARCH_X86
         EncoderPrimitives vecprim;
         memset(&vecprim, 0, sizeof(vecprim));
         setupInstrinsicPrimitives(vecprim, test_arch[i].flag);
@@ -197,6 +201,7 @@
             return -1;
         }
     }
+#endif
 
     EncoderPrimitives asmprim;
     memset(&asmprim, 0, sizeof(asmprim));
@@ -220,7 +225,9 @@
 
     EncoderPrimitives optprim;
     memset(&optprim, 0, sizeof(optprim));
+#if X265_ARCH_X86
     setupInstrinsicPrimitives(optprim, cpuid);
+#endif
     setupAssemblyPrimitives(optprim, cpuid);
 
     /* Note that we do not setup aliases for performance tests, that would be
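The test_arch table that gains the ARMv6/NEON entries above maps a cpuid name given on the testbench command line to a capability bitmask before the per-architecture primitives are set up. A tiny sketch of that lookup, with made-up flag values rather than the real X265_CPU_* constants:

// Name-to-bitmask lookup, terminated by an empty-name sentinel entry,
// mirroring the convention of the test_arch table above. Flag values
// here are illustrative only.
#include <cstdio>
#include <cstring>

struct CpuName { const char *name; unsigned flag; };

static const CpuName testArch[] = {
    { "AVX2", 1u << 0 },
    { "ARMv6", 1u << 1 },
    { "NEON", 1u << 2 },
    { "", 0 },              // sentinel terminates the scan
};

static unsigned lookupCpu(const char *arg)
{
    for (int i = 0; testArch[i].flag; i++)
        if (!strcmp(arg, testArch[i].name))
            return testArch[i].flag;
    return 0;               // unknown name: no capability bits
}

int main()
{
    printf("NEON -> 0x%x\n", lookupCpu("NEON"));   // prints 0x4
    return 0;
}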
View file
x265_1.9.tar.gz/source/test/testharness.h -> x265_2.0.tar.gz/source/test/testharness.h
Changed
@@ -32,7 +32,6 @@
 #pragma warning(disable: 4324) // structure was padded due to __declspec(align())
 #endif
 
-#define PIXEL_MAX ((1 << X265_DEPTH) - 1)
 #define PIXEL_MIN 0
 #define SHORT_MAX 32767
 #define SHORT_MIN -32767
@@ -75,10 +74,17 @@
 {
     uint32_t a = 0;
 
+#if X265_ARCH_X86
     asm volatile("rdtsc" : "=a" (a) ::"edx");
+#elif X265_ARCH_ARM
+    // TO-DO: verify following inline asm to get cpu Timestamp Counter for ARM arch
+    // asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(a));
+
+    // TO-DO: replace clock() function with appropriate ARM cpu instructions
+    a = clock();
+#endif
     return a;
 }
-
 #endif // ifdef _MSC_VER
 
 #define BENCH_RUNS 1000
@@ -125,7 +131,7 @@
  * needs an explicit asm check because it only sometimes crashes in normal use. */
 intptr_t PFX(checkasm_call)(intptr_t (*func)(), int *ok, ...);
 float PFX(checkasm_call_float)(float (*func)(), int *ok, ...);
-#else
+#elif X265_ARCH_ARM == 0
 #define PFX(stack_pagealign)(func, align) func()
 #endif
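The ARM branch above falls back to clock() and leaves a TO-DO for a proper cycle counter. One portable interim option (an assumption on my part, not what x265 ships) is a std::chrono tick source; it measures wall-clock nanoseconds rather than CPU cycles, so the resulting numbers are only comparable between runs on the same machine.

// Portable stand-in for a benchmark tick source: wall-clock nanoseconds
// via std::chrono, not true CPU cycles.
#include <chrono>
#include <cstdint>
#include <cstdio>

static inline uint64_t bench_ticks()
{
    using namespace std::chrono;
    return (uint64_t)duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
}

int main()
{
    uint64_t t0 = bench_ticks();
    volatile int sink = 0;
    for (int i = 0; i < 1000000; i++)
        sink += i;                       // dummy work to time
    uint64_t t1 = bench_ticks();
    printf("elapsed: %llu ns\n", (unsigned long long)(t1 - t0));
    return 0;
}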
View file
x265_1.9.tar.gz/source/x265-extras.cpp -> x265_2.0.tar.gz/source/x265-extras.cpp
Changed
@@ -46,17 +46,17 @@
         return NULL;
     }
 
-    FILE *csvfp = fopen(fname, "r");
+    FILE *csvfp = x265_fopen(fname, "r");
     if (csvfp)
     {
         /* file already exists, re-open for append */
         fclose(csvfp);
-        return fopen(fname, "ab");
+        return x265_fopen(fname, "ab");
     }
     else
     {
         /* new CSV file, write header */
-        csvfp = fopen(fname, "wb");
+        csvfp = x265_fopen(fname, "wb");
         if (csvfp)
         {
             if (level)
@@ -280,9 +280,9 @@
         fprintf(csvfp, " %-6u, %-6u, %s\n", stats.maxCLL, stats.maxFALL, api.version_str);
 }
 
-/* The dithering algorithm is based on Sierra-2-4A error diffusion. */
-static void ditherPlane(pixel *dst, int dstStride, uint16_t *src, int srcStride,
-                        int width, int height, int16_t *errors, int bitDepth)
+/* The dithering algorithm is based on Sierra-2-4A error diffusion.
+ * We convert planes in place (without allocating a new buffer). */
+static void ditherPlane(uint16_t *src, int srcStride, int width, int height, int16_t *errors, int bitDepth)
 {
     const int lShift = 16 - bitDepth;
     const int rShift = 16 - bitDepth + 2;
@@ -290,15 +290,34 @@
     const int pixelMax = (1 << bitDepth) - 1;
 
     memset(errors, 0, (width + 1) * sizeof(int16_t));
-    int pitch = 1;
-    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
+
+    if (bitDepth == 8)
     {
-        int16_t err = 0;
-        for (int x = 0; x < width; x++)
+        for (int y = 0; y < height; y++, src += srcStride)
         {
-            err = err * 2 + errors[x] + errors[x + 1];
-            dst[x * pitch] = (pixel)x265_clip3(0, pixelMax, ((src[x * 1] << 2) + err + half) >> rShift);
-            errors[x] = err = src[x * pitch] - (dst[x * pitch] << lShift);
+            uint8_t* dst = (uint8_t *)src;
+            int16_t err = 0;
+            for (int x = 0; x < width; x++)
+            {
+                err = err * 2 + errors[x] + errors[x + 1];
+                int tmpDst = x265_clip3(0, pixelMax, ((src[x] << 2) + err + half) >> rShift);
+                errors[x] = err = (int16_t)(src[x] - (tmpDst << lShift));
+                dst[x] = (uint8_t)tmpDst;
+            }
+        }
+    }
+    else
+    {
+        for (int y = 0; y < height; y++, src += srcStride)
+        {
+            int16_t err = 0;
+            for (int x = 0; x < width; x++)
+            {
+                err = err * 2 + errors[x] + errors[x + 1];
+                int tmpDst = x265_clip3(0, pixelMax, ((src[x] << 2) + err + half) >> rShift);
+                errors[x] = err = (int16_t)(src[x] - (tmpDst << lShift));
+                src[x] = (uint16_t)tmpDst;
+            }
         }
     }
 }
@@ -317,10 +336,16 @@
         return;
     }
 
+    if (picIn.bitDepth == bitDepth)
+    {
+        fprintf(stderr, "extras[error]: dither support enabled only if encoder depth is different from picture depth\n");
+        return;
+    }
+
     /* This portion of code is from readFrame in x264. */
     for (int i = 0; i < x265_cli_csps[picIn.colorSpace].planes; i++)
     {
-        if ((picIn.bitDepth & 7) && (picIn.bitDepth != 16))
+        if (picIn.bitDepth < 16)
         {
             /* upconvert non 16bit high depth planes to 16bit */
             uint16_t *plane = (uint16_t*)picIn.planes[i];
@@ -332,14 +357,10 @@
             for (uint32_t j = 0; j < pixelCount; j++)
                 plane[j] = plane[j] << lShift;
         }
-    }
 
-    for (int i = 0; i < x265_cli_csps[picIn.colorSpace].planes; i++)
-    {
         int height = (int)(picHeight >> x265_cli_csps[picIn.colorSpace].height[i]);
         int width = (int)(picWidth >> x265_cli_csps[picIn.colorSpace].width[i]);
 
-        ditherPlane(((pixel*)picIn.planes[i]), picIn.stride[i] / sizeof(pixel), ((uint16_t*)picIn.planes[i]),
-                    picIn.stride[i] / 2, width, height, errorBuf, bitDepth);
+        ditherPlane(((uint16_t*)picIn.planes[i]), picIn.stride[i] / 2, width, height, errorBuf, bitDepth);
     }
 }
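For readers following the ditherPlane rewrite above, the inner loop is plain one-row error diffusion: quantize a pre-scaled 16-bit sample, then carry the rounding error forward into neighbouring pixels. A self-contained toy version of the same arithmetic on one 8-pixel row (buffer contents are illustrative):

// Toy single-row version of the error-diffusion quantization used by
// ditherPlane: 16-bit samples down to 8 bits, errors carried forward.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    const int width = 8, bitDepth = 8;
    const int lShift = 16 - bitDepth;       // error kept at 16-bit scale
    const int rShift = 16 - bitDepth + 2;   // source is pre-scaled by 4 below
    const int half = 1 << (rShift - 1);
    const int pixelMax = (1 << bitDepth) - 1;

    uint16_t src[width] = { 100, 300, 500, 700, 900, 1100, 1300, 1500 };
    uint8_t dst[width];
    int16_t errors[width + 1];
    memset(errors, 0, sizeof(errors));

    int16_t err = 0;
    for (int x = 0; x < width; x++)
    {
        err = err * 2 + errors[x] + errors[x + 1];          // gather stored error
        int v = ((src[x] << 2) + err + half) >> rShift;     // quantize with rounding
        if (v < 0) v = 0;
        if (v > pixelMax) v = pixelMax;                     // clip to bit depth
        errors[x] = err = (int16_t)(src[x] - (v << lShift)); // remember new error
        dst[x] = (uint8_t)v;
    }

    for (int x = 0; x < width; x++)
        printf("%u ", dst[x]);
    printf("\n");
    return 0;
}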
View file
x265_1.9.tar.gz/source/x265.cpp -> x265_2.0.tar.gz/source/x265.cpp
Changed
@@ -29,14 +29,10 @@
 #include "x265-extras.h"
 #include "x265cli.h"
 
-#include "common.h"
 #include "input/input.h"
 #include "output/output.h"
 #include "output/reconplay.h"
 
-#include "param.h"
-#include "cpu.h"
-
 #if HAVE_VLD
 /* Visual Leak Detector */
 #include <vld.h>
@@ -312,12 +308,9 @@
             OPT("recon-y4m-exec") reconPlayCmd = optarg;
             OPT("qpfile")
             {
-                this->qpfile = fopen(optarg, "rb");
+                this->qpfile = x265_fopen(optarg, "rb");
                 if (!this->qpfile)
-                {
-                    x265_log(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file\n", optarg);
-                    return false;
-                }
+                    x265_log_file(param, X265_LOG_ERROR, "%s qpfile not found or error in opening qp file\n", optarg);
             }
             else
                 bError |= !!api->param_parse(param, long_options[long_options_index].name, optarg);
@@ -378,7 +371,7 @@
     this->input = InputFile::open(info, this->bForceY4m);
     if (!this->input || this->input->isFail())
     {
-        x265_log(param, X265_LOG_ERROR, "unable to open input file <%s>\n", inputfn);
+        x265_log_file(param, X265_LOG_ERROR, "unable to open input file <%s>\n", inputfn);
         return true;
     }
 
@@ -455,10 +448,10 @@
     this->output = OutputFile::open(outputfn, info);
     if (this->output->isFail())
     {
-        x265_log(param, X265_LOG_ERROR, "failed to open output file <%s> for writing\n", outputfn);
+        x265_log_file(param, X265_LOG_ERROR, "failed to open output file <%s> for writing\n", outputfn);
         return true;
     }
-    general_log(param, this->output->getName(), X265_LOG_INFO, "output file: %s\n", outputfn);
+    general_log_file(param, this->output->getName(), X265_LOG_INFO, "output file: %s\n", outputfn);
 
     return false;
 }
@@ -497,6 +490,39 @@
     return 1;
 }
 
+#ifdef _WIN32
+/* Copy of x264 code, which allows for Unicode characters in the command line.
+ * Retrieve command line arguments as UTF-8. */
+static int get_argv_utf8(int *argc_ptr, char ***argv_ptr)
+{
+    int ret = 0;
+    wchar_t **argv_utf16 = CommandLineToArgvW(GetCommandLineW(), argc_ptr);
+    if (argv_utf16)
+    {
+        int argc = *argc_ptr;
+        int offset = (argc + 1) * sizeof(char*);
+        int size = offset;
+
+        for (int i = 0; i < argc; i++)
+            size += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1, NULL, 0, NULL, NULL);
+
+        char **argv = *argv_ptr = (char**)malloc(size);
+        if (argv)
+        {
+            for (int i = 0; i < argc; i++)
+            {
+                argv[i] = (char*)argv + offset;
+                offset += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1, argv[i], size - offset, NULL, NULL);
+            }
+            argv[argc] = NULL;
+            ret = 1;
+        }
+        LocalFree(argv_utf16);
+    }
+    return ret;
+}
+#endif
+
 /* CLI return codes:
  *
  * 0 - encode successful
@@ -517,6 +543,10 @@
     GetConsoleTitle(orgConsoleTitle, CONSOLE_TITLE_SIZE);
     SetThreadExecutionState(ES_CONTINUOUS | ES_SYSTEM_REQUIRED | ES_AWAYMODE_REQUIRED);
 
+#if _WIN32
+    char** orgArgv = argv;
+    get_argv_utf8(&argc, &argv);
+#endif
 
     ReconPlay* reconPlay = NULL;
     CLIOptions cliopt;
@@ -560,7 +590,7 @@
         cliopt.csvfpt = x265_csvlog_open(*api, *param, cliopt.csvfn, cliopt.csvLogLevel);
         if (!cliopt.csvfpt)
         {
-            x265_log(param, X265_LOG_ERROR, "Unable to open CSV log file <%s>, aborting\n", cliopt.csvfn);
+            x265_log_file(param, X265_LOG_ERROR, "Unable to open CSV log file <%s>, aborting\n", cliopt.csvfn);
             cliopt.destroy();
             if (cliopt.api)
                 cliopt.api->param_free(cliopt.param);
@@ -747,6 +777,14 @@
     SetConsoleTitle(orgConsoleTitle);
     SetThreadExecutionState(ES_CONTINUOUS);
 
+#if _WIN32
+    if (argv != orgArgv)
+    {
+        free(argv);
+        argv = orgArgv;
+    }
+#endif
+
 #if HAVE_VLD
     assert(VLDReportLeaks() == 0);
 #endif
View file
x265_1.9.tar.gz/source/x265.h -> x265_2.0.tar.gz/source/x265.h
Changed
@@ -98,9 +98,9 @@
     uint32_t sliceType;
     uint32_t numCUsInFrame;
     uint32_t numPartitions;
+    int bScenecut;
     void* interData;
     void* intraData;
-    int bScenecut;
 } x265_analysis_data;
 
 /* cu statistics */
@@ -221,6 +221,14 @@
 
     /* Frame level statistics */
     x265_frame_stats frameData;
+
+    /* Ratecontrol statistics for collecting the ratecontrol information.
+     * It is not used for collecting the last pass ratecontrol data in
+     * multi pass ratecontrol mode. */
+    void* rcData;
+
+    uint64_t framesize;
+
+    int height;
 } x265_picture;
 
 typedef enum
@@ -587,6 +595,11 @@
      * Main (0) and High (1) tier. Default is Main tier (0) */
     int bHighTier;
 
+    /* Enable UHD Blu-ray compatibility support. If specified, the encoder will
+     * attempt to modify/set the encode specifications. If the encoder is unable
+     * to do so, this option will be turned OFF. */
+    int uhdBluray;
+
     /* The maximum number of L0 references a P or B slice may use. This
      * influences the size of the decoded picture buffer. The higher this
      * number, the more reference frames there will be available for motion
@@ -764,7 +777,7 @@
      * enabled). At level 2 rate-distortion cost is used to make decimate decisions
      * on each 4x4 coding group (including the cost of signaling the group within
      * the group bitmap). Psy-rdoq is less effective at preserving energy when
-     * RDOQ is at level 2 */
+     * RDOQ is at level 2. Default: 0 */
     int rdoqLevel;
 
     /* Enable the implicit signaling of the sign bit of the last coefficient of
@@ -896,23 +909,27 @@
     /* Note: when deblocking and SAO are both enabled, the loop filter CU lag is
      * only one row, as they operate in series on the same row. */
 
-    /* Select the method in which SAO deals with deblocking boundary pixels. If
+    /* Select the method in which SAO deals with deblocking boundary pixels. If
      * disabled the right and bottom boundary areas are skipped. If enabled,
      * non-deblocked pixels are used entirely. Default is disabled */
     int bSaoNonDeblocked;
 
     /*== Analysis tools ==*/
 
-    /* A value between X265_NO_RDO_NO_RDOQ and X265_RDO_LEVEL which determines
-     * the level of rate distortion optimizations to perform during mode
-     * decisions and quantization. The more RDO the better the compression
-     * efficiency at a major cost of performance. Default is no RDO (0) */
+    /* A value between 1 and 6 (both inclusive) which determines the level of
+     * rate distortion optimizations to perform during mode and depth decisions.
+     * The more RDO the better the compression efficiency at a major cost of
+     * performance. Default is 3 */
     int rdLevel;
 
-    /* Enable early skip decisions to avoid intra and inter analysis in likely
+    /* Enable early skip decisions to avoid analysing additional modes in likely
      * skip blocks. Default is disabled */
     int bEnableEarlySkip;
 
+    /* Enable early CU size decisions to avoid recursing to higher depths.
+     * Default is enabled */
+    int bEnableRecursionSkip;
+
     /* Use a faster search method to find the best intra mode. Default is 0 */
     int bEnableFastIntra;
 
@@ -947,10 +964,16 @@
     double psyRd;
 
     /* Strength of psycho-visual optimizations in quantization. Only has an
-     * effect in presets which use RDOQ (rd-levels 4 and 5). The value must be
-     * between 0 and 50, 1.0 is typical. Default 1.0 */
+     * effect when RDOQ is enabled (presets slow, slower and veryslow). The
+     * value must be between 0 and 50, 1.0 is typical. Default 0 */
    double psyRdoq;
 
+    /* Perform quantisation parameter based RD refinement. RD cost is calculated
+     * on the best CU partitions, chosen after the CU analysis, for a range of QPs
+     * to find the optimal rounding effect. Only effective at rd-levels 5 and 6.
+     * Default disabled */
+    int bEnableRdRefine;
+
     /* If X265_ANALYSIS_SAVE, write per-frame analysis information into analysis
     * buffers. if X265_ANALYSIS_LOAD, read analysis information into analysis
     * buffer and use this analysis information to reduce the amount of work
@@ -1083,6 +1106,9 @@
          * (QG) size. Allowed values are 64, 32, 16 provided it falls within the
          * inclusive range [maxCUSize, minCUSize]. Experimental, default: maxCUSize */
         uint32_t qgSize;
+
+        /* internally enabled if tune grain is set */
+        int bEnableGrain;
     } rc;
 
     /*== Video Usability Information ==*/
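A sketch of how an API client might enable the new parameters, assuming it compiles against this 2.0 header; the member names come straight from the diff above, and x265_param_alloc/x265_param_default_preset/x265_param_free are the existing public entry points:

// Configure the 2.0 options from the public API. The "grain" tune sets
// rc.bEnableGrain and related analysis options internally.
#include <x265.h>
#include <cstdio>

int main()
{
    x265_param *param = x265_param_alloc();
    if (!param)
        return 1;

    if (x265_param_default_preset(param, "slow", "grain") < 0)
    {
        x265_param_free(param);
        return 1;
    }

    param->uhdBluray = 1;              // enforce UHD Blu-ray constraints
    param->bEnableRecursionSkip = 1;   // rskip: skip CU recursion heuristically
    param->bEnableRdRefine = 1;        // QP-based RD refinement (rd-levels 5/6)

    printf("rc-grain: %d\n", param->rc.bEnableGrain);
    x265_param_free(param);
    return 0;
}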
View file
x265_1.9.tar.gz/source/x265cli.h -> x265_2.0.tar.gz/source/x265cli.h
Changed
@@ -53,6 +53,7 @@
     { "profile",        required_argument, NULL, 'P' },
     { "level-idc",      required_argument, NULL, 0 },
     { "high-tier",      no_argument, NULL, 0 },
+    { "uhd-bd",         no_argument, NULL, 0 },
     { "no-high-tier",   no_argument, NULL, 0 },
     { "allow-non-conformance",no_argument, NULL, 0 },
     { "no-allow-non-conformance",no_argument, NULL, 0 },
@@ -96,6 +97,8 @@
     { "amp",            no_argument, NULL, 0 },
     { "no-early-skip",  no_argument, NULL, 0 },
     { "early-skip",     no_argument, NULL, 0 },
+    { "no-rskip",       no_argument, NULL, 0 },
+    { "rskip",          no_argument, NULL, 0 },
     { "no-fast-cbf",    no_argument, NULL, 0 },
     { "fast-cbf",       no_argument, NULL, 0 },
     { "no-tskip",       no_argument, NULL, 0 },
@@ -143,6 +146,8 @@
     { "qp",             required_argument, NULL, 'q' },
     { "aq-mode",        required_argument, NULL, 0 },
     { "aq-strength",    required_argument, NULL, 0 },
+    { "rc-grain",       no_argument, NULL, 0 },
+    { "no-rc-grain",    no_argument, NULL, 0 },
     { "ipratio",        required_argument, NULL, 0 },
     { "pbratio",        required_argument, NULL, 0 },
     { "qcomp",          required_argument, NULL, 0 },
@@ -159,6 +164,8 @@
     { "psy-rdoq",       required_argument, NULL, 0 },
     { "no-psy-rd",      no_argument, NULL, 0 },
     { "no-psy-rdoq",    no_argument, NULL, 0 },
+    { "rd-refine",      no_argument, NULL, 0 },
+    { "no-rd-refine",   no_argument, NULL, 0 },
     { "scaling-list",   required_argument, NULL, 0 },
     { "lossless",       no_argument, NULL, 0 },
     { "no-lossless",    no_argument, NULL, 0 },
@@ -279,6 +286,7 @@
     H0("-P/--profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
     H0("   --level-idc <integer|float>   Force a minimum required decoder level (as '5.0' or '50')\n");
     H0("   --[no-]high-tier              If a decoder level is specified, this modifier selects High tier of that level\n");
+    H0("   --uhd-bd                      Enable UHD Bluray compatibility support\n");
     H0("   --[no-]allow-non-conformance  Allow the encoder to generate profile NONE bitstreams. Default %s\n", OPT(param->bAllowNonConformance));
     H0("\nThreading, performance:\n");
     H0("   --pools <integer,...>         Comma separated thread count per thread pool (pool per NUMA node)\n");
@@ -300,11 +308,13 @@
     H0("   --tu-intra-depth <integer>    Max TU recursive depth for intra CUs. Default %d\n", param->tuQTMaxIntraDepth);
     H0("   --tu-inter-depth <integer>    Max TU recursive depth for inter CUs. Default %d\n", param->tuQTMaxInterDepth);
     H0("\nAnalysis:\n");
-    H0("   --rd <0..6>                   Level of RDO in mode decision 0:least....6:full RDO. Default %d\n", param->rdLevel);
+    H0("   --rd <1..6>                   Level of RDO in mode decision 1:least....6:full RDO. Default %d\n", param->rdLevel);
     H0("   --[no-]psy-rd <0..5.0>        Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd);
     H0("   --[no-]rdoq-level <0|1|2>     Level of RDO in quantization 0:none, 1:levels, 2:levels & coding groups. Default %d\n", param->rdoqLevel);
     H0("   --[no-]psy-rdoq <0..50.0>     Strength of psycho-visual optimization in RDO quantization, 0 to disable. Default %.1f\n", param->psyRdoq);
+    H0("   --[no-]rd-refine              Enable QP based RD refinement for rd levels 5 and 6. Default %s\n", OPT(param->bEnableRdRefine));
     H0("   --[no-]early-skip             Enable early SKIP detection. Default %s\n", OPT(param->bEnableEarlySkip));
+    H0("   --[no-]rskip                  Enable early exit from recursion. Default %s\n", OPT(param->bEnableRecursionSkip));
     H1("   --[no-]tskip-fast             Enable fast intra transform skipping. Default %s\n", OPT(param->bEnableTSkipFast));
     H1("   --nr-intra <integer>          An integer value in range of 0 to 2000, which denotes strength of noise reduction in intra CUs. Default 0\n");
     H1("   --nr-inter <integer>          An integer value in range of 0 to 2000, which denotes strength of noise reduction in inter CUs. Default 0\n");
@@ -373,6 +383,7 @@
     H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
     H0("   --qg-size <int>               Specifies the size of the quantization group (64, 32, 16). Default %d\n", param->rc.qgSize);
     H0("   --[no-]cutree                 Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree));
+    H0("   --[no-]rc-grain               Enable ratecontrol mode to handle grain specifically. Turned on with tune grain. Default %s\n", OPT(param->rc.bEnableGrain));
     H1("   --ipratio <float>             QP factor between I and P. Default %.2f\n", param->rc.ipFactor);
     H1("   --pbratio <float>             QP factor between P and B. Default %.2f\n", param->rc.pbFactor);
     H1("   --qcomp <float>               Weight given to predicted complexity. Default %.2f\n", param->rc.qCompress);
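Illustrative command lines exercising the new switches (input and output names are placeholders; with --uhd-bd the encoder may further override conflicting settings, as the changelog notes):

x265 --preset slow --tune grain --crf 22 input.y4m -o out.hevc
x265 --preset veryslow --no-rskip --rd-refine input.y4m -o out.hevc
x265 --preset medium --uhd-bd --vbv-maxrate 35000 --vbv-bufsize 35000 input_2160p.y4m -o out.hevc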