x265
Changes of Revision 19
x265.changes
Changed
@@ -1,4 +1,34 @@
 -------------------------------------------------------------------
+Sun Jan  1 20:32:07 UTC 2017 - idonmez@suse.com
+
+- Update to version 2.2
+  Encode enhancements
+  * Enhancements to TU selection algorithm with early-outs for
+    improved speed; use --limit-tu to exercise.
+  * New motion search method SEA (Successive Elimination Algorithm)
+    supported now as :option: --me 4
+  * Bit-stream optimizations to improve fields in PPS and SPS for
+    bit-rate savings through --[no-]opt-qp-pps,
+    --[no-]opt-ref-list-length-pps, and --[no-]multi-pass-opt-rps.
+  * Enabled using VBV constraints when encoding without WPP.
+  * All param options dumped in SEI packet in bitstream when info
+    selected.
+  API changes
+  * Options to disable SEI and optional-VUI messages from bitstream
+    made more descriptive.
+  * New option --scenecut-bias to enable controlling bias to mark
+    scene-cuts via cli.
+  * Support mono and mono16 color spaces for y4m input.
+  * --min-cu-size of 64 no-longer supported for reasons of
+    visual quality.
+  * API for CSV now expects version string for better integration
+    of x265 into other applications.
+  Bug fixes
+  * Several fixes to slice-based encoding.
+  * --log2-max-poc-lsb's range limited according to HEVC spec.
+  * Restrict MVs to within legal boundaries when encoding.
+
+-------------------------------------------------------------------
 Thu Dec 22 12:59:47 UTC 2016 - scarabeus@opensuse.org
 
 - Add conditional for the numa-devel again it was not ment to be dropped
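A minimal sketch of how an application could turn these 2.2 additions on through the public x265_param_parse() API; the key strings match the OPT() handlers added in the source/common/param.cpp hunk further down, while the helper name and the chosen values here are illustrative only:

    #include <x265.h>

    /* Sketch only: enable the 2.2 additions via the public param API.
     * Key strings correspond to the new OPT() handlers in source/common/param.cpp. */
    static int enable_v22_features(x265_param* p)
    {
        int err = 0;
        err |= x265_param_parse(p, "limit-tu", "2");               /* TU early-out level 0..4 */
        err |= x265_param_parse(p, "me", "4");                     /* 4 = SEA motion search   */
        err |= x265_param_parse(p, "scenecut-bias", "5");          /* bias in percent         */
        err |= x265_param_parse(p, "opt-qp-pps", "1");             /* PPS QP optimization     */
        err |= x265_param_parse(p, "opt-ref-list-length-pps", "1");
        err |= x265_param_parse(p, "multi-pass-opt-rps", "1");
        return err;                                                /* 0 when all keys parsed  */
    }

The same switches are exposed on the command line as --limit-tu, --me 4, --scenecut-bias, --opt-qp-pps, --opt-ref-list-length-pps and --multi-pass-opt-rps, as documented in the cli.rst diff below.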
x265.spec
Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name: x265
-%define soname 95
+%define soname 102
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version: 2.1
+Version: 2.2
 Release: 0
 License: GPL-2.0+
 Summary: A free h265/HEVC encoder - encoder binary
arm.patch
Changed
@@ -1,11 +1,11 @@ -Index: x265_2.1/source/CMakeLists.txt +Index: x265_2.2/source/CMakeLists.txt =================================================================== ---- x265_2.1.orig/source/CMakeLists.txt -+++ x265_2.1/source/CMakeLists.txt -@@ -60,15 +60,22 @@ elseif(POWERMATCH GREATER "-1") - message(STATUS "Detected POWER target processor") - set(POWER 1) - add_definitions(-DX265_ARCH_POWER=1) +--- x265_2.2.orig/source/CMakeLists.txt ++++ x265_2.2/source/CMakeLists.txt +@@ -65,15 +65,22 @@ elseif(POWERMATCH GREATER "-1") + add_definitions(-DPPC64=1) + message(STATUS "Detected POWER PPC64 target processor") + endif() -elseif(ARMMATCH GREATER "-1") - if(CROSS_COMPILE_ARM) - message(STATUS "Cross compiling for ARM arch") @@ -34,7 +34,7 @@ else() message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown") message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}") -@@ -190,18 +197,9 @@ if(GCC) +@@ -208,18 +215,9 @@ if(GCC) endif() endif() endif() @@ -55,10 +55,10 @@ if(FPROFILE_GENERATE) if(INTEL_CXX) add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}") -Index: x265_2.1/source/common/cpu.cpp +Index: x265_2.2/source/common/cpu.cpp =================================================================== ---- x265_2.1.orig/source/common/cpu.cpp -+++ x265_2.1/source/common/cpu.cpp +--- x265_2.2.orig/source/common/cpu.cpp ++++ x265_2.2/source/common/cpu.cpp @@ -37,7 +37,7 @@ #include <machine/cpu.h> #endif @@ -68,7 +68,7 @@ #include <signal.h> #include <setjmp.h> static sigjmp_buf jmpbuf; -@@ -340,7 +340,6 @@ uint32_t cpu_detect(void) +@@ -344,7 +344,6 @@ uint32_t cpu_detect(void) } canjump = 1; @@ -76,7 +76,7 @@ canjump = 0; signal(SIGILL, oldsig); #endif // if !HAVE_NEON -@@ -356,7 +355,7 @@ uint32_t cpu_detect(void) +@@ -360,7 +359,7 @@ uint32_t cpu_detect(void) // which may result in incorrect detection and the counters stuck enabled. // right now Apple does not seem to support performance counters for this test #ifndef __MACH__
baselibs.conf
Changed
@@ -1,1 +1,1 @@
-libx265-95
+libx265-102
x265_2.1.tar.gz/.hg_archival.txt -> x265_2.2.tar.gz/.hg_archival.txt
Changed
@@ -1,6 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 3e8ce3b26319dbd53ab6369e4c4e986bf30f1315
+node: be14a7e9755e54f0fd34911c72bdfa66981220bc
 branch: stable
-latesttag: 2.1
-latesttagdistance: 1
-changessincelatesttag: 1
+tag: 2.2
x265_2.1.tar.gz/doc/reST/cli.rst -> x265_2.2.tar.gz/doc/reST/cli.rst
Changed
@@ -662,7 +662,7 @@ and less frame parallelism as well. Because of this the faster presets use a CU size of 32. Default: 64 -.. option:: --min-cu-size <64|32|16|8> +.. option:: --min-cu-size <32|16|8> Minimum CU size (width and height). By using 16 or 32 the encoder will not analyze the cost of CUs below that minimum threshold, @@ -869,6 +869,24 @@ partitions, in which case a TU split is implied and thus the residual quad-tree begins one layer below the CU quad-tree. +.. option:: --limit-tu <0..4> + + Enables early exit from TU depth recursion, for inter coded blocks. + Level 1 - decides to recurse to next higher depth based on cost + comparison of full size TU and split TU. + + Level 2 - based on first split subTU's depth, limits recursion of + other split subTUs. + + Level 3 - based on the average depth of the co-located and the neighbor + CUs' TU depth, limits recursion of the current CU. + + Level 4 - uses the depth of the neighbouring/ co-located CUs TU depth + to limit the 1st subTU depth. The 1st subTU depth is taken as the + limiting depth for the other subTUs. + + Default: 0 + .. option:: --nr-intra <integer>, --nr-inter <integer> Noise reduction - an adaptive deadzone applied after DCT @@ -949,13 +967,17 @@ encoder: a star-pattern search followed by an optional radix scan followed by an optional star-search refinement. Full is an exhaustive search; an order of magnitude slower than all other - searches but not much better than umh or star. + searches but not much better than umh or star. SEA is similar to + FULL search; a three step motion search adopted from x264: DC + calculation followed by ADS calculation followed by SAD of the + passed motion vector candidates, hence faster than Full search. 0. dia 1. hex **(default)** 2. umh 3. star - 4. full + 4. sea + 5. full .. option:: --subme, -m <0..7> @@ -1153,6 +1175,13 @@ :option:`--scenecut` 0 or :option:`--no-scenecut` disables adaptive I frame placement. Default 40 +.. option:: --scenecut-bias <0..100.0> + + This value represents the percentage difference between the inter cost and + intra cost of a frame used in scenecut detection. For example, a value of 5 indicates, + if the inter cost of a frame is greater than or equal to 95 percent of the intra cost of the frame, + then detect this frame as scenecut. Values between 5 and 15 are recommended. Default 5. + .. option:: --intra-refresh Enables Periodic Intra Refresh(PIR) instead of keyframe insertion. @@ -1304,7 +1333,7 @@ slices using param->rc.ipFactor and param->rc.pbFactor unless QP 0 is specified, in which case QP 0 is used for all slice types. Note that QP 0 does not cause lossless encoding, it only disables - quantization. Default disabled (CRF) + quantization. Default disabled. **Range of values:** an integer from 0 to 51 @@ -1824,7 +1853,7 @@ enhancement layer. A decoder may chose to drop the enhancement layer and only decode and display the base layer slices. - If used with a fixed GOP (:option:`b-adapt` 0) and :option:`bframes` + If used with a fixed GOP (:option:`--b-adapt` 0) and :option:`--bframes` 3 then the two layers evenly split the frame rate, with a cadence of PbBbP. You probably also want :option:`--no-scenecut` and a keyframe interval that is a multiple of 4. @@ -1833,15 +1862,29 @@ Maximum of the picture order count. Default 8 -.. option:: --discard-sei +.. option:: --[no-]vui-timing-info - Discard SEI messages generated from the final bitstream. HDR-related SEI - messages are always dumped, immaterial of this option. Default disabled. - -.. 
option:: --discard-vui + Emit VUI timing info in bitstream. Default enabled. + +.. option:: --[no-]vui-hrd-info + + Emit VUI HRD info in bitstream. Default enabled when + :option:`--hrd` is enabled. + +.. option:: --[no-]opt-qp-pps + + Optimize QP in PPS (instead of default value of 26) based on the QP values + observed in last GOP. Default enabled. + +.. option:: --[no-]opt-ref-list-length-pps + + Optimize L0 and L1 ref list length in PPS (instead of default value of 0) + based on the lengths observed in the last GOP. Default enabled. + +.. option:: --[no-]multi-pass-opt-rps + + Enable storing commonly used RPS in SPS in multi pass mode. Default disabled. - Discard optional VUI information (timing, HRD info) from the - bitstream. Default disabled. Debugging options =================
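To make the --scenecut-bias rule documented above concrete: with the default bias of 5, a frame is flagged as a scene-cut when its inter cost reaches at least 95 percent of its intra cost. A small illustrative sketch of that decision; the function and variable names here are hypothetical, not taken from the x265 lookahead code:

    // Sketch of the documented scenecut-bias rule; not the encoder's actual code path.
    static bool isScenecut(double interCost, double intraCost, double scenecutBias /* 0..100 */)
    {
        // bias = 5  ->  flag a scene-cut when interCost >= 0.95 * intraCost
        return interCost >= (1.0 - scenecutBias / 100.0) * intraCost;
    }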
x265_2.1.tar.gz/doc/reST/index.rst -> x265_2.2.tar.gz/doc/reST/index.rst
Changed
@@ -9,3 +9,4 @@
    threading
    presets
    lossless
+   releasenotes
x265_2.2.tar.gz/doc/reST/releasenotes.rst
Added
@@ -0,0 +1,141 @@ +************* +Release Notes +************* + +Version 2.2 +=========== + +Release date - 26th December, 2016. + +Encoder enhancements +-------------------- +1. Enhancements to TU selection algorithm with early-outs for improved speed; use :option:`--limit-tu` to exercise. +2. New motion search method SEA (Successive Elimination Algorithm) supported now as :option: `--me` 4 +3. Bit-stream optimizations to improve fields in PPS and SPS for bit-rate savings through :option:`--[no-]opt-qp-pps`, :option:`--[no-]opt-ref-list-length-pps`, and :option:`--[no-]multi-pass-opt-rps`. +4. Enabled using VBV constraints when encoding without WPP. +5. All param options dumped in SEI packet in bitstream when info selected. +6. x265 now supports POWERPC-based systems. Several key functions also have optimized ALTIVEC kernels. + +API changes +----------- +1. Options to disable SEI and optional-VUI messages from bitstream made more descriptive. +2. New option :option:`--scenecut-bias` to enable controlling bias to mark scene-cuts via cli. +3. Support mono and mono16 color spaces for y4m input. +4. :option:`--min-cu-size` of 64 no-longer supported for reasons of visual quality (was crashing earlier anyways.) +5. API for CSV now expects version string for better integration of x265 into other applications. + +Bug fixes +--------- +1. Several fixes to slice-based encoding. +2. :option:`--log2-max-poc-lsb`'s range limited according to HEVC spec. +3. Restrict MVs to within legal boundaries when encoding. + +Version 2.1 +=========== + +Release date - 27th September, 2016 + +Encoder enhancements +-------------------- +1. Support for qg-size of 8 +2. Support for inserting non-IDR I-frames at scenecuts and when running with settings for fixed-GOP (min-keyint = max-keyint) +3. Experimental support for slice-parallelism. + +API changes +----------- +1. Encode user-define SEI messages passed in through x265_picture object. +2. Disable SEI and VUI messages from the bitstream +3. Specify qpmin and qpmax +4. Control number of bits to encode POC. + +Bug fixes +--------- +1. QP fluctuation fix for first B-frame in mini-GOP for 2-pass encoding with tune-grain. +2. Assembly fix for crashes in 32-bit from dct_sse4. +3. Threadpool creation fix in windows platform. + +Version 2.0 +=========== + +Release date - 13th July, 2016 + +New Features +------------ + +1. uhd-bd: Enable Ultra-HD Bluray support +2. rskip: Enables skipping recursion to analyze lower CU sizes using heuristics at different rd-levels. Provides good visual quality gains at the highest quality presets. +3. rc-grain: Enables a new ratecontrol mode specifically for grainy content. Strictly prevents QP oscillations within and between frames to avoid grain fluctuations. +4. tune grain: A fully refactored and improved option to encode film grain content including QP control as well as analysis options. +5. asm: ARM assembly is now enabled by default, native or cross compiled builds supported on armv6 and later systems. + +API and Key Behaviour Changes +----------------------------- + +1. x265_rc_stats added to x265_picture, containing all RC decision points for that frame +2. PTL: high tier is now allowed by default, chosen only if necessary +3. multi-pass: First pass now uses slow-firstpass by default, enabling better RC decisions in future passes +4. pools: fix behaviour on multi-socketed Windows systems, provide more flexibility in determining thread and pool counts +5. 
ABR: improve bits allocation in the first few frames, abr reset, vbv and cutree improved + +Misc +---- +1. An SSIM calculation bug was corrected + +Version 1.9 +=========== + +Release date - 29th January, 2016 + +New Features +------------ + +1. Quant offsets: This feature allows block level quantization offsets to be specified for every frame. An API-only feature. +2. --intra-refresh: Keyframes can be replaced by a moving column of intra blocks in non-keyframes. +3. --limit-modes: Intelligently restricts mode analysis. +4. --max-luma and --min-luma for luma clipping, optional for HDR use-cases +5. Emergency denoising is now enabled by default in very low bitrate, VBV encodes + +API Changes +----------- + +1. x265_frame_stats returns many additional fields: maxCLL, maxFALL, residual energy, scenecut and latency logging +2. --qpfile now supports frametype 'K" +3. x265 now allows CRF ratecontrol in pass N (N greater than or equal to 2) +4. Chroma subsampling format YUV 4:0:0 is now fully supported and tested + +Presets and Performance +----------------------- + +1. Recently added features lookahead-slices, limit-modes, limit-refs have been enabled by default for applicable presets. +2. The default psy-rd strength has been increased to 2.0 +3. Multi-socket machines now use a single pool of threads that can work cross-socket. + +Version 1.8 +=========== + +Release date - 10th August, 2015 + +API Changes +----------- +1. Experimental support for Main12 is now enabled. Partial assembly support exists. +2. Main12 and Intra/Still picture profiles are now supported. Still picture profile is detected based on x265_param::totalFrames. +3. Three classes of encoding statistics are now available through the API. +a) x265_stats - contains encoding statistics, available through x265_encoder_get_stats() +b) x265_frame_stats and x265_cu_stats - contains frame encoding statistics, available through recon x265_picture +4. --csv +a) x265_encoder_log() is now deprecated +b) x265_param::csvfn is also deprecated +5. --log-level now controls only console logging, frame level console logging has been removed. +6. Support added for new color transfer characteristic ARIB STD-B67 + +New Features +------------ +1. limit-refs: This feature limits the references analysed for individual CUS. Provides a nice tradeoff between efficiency and performance. +2. aq-mode 3: A new aq-mode that provides additional biasing for low-light conditions. +3. An improved scene cut detection logic that allows ratecontrol to manage visual quality at fade-ins and fade-outs better. + +Preset and Tune Options +----------------------- + +1. tune grain: Increases psyRdoq strength to 10.0, and rdoq-level to 2. +2. qg-size: Default value changed to 32.
x265_2.1.tar.gz/source/CMakeLists.txt -> x265_2.2.tar.gz/source/CMakeLists.txt
Changed
@@ -30,7 +30,7 @@ mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD) # X265_BUILD must be incremented each time the public API is changed -set(X265_BUILD 95) +set(X265_BUILD 102) configure_file("${PROJECT_SOURCE_DIR}/x265.def.in" "${PROJECT_BINARY_DIR}/x265.def") configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in" @@ -60,6 +60,11 @@ message(STATUS "Detected POWER target processor") set(POWER 1) add_definitions(-DX265_ARCH_POWER=1) + if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8) + set(PPC64 1) + add_definitions(-DPPC64=1) + message(STATUS "Detected POWER PPC64 target processor") + endif() elseif(ARMMATCH GREATER "-1") if(CROSS_COMPILE_ARM) message(STATUS "Cross compiling for ARM arch") @@ -167,6 +172,19 @@ elseif(CMAKE_COMPILER_IS_GNUCXX) set(GCC 1) endif() + +if(CC STREQUAL "xlc") + message(STATUS "Use XLC compiler") + set(XLC 1) + set(GCC 0) + #set(CMAKE_C_COMPILER "/usr/bin/xlc") + #set(CMAKE_CXX_COMPILER "/usr/bin/xlc++") + add_definitions(-D__XLC__=1) + add_definitions(-O3 -qstrict -qhot -qaltivec) + add_definitions(-qinline=level=10 -qpath=IL:/data/video_files/latest.tpo/) +endif() + + if(GCC) add_definitions(-Wall -Wextra -Wshadow) add_definitions(-D__STDC_LIMIT_MACROS=1) @@ -396,6 +414,22 @@ endif(WINXP_SUPPORT) endif() +if(POWER) + # IBM Power8 + option(ENABLE_ALTIVEC "Enable ALTIVEC profiling instrumentation" ON) + if(ENABLE_ALTIVEC) + add_definitions(-DHAVE_ALTIVEC=1 -maltivec -mabi=altivec) + add_definitions(-flax-vector-conversions -fpermissive) + else() + add_definitions(-DHAVE_ALTIVEC=0) + endif() + + option(CPU_POWER8 "Enable CPU POWER8 profiling instrumentation" ON) + if(CPU_POWER8) + add_definitions(-mcpu=power8 -DX265_ARCH_POWER8=1) + endif() +endif() + include(version) # determine X265_VERSION and X265_LATEST_TAG include_directories(. common encoder "${PROJECT_BINARY_DIR}")
x265_2.1.tar.gz/source/common/CMakeLists.txt -> x265_2.2.tar.gz/source/common/CMakeLists.txt
Changed
@@ -99,6 +99,19 @@ source_group(Assembly FILES ${ASM_PRIMITIVES}) endif(ENABLE_ASSEMBLY AND (ARM OR CROSS_COMPILE_ARM)) +if(POWER) + set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS -DX265_VERSION=${X265_VERSION}) + if(ENABLE_ALTIVEC) + set(ALTIVEC_SRCS pixel_altivec.cpp dct_altivec.cpp ipfilter_altivec.cpp intrapred_altivec.cpp) + foreach(SRC ${ALTIVEC_SRCS}) + set(ALTIVEC_PRIMITIVES ${ALTIVEC_PRIMITIVES} ppc/${SRC}) + endforeach() + source_group(Intrinsics_altivec FILES ${ALTIVEC_PRIMITIVES}) + set_source_files_properties(${ALTIVEC_PRIMITIVES} PROPERTIES COMPILE_FLAGS "-Wno-unused -Wno-unknown-pragmas -Wno-maybe-uninitialized") + endif() +endif() + + # set_target_properties can't do list expansion string(REPLACE ";" " " VERSION_FLAGS "${VFLAGS}") set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS ${VERSION_FLAGS}) @@ -116,7 +129,7 @@ endif(WIN32) add_library(common OBJECT - ${ASM_PRIMITIVES} ${VEC_PRIMITIVES} ${WINXP} + ${ASM_PRIMITIVES} ${VEC_PRIMITIVES} ${ALTIVEC_PRIMITIVES} ${WINXP} primitives.cpp primitives.h pixel.cpp dct.cpp ipfilter.cpp intrapred.cpp loopfilter.cpp constants.cpp constants.h
x265_2.1.tar.gz/source/common/bitstream.h -> x265_2.2.tar.gz/source/common/bitstream.h
Changed
@@ -71,6 +71,7 @@
     uint32_t getNumberOfWrittenBytes() const { return m_byteOccupancy; }
     uint32_t getNumberOfWrittenBits() const { return m_byteOccupancy * 8 + m_partialByteBits; }
     const uint8_t* getFIFO() const { return m_fifo; }
+    void copyBits(Bitstream* stream) { m_partialByteBits = stream->m_partialByteBits; m_byteOccupancy = stream->m_byteOccupancy; m_partialByte = stream->m_partialByte; }
 
     void write(uint32_t val, uint32_t numBits);
     void writeByte(uint32_t val);
x265_2.1.tar.gz/source/common/common.h -> x265_2.2.tar.gz/source/common/common.h
Changed
@@ -176,7 +176,7 @@
 
 #define X265_MIN(a, b) ((a) < (b) ? (a) : (b))
 #define X265_MAX(a, b) ((a) > (b) ? (a) : (b))
-#define COPY1_IF_LT(x, y) if ((y) < (x)) (x) = (y);
+#define COPY1_IF_LT(x, y) {if ((y) < (x)) (x) = (y);}
 #define COPY2_IF_LT(x, y, a, b) \
     if ((y) < (x)) \
     { \
@@ -312,6 +312,7 @@
 
 #define MAX_NUM_REF_PICS 16 // max. number of pictures used for reference
 #define MAX_NUM_REF 16      // max. number of entries in picture reference list
+#define MAX_NUM_SHORT_TERM_RPS 64 // max. number of short term reference picture set in SPS
 
 #define REF_NOT_VALID -1
 
@@ -327,6 +328,8 @@
 
 #define PIXEL_MAX ((1 << X265_DEPTH) - 1)
 
+#define INTEGRAL_PLANE_NUM 12 // 12 integral planes for 32x32, 32x24, 32x8, 24x32, 16x16, 16x12, 16x4, 12x16, 8x32, 8x8, 4x16 and 4x4.
+
 namespace X265_NS {
 
 enum { SAO_NUM_OFFSET = 4 };
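The braces added to COPY1_IF_LT are standard macro hygiene: the old single-statement form is vulnerable to the classic dangling-else pitfall when the macro is used as the body of an if that has an else. A small illustration with a hypothetical call site (not code from x265):

    // Old form:  #define COPY1_IF_LT(x, y) if ((y) < (x)) (x) = (y);
    // In the call below that expansion would bind the 'else' to the macro's own
    // 'if', silently changing behaviour; the braced form keeps the 'else'
    // attached to the outer 'if' as written.
    #define COPY1_IF_LT(x, y) {if ((y) < (x)) (x) = (y);}

    static void pickBest(int candidate, bool valid, int& best)
    {
        if (valid)
            COPY1_IF_LT(best, candidate)   // note: no trailing ';' before the else
        else
            best = candidate;
    }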
x265_2.1.tar.gz/source/common/cpu.cpp -> x265_2.2.tar.gz/source/common/cpu.cpp
Changed
@@ -99,6 +99,10 @@
     { "ARMv6", X265_CPU_ARMV6 },
     { "NEON", X265_CPU_NEON },
     { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
+
+#elif X265_ARCH_POWER8
+    { "Altivec", X265_CPU_ALTIVEC },
+
 #endif // if X265_ARCH_X86
     { "", 0 },
 };
@@ -363,7 +367,18 @@
     return flags;
 }
 
-#else // if X265_ARCH_X86
+#elif X265_ARCH_POWER8
+
+uint32_t cpu_detect(void)
+{
+#if HAVE_ALTIVEC
+    return X265_CPU_ALTIVEC;
+#else
+    return 0;
+#endif
+}
+
+#else // if X265_ARCH_POWER8
 
 uint32_t cpu_detect(void)
 {
x265_2.1.tar.gz/source/common/cudata.cpp -> x265_2.2.tar.gz/source/common/cudata.cpp
Changed
@@ -296,6 +296,9 @@
     /* initialize the remaining CU data in one memset */
     memset(m_cuDepth, 0, (frame.m_param->internalCsp == X265_CSP_I400 ? BytesPerPartition - 11 : BytesPerPartition - 7) * m_numPartitions);
 
+    for (int8_t i = 0; i < NUM_TU_DEPTH; i++)
+        m_refTuDepth[i] = -1;
+
     uint32_t widthInCU = m_slice->m_sps->numCuInWidth;
     m_cuLeft = (m_cuAddr % widthInCU) ? m_encData->getPicCTU(m_cuAddr - 1) : NULL;
     m_cuAbove = (m_cuAddr >= widthInCU) && !m_bFirstRowInSlice ? m_encData->getPicCTU(m_cuAddr - widthInCU) : NULL;
x265_2.1.tar.gz/source/common/cudata.h -> x265_2.2.tar.gz/source/common/cudata.h
Changed
@@ -28,6 +28,8 @@
 #include "slice.h"
 #include "mv.h"
 
+#define NUM_TU_DEPTH 21
+
 namespace X265_NS {
 // private namespace
 
@@ -204,6 +206,7 @@
     enum { BytesPerPartition = 21 }; // combined sizeof() of all per-part data
 
     coeff_t* m_trCoeff[3]; // transformed coefficient buffer per plane
+    int8_t m_refTuDepth[NUM_TU_DEPTH]; // TU depth of CU at depths 0, 1 and 2
 
     MV* m_mv[2]; // array of motion vectors per list
     MV* m_mvd[2]; // array of coded motion vector deltas per list
@@ -355,9 +358,8 @@
             CHECKED_MALLOC(trCoeffMemBlock, coeff_t, (sizeL + sizeC * 2) * numInstances);
         }
         CHECKED_MALLOC(charMemBlock, uint8_t, numPartition * numInstances * CUData::BytesPerPartition);
-        CHECKED_MALLOC(mvMemBlock, MV, numPartition * 4 * numInstances);
+        CHECKED_MALLOC_ZERO(mvMemBlock, MV, numPartition * 4 * numInstances);
         return true;
-
     fail:
         return false;
     }
x265_2.1.tar.gz/source/common/framedata.cpp -> x265_2.2.tar.gz/source/common/framedata.cpp
Changed
@@ -37,6 +37,9 @@ m_slice = new Slice; m_picCTU = new CUData[sps.numCUsInFrame]; m_picCsp = csp; + m_spsrpsIdx = -1; + if (param.rc.bStatWrite) + m_spsrps = const_cast<RPS*>(sps.spsrps); m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame); for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++) @@ -45,6 +48,12 @@ CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame); CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight); reinit(sps); + + for (int i = 0; i < INTEGRAL_PLANE_NUM; i++) + { + m_meBuffer[i] = NULL; + m_meIntegral[i] = NULL; + } return true; fail: @@ -67,4 +76,16 @@ X265_FREE(m_cuStat); X265_FREE(m_rowStat); + + if (m_meBuffer) + { + for (int i = 0; i < INTEGRAL_PLANE_NUM; i++) + { + if (m_meBuffer[i] != NULL) + { + X265_FREE(m_meBuffer[i]); + m_meBuffer[i] = NULL; + } + } + } }
x265_2.1.tar.gz/source/common/framedata.h -> x265_2.2.tar.gz/source/common/framedata.h
Changed
@@ -106,6 +106,9 @@ CUDataMemPool m_cuMemPool; CUData* m_picCTU; + RPS* m_spsrps; + int m_spsrpsIdx; + /* Rate control data used during encode and by references */ struct RCStatCU { @@ -123,10 +126,10 @@ uint32_t encodedBits; /* sum of 'totalBits' of encoded CTUs */ uint32_t satdForVbv; /* sum of lowres (estimated) costs for entire row */ uint32_t intraSatdForVbv; /* sum of lowres (estimated) intra costs for entire row */ - uint32_t diagSatd; - uint32_t diagIntraSatd; - double diagQp; - double diagQpScale; + uint32_t rowSatd; + uint32_t rowIntraSatd; + double rowQp; + double rowQpScale; double sumQpRc; double sumQpAq; }; @@ -148,6 +151,9 @@ double m_rateFactor; /* calculated based on the Frame QP */ int m_picCsp; + uint32_t* m_meIntegral[INTEGRAL_PLANE_NUM]; // 12 integral planes for 32x32, 32x24, 32x8, 24x32, 16x16, 16x12, 16x4, 12x16, 8x32, 8x8, 4x16 and 4x4. + uint32_t* m_meBuffer[INTEGRAL_PLANE_NUM]; + FrameData(); bool create(const x265_param& param, const SPS& sps, int csp); @@ -168,7 +174,6 @@ /* Stores inter analysis data for a single frame */ struct analysis_inter_data { - MV* mv; WeightParam* wt; int32_t* ref; uint8_t* depth;
x265_2.1.tar.gz/source/common/param.cpp -> x265_2.2.tar.gz/source/common/param.cpp
Changed
@@ -149,6 +149,7 @@ param->bBPyramid = 1; param->scenecutThreshold = 40; /* Magic number pulled in from x264 */ param->lookaheadSlices = 8; + param->scenecutBias = 5.0; /* Intra Coding Tools */ param->bEnableConstrainedIntra = 0; @@ -176,6 +177,7 @@ param->maxNumReferences = 3; param->bEnableTemporalMvp = 1; param->bSourceReferenceEstimation = 0; + param->limitTU = 0; /* Loop Filter */ param->bEnableLoopFilter = 1; @@ -197,6 +199,7 @@ param->bCULossless = 0; param->bEnableTemporalSubLayers = 0; param->bEnableRdRefine = 0; + param->bMultiPassOptRPS = 0; /* Rate control options */ param->rc.vbvMaxBitrate = 0; @@ -229,8 +232,6 @@ param->rc.qpMin = 0; param->rc.qpMax = QP_MAX_MAX; - param->bDiscardOptionalVUI = 0; - /* Video Usability Information (VUI) */ param->vui.aspectRatioIdc = 0; param->vui.sarWidth = 0; @@ -256,8 +257,13 @@ param->minLuma = 0; param->maxLuma = PIXEL_MAX; param->log2MaxPocLsb = 8; - param->bDiscardSEI = false; param->maxSlices = 1; + + param->bEmitVUITimingInfo = 1; + param->bEmitVUIHRDInfo = 1; + param->bOptQpPPS = 1; + param->bOptRefListLengthPPS = 1; + } int x265_param_default_preset(x265_param* param, const char* preset, const char* tune) @@ -901,21 +907,19 @@ // solve "fatal error C1061: compiler limit : blocks nested too deeply" if (bExtraParams) { - bExtraParams = false; - if (0) ; - OPT("slices") p->maxSlices = atoi(value); - else - bExtraParams = true; - } - - if (bExtraParams) - { if (0) ; OPT("qpmin") p->rc.qpMin = atoi(value); OPT("analyze-src-pics") p->bSourceReferenceEstimation = atobool(value); OPT("log2-max-poc-lsb") p->log2MaxPocLsb = atoi(value); - OPT("discard-sei") p->bDiscardSEI = atobool(value); - OPT("discard-vui") p->bDiscardOptionalVUI = atobool(value); + OPT("vui-timing-info") p->bEmitVUITimingInfo = atobool(value); + OPT("vui-hrd-info") p->bEmitVUIHRDInfo = atobool(value); + OPT("slices") p->maxSlices = atoi(value); + OPT("limit-tu") p->limitTU = atoi(value); + OPT("opt-qp-pps") p->bOptQpPPS = atobool(value); + OPT("opt-ref-list-length-pps") p->bOptRefListLengthPPS = atobool(value); + OPT("multi-pass-opt-rps") p->bMultiPassOptRPS = atobool(value); + OPT("scenecut-bias") p->scenecutBias = atof(value); + else return X265_PARAM_BAD_NAME; } @@ -1078,8 +1082,8 @@ "Multiple-Slices mode must be enable Wavefront Parallel Processing (--wpp)"); CHECK(param->internalBitDepth != X265_DEPTH, "internalBitDepth must match compiled bit depth"); - CHECK(param->minCUSize != 64 && param->minCUSize != 32 && param->minCUSize != 16 && param->minCUSize != 8, - "minimim CU size must be 8, 16, 32, or 64"); + CHECK(param->minCUSize != 32 && param->minCUSize != 16 && param->minCUSize != 8, + "minimim CU size must be 8, 16 or 32"); CHECK(param->minCUSize > param->maxCUSize, "min CU size must be less than or equal to max CU size"); CHECK(param->rc.qp < -6 * (param->internalBitDepth - 8) || param->rc.qp > QP_MAX_SPEC, @@ -1088,8 +1092,8 @@ "Frame rate numerator and denominator must be specified"); CHECK(param->interlaceMode < 0 || param->interlaceMode > 2, "Interlace mode must be 0 (progressive) 1 (top-field first) or 2 (bottom field first)"); - CHECK(param->searchMethod<0 || param->searchMethod> X265_FULL_SEARCH, - "Search method is not supported value (0:DIA 1:HEX 2:UMH 3:HM 5:FULL)"); + CHECK(param->searchMethod < 0 || param->searchMethod > X265_FULL_SEARCH, + "Search method is not supported value (0:DIA 1:HEX 2:UMH 3:HM 4:SEA 5:FULL)"); CHECK(param->searchRange < 0, "Search Range must be more than 0"); CHECK(param->searchRange >= 32768, @@ -1122,6 +1126,7 @@ 
"QuadtreeTUMaxDepthInter must be less than or equal to the difference between log2(maxCUSize) and QuadtreeTULog2MinSize plus 1"); CHECK((param->maxTUSize != 32 && param->maxTUSize != 16 && param->maxTUSize != 8 && param->maxTUSize != 4), "max TU size must be 4, 8, 16, or 32"); + CHECK(param->limitTU > 4, "Invalid limit-tu option, limit-TU must be between 0 and 4"); CHECK(param->maxNumMergeCand < 1, "MaxNumMergeCand must be 1 or greater."); CHECK(param->maxNumMergeCand > 5, "MaxNumMergeCand must be 5 or smaller."); @@ -1217,6 +1222,8 @@ "Valid Logging level -1:none 0:error 1:warning 2:info 3:debug 4:full"); CHECK(param->scenecutThreshold < 0, "scenecutThreshold must be greater than 0"); + CHECK(param->scenecutBias < 0 || 100 < param->scenecutBias, + "scenecut-bias must be between 0 and 100"); CHECK(param->rdPenalty < 0 || param->rdPenalty > 2, "Valid penalty for 32x32 intra TU in non-I slices. 0:disabled 1:RD-penalty 2:maximum"); CHECK(param->keyframeMax < -1, @@ -1247,10 +1254,12 @@ "qpmax exceeds supported range (0 to 69)"); CHECK(param->rc.qpMin < QP_MIN || param->rc.qpMin > QP_MAX_MAX, "qpmin exceeds supported range (0 to 69)"); - CHECK(param->log2MaxPocLsb < 4, - "maximum of the picture order count can not be less than 4"); - CHECK(1 > param->maxSlices || param->maxSlices > ((param->sourceHeight + param->maxCUSize - 1) / param->maxCUSize), - "The slices can not be more than number of rows"); + CHECK(param->log2MaxPocLsb < 4 || param->log2MaxPocLsb > 16, + "Supported range for log2MaxPocLsb is 4 to 16"); +#if !X86_64 + CHECK(param->searchMethod == X265_SEA && (param->sourceWidth > 840 || param->sourceHeight > 480), + "SEA motion search does not support resolutions greater than 480p in 32 bit build"); +#endif return check_failed; } @@ -1338,9 +1347,8 @@ x265_log(param, X265_LOG_INFO, "ME / range / subpel / merge : %s / %d / %d / %d\n", x265_motion_est_names[param->searchMethod], param->searchRange, param->subpelRefine, param->maxNumMergeCand); - if (param->keyframeMax != INT_MAX || param->scenecutThreshold) - x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut : %d / %d / %d\n", param->keyframeMin, param->keyframeMax, param->scenecutThreshold); + x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut / bias: %d / %d / %d / %.2lf\n", param->keyframeMin, param->keyframeMax, param->scenecutThreshold, param->scenecutBias * 100); else x265_log(param, X265_LOG_INFO, "Keyframe min / max / scenecut : disabled\n"); @@ -1395,6 +1403,7 @@ TOOLVAL(param->noiseReductionInter, "nr-inter=%d"); TOOLOPT(param->bEnableTSkipFast, "tskip-fast"); TOOLOPT(!param->bEnableTSkipFast && param->bEnableTransformSkip, "tskip"); + TOOLVAL(param->limitTU , "limit-tu=%d"); TOOLOPT(param->bCULossless, "cu-lossless"); TOOLOPT(param->bEnableSignHiding, "signhide"); TOOLOPT(param->bEnableTemporalMvp, "tmvp"); @@ -1423,7 +1432,7 @@ fflush(stderr); } -char *x265_param2string(x265_param* p) +char *x265_param2string(x265_param* p, int padx, int pady) { char *buf, *s; @@ -1434,70 +1443,92 @@ #define BOOL(param, cliopt) \ s += sprintf(s, " %s", (param) ? 
cliopt : "no-" cliopt); - s += sprintf(s, "%dx%d", p->sourceWidth,p->sourceHeight); - s += sprintf(s, " fps=%u/%u", p->fpsNum, p->fpsDenom); - s += sprintf(s, " bitdepth=%d", p->internalBitDepth); + s += sprintf(s, "cpuid=%d", p->cpuid); + s += sprintf(s, " frame-threads=%d", p->frameNumThreads); + if (p->numaPools) + s += sprintf(s, " numa-pools=%s", p->numaPools); BOOL(p->bEnableWavefront, "wpp"); + BOOL(p->bDistributeModeAnalysis, "pmode"); + BOOL(p->bDistributeMotionEstimation, "pme"); + BOOL(p->bEnablePsnr, "psnr"); + BOOL(p->bEnableSsim, "ssim"); + s += sprintf(s, " log-level=%d", p->logLevel); + s += sprintf(s, " bitdepth=%d", p->internalBitDepth); + s += sprintf(s, " input-csp=%d", p->internalCsp); + s += sprintf(s, " fps=%u/%u", p->fpsNum, p->fpsDenom); + s += sprintf(s, " input-res=%dx%d", p->sourceWidth - padx, p->sourceHeight - pady); + s += sprintf(s, " interlace=%d", p->interlaceMode); + s += sprintf(s, " total-frames=%d", p->totalFrames); + s += sprintf(s, " level-idc=%d", p->levelIdc); + s += sprintf(s, " high-tier=%d", p->bHighTier); + s += sprintf(s, " uhd-bd=%d", p->uhdBluray); + s += sprintf(s, " ref=%d", p->maxNumReferences); + BOOL(p->bAllowNonConformance, "allow-non-conformance"); + BOOL(p->bRepeatHeaders, "repeat-headers"); + BOOL(p->bAnnexB, "annexb"); + BOOL(p->bEnableAccessUnitDelimiters, "aud"); + BOOL(p->bEmitHRDSEI, "hrd"); + BOOL(p->bEmitInfoSEI, "info"); + s += sprintf(s, " hash=%d", p->decodedPictureHashSEI); + BOOL(p->bEnableTemporalSubLayers, "temporal-layers"); + BOOL(p->bOpenGOP, "open-gop"); + s += sprintf(s, " min-keyint=%d", p->keyframeMin); + s += sprintf(s, " keyint=%d", p->keyframeMax); + s += sprintf(s, " bframes=%d", p->bframes); + s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive); + BOOL(p->bBPyramid, "b-pyramid"); + s += sprintf(s, " bframe-bias=%d", p->bFrameBias); + s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth); + s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices); + s += sprintf(s, " scenecut=%d", p->scenecutThreshold); + BOOL(p->bIntraRefresh, "intra-refresh"); s += sprintf(s, " ctu=%d", p->maxCUSize); s += sprintf(s, " min-cu-size=%d", p->minCUSize); - s += sprintf(s, " max-tu-size=%d", p->maxTUSize); - s += sprintf(s, " tu-intra-depth=%d", p->tuQTMaxIntraDepth); - s += sprintf(s, " tu-inter-depth=%d", p->tuQTMaxInterDepth); - s += sprintf(s, " me=%d", p->searchMethod); - s += sprintf(s, " subme=%d", p->subpelRefine); - s += sprintf(s, " merange=%d", p->searchRange); BOOL(p->bEnableRectInter, "rect"); BOOL(p->bEnableAMP, "amp"); - s += sprintf(s, " max-merge=%d", p->maxNumMergeCand); - BOOL(p->bEnableTemporalMvp, "temporal-mvp"); - BOOL(p->bEnableEarlySkip, "early-skip"); - BOOL(p->bEnableRecursionSkip, "rskip"); - s += sprintf(s, " rdpenalty=%d", p->rdPenalty); + s += sprintf(s, " max-tu-size=%d", p->maxTUSize); + s += sprintf(s, " tu-inter-depth=%d", p->tuQTMaxInterDepth); + s += sprintf(s, " tu-intra-depth=%d", p->tuQTMaxIntraDepth); + s += sprintf(s, " limit-tu=%d", p->limitTU); + s += sprintf(s, " rdoq-level=%d", p->rdoqLevel); + BOOL(p->bEnableSignHiding, "signhide"); BOOL(p->bEnableTransformSkip, "tskip"); - BOOL(p->bEnableTSkipFast, "tskip-fast"); - BOOL(p->bEnableStrongIntraSmoothing, "strong-intra-smoothing"); - BOOL(p->bLossless, "lossless"); - BOOL(p->bCULossless, "cu-lossless"); + s += sprintf(s, " nr-intra=%d", p->noiseReductionIntra); + s += sprintf(s, " nr-inter=%d", p->noiseReductionInter); BOOL(p->bEnableConstrainedIntra, "constrained-intra"); - BOOL(p->bEnableFastIntra, "fast-intra"); - 
BOOL(p->bOpenGOP, "open-gop"); - BOOL(p->bEnableTemporalSubLayers, "temporal-layers"); - s += sprintf(s, " interlace=%d", p->interlaceMode); - s += sprintf(s, " keyint=%d", p->keyframeMax); - s += sprintf(s, " min-keyint=%d", p->keyframeMin); - s += sprintf(s, " scenecut=%d", p->scenecutThreshold); - s += sprintf(s, " rc-lookahead=%d", p->lookaheadDepth); - s += sprintf(s, " lookahead-slices=%d", p->lookaheadSlices); - s += sprintf(s, " bframes=%d", p->bframes); - s += sprintf(s, " bframe-bias=%d", p->bFrameBias); - s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive); - s += sprintf(s, " ref=%d", p->maxNumReferences); + BOOL(p->bEnableStrongIntraSmoothing, "strong-intra-smoothing"); + s += sprintf(s, " max-merge=%d", p->maxNumMergeCand); s += sprintf(s, " limit-refs=%d", p->limitReferences); BOOL(p->limitModes, "limit-modes"); + s += sprintf(s, " me=%d", p->searchMethod); + s += sprintf(s, " subme=%d", p->subpelRefine); + s += sprintf(s, " merange=%d", p->searchRange); + BOOL(p->bEnableTemporalMvp, "temporal-mvp"); BOOL(p->bEnableWeightedPred, "weightp"); BOOL(p->bEnableWeightedBiPred, "weightb"); - s += sprintf(s, " aq-mode=%d", p->rc.aqMode); - s += sprintf(s, " qg-size=%d", p->rc.qgSize); - s += sprintf(s, " aq-strength=%.2f", p->rc.aqStrength); - s += sprintf(s, " cbqpoffs=%d", p->cbQpOffset); - s += sprintf(s, " crqpoffs=%d", p->crQpOffset); - s += sprintf(s, " rd=%d", p->rdLevel); - s += sprintf(s, " psy-rd=%.2f", p->psyRd); - s += sprintf(s, " rdoq-level=%d", p->rdoqLevel); - s += sprintf(s, " psy-rdoq=%.2f", p->psyRdoq); - s += sprintf(s, " log2-max-poc-lsb=%d", p->log2MaxPocLsb); - BOOL(p->bEnableRdRefine, "rd-refine"); - BOOL(p->bEnableSignHiding, "signhide"); + BOOL(p->bSourceReferenceEstimation, "analyze-src-pics"); BOOL(p->bEnableLoopFilter, "deblock"); if (p->bEnableLoopFilter) s += sprintf(s, "=%d:%d", p->deblockingFilterTCOffset, p->deblockingFilterBetaOffset); BOOL(p->bEnableSAO, "sao"); BOOL(p->bSaoNonDeblocked, "sao-non-deblock"); - BOOL(p->bBPyramid, "b-pyramid"); - BOOL(p->rc.cuTree, "cutree"); - BOOL(p->bIntraRefresh, "intra-refresh"); + s += sprintf(s, " rd=%d", p->rdLevel); + BOOL(p->bEnableEarlySkip, "early-skip"); + BOOL(p->bEnableRecursionSkip, "rskip"); + BOOL(p->bEnableFastIntra, "fast-intra"); + BOOL(p->bEnableTSkipFast, "tskip-fast"); + BOOL(p->bCULossless, "cu-lossless"); + BOOL(p->bIntraInBFrames, "b-intra"); + s += sprintf(s, " rdpenalty=%d", p->rdPenalty); + s += sprintf(s, " psy-rd=%.2f", p->psyRd); + s += sprintf(s, " psy-rdoq=%.2f", p->psyRdoq); + BOOL(p->bEnableRdRefine, "rd-refine"); + s += sprintf(s, " analysis-mode=%d", p->analysisMode); + BOOL(p->bLossless, "lossless"); + s += sprintf(s, " cbqpoffs=%d", p->cbQpOffset); + s += sprintf(s, " crqpoffs=%d", p->crQpOffset); s += sprintf(s, " rc=%s", p->rc.rateControlMode == X265_RC_ABR ? ( - p->rc.bStatRead ? "2 pass" : p->rc.bitrate == p->rc.vbvMaxBitrate ? "cbr" : "abr") + p->rc.bitrate == p->rc.vbvMaxBitrate ? "cbr" : "abr") : p->rc.rateControlMode == X265_RC_CRF ? 
"crf" : "cqp"); if (p->rc.rateControlMode == X265_RC_ABR || p->rc.rateControlMode == X265_RC_CRF) { @@ -1505,17 +1536,20 @@ s += sprintf(s, " crf=%.1f", p->rc.rfConstant); else s += sprintf(s, " bitrate=%d", p->rc.bitrate); - s += sprintf(s, " qcomp=%.2f qpmin=%d qpmax=%d qpstep=%d", - p->rc.qCompress, p->rc.qpMin, p->rc.qpMax, p->rc.qpStep); + s += sprintf(s, " qcomp=%.2f qpstep=%d", p->rc.qCompress, p->rc.qpStep); + s += sprintf(s, " stats-write=%d", p->rc.bStatWrite); + s += sprintf(s, " stats-read=%d", p->rc.bStatRead); if (p->rc.bStatRead) - s += sprintf( s, " cplxblur=%.1f qblur=%.1f", - p->rc.complexityBlur, p->rc.qblur); + s += sprintf(s, " cplxblur=%.1f qblur=%.1f", + p->rc.complexityBlur, p->rc.qblur); + if (p->rc.bStatWrite && !p->rc.bStatRead) + BOOL(p->rc.bEnableSlowFirstPass, "slow-firstpass"); if (p->rc.vbvBufferSize) { - s += sprintf(s, " vbv-maxrate=%d vbv-bufsize=%d", - p->rc.vbvMaxBitrate, p->rc.vbvBufferSize); + s += sprintf(s, " vbv-maxrate=%d vbv-bufsize=%d vbv-init=%.1f", + p->rc.vbvMaxBitrate, p->rc.vbvBufferSize, p->rc.vbvBufferInit); if (p->rc.rateControlMode == X265_RC_CRF) - s += sprintf(s, " crf-max=%.1f", p->rc.rfConstantMax); + s += sprintf(s, " crf-max=%.1f crf-min=%.1f", p->rc.rfConstantMax, p->rc.rfConstantMin); } } else if (p->rc.rateControlMode == X265_RC_CQP) @@ -1526,6 +1560,59 @@ if (p->bframes) s += sprintf(s, " pbratio=%.2f", p->rc.pbFactor); } + s += sprintf(s, " aq-mode=%d", p->rc.aqMode); + s += sprintf(s, " aq-strength=%.2f", p->rc.aqStrength); + BOOL(p->rc.cuTree, "cutree"); + s += sprintf(s, " zone-count=%d", p->rc.zoneCount); + if (p->rc.zoneCount) + { + for (int i = 0; i < p->rc.zoneCount; ++i) + { + s += sprintf(s, " zones: start-frame=%d end-frame=%d", + p->rc.zones[i].startFrame, p->rc.zones[i].endFrame); + if (p->rc.zones[i].bForceQp) + s += sprintf(s, " qp=%d", p->rc.zones[i].qp); + else + s += sprintf(s, " bitrate-factor=%f", p->rc.zones[i].bitrateFactor); + } + } + BOOL(p->rc.bStrictCbr, "strict-cbr"); + s += sprintf(s, " qg-size=%d", p->rc.qgSize); + BOOL(p->rc.bEnableGrain, "rc-grain"); + s += sprintf(s, " qpmax=%d qpmin=%d", p->rc.qpMax, p->rc.qpMin); + s += sprintf(s, " sar=%d", p->vui.aspectRatioIdc); + if (p->vui.aspectRatioIdc == X265_EXTENDED_SAR) + s += sprintf(s, " sar-width : sar-height=%d:%d", p->vui.sarWidth, p->vui.sarHeight); + s += sprintf(s, " overscan=%d", p->vui.bEnableOverscanInfoPresentFlag); + if (p->vui.bEnableOverscanInfoPresentFlag) + s += sprintf(s, " overscan-crop=%d", p->vui.bEnableOverscanAppropriateFlag); + s += sprintf(s, " videoformat=%d", p->vui.videoFormat); + s += sprintf(s, " range=%d", p->vui.bEnableVideoFullRangeFlag); + s += sprintf(s, " colorprim=%d", p->vui.colorPrimaries); + s += sprintf(s, " transfer=%d", p->vui.transferCharacteristics); + s += sprintf(s, " colormatrix=%d", p->vui.matrixCoeffs); + s += sprintf(s, " chromaloc=%d", p->vui.bEnableChromaLocInfoPresentFlag); + if (p->vui.bEnableChromaLocInfoPresentFlag) + s += sprintf(s, " chromaloc-top=%d chromaloc-bottom=%d", + p->vui.chromaSampleLocTypeTopField, p->vui.chromaSampleLocTypeBottomField); + s += sprintf(s, " display-window=%d", p->vui.bEnableDefaultDisplayWindowFlag); + if (p->vui.bEnableDefaultDisplayWindowFlag) + s += sprintf(s, " left=%d top=%d right=%d bottom=%d", + p->vui.defDispWinLeftOffset, p->vui.defDispWinTopOffset, + p->vui.defDispWinRightOffset, p->vui.defDispWinBottomOffset); + if (p->masteringDisplayColorVolume) + s += sprintf(s, " master-display=%s", p->masteringDisplayColorVolume); + s += sprintf(s, " 
max-cll=%hu,%hu", p->maxCLL, p->maxFALL); + s += sprintf(s, " min-luma=%hu", p->minLuma); + s += sprintf(s, " max-luma=%hu", p->maxLuma); + s += sprintf(s, " log2-max-poc-lsb=%d", p->log2MaxPocLsb); + BOOL(p->bEmitVUITimingInfo, "vui-timing-info"); + BOOL(p->bEmitVUIHRDInfo, "vui-hrd-info"); + s += sprintf(s, " slices=%d", p->maxSlices); + BOOL(p->bOptQpPPS, "opt-qp-pps"); + BOOL(p->bOptRefListLengthPPS, "opt-ref-list-length-pps"); + BOOL(p->bMultiPassOptRPS, "multi-pass-opt-rps"); + s += sprintf(s, " scenecut-bias=%.2f", p->scenecutBias); #undef BOOL return buf; }
x265_2.1.tar.gz/source/common/param.h -> x265_2.2.tar.gz/source/common/param.h
Changed
@@ -31,7 +31,7 @@
 int x265_set_globals(x265_param *param);
 void x265_print_params(x265_param *param);
 void x265_param_apply_fastfirstpass(x265_param *p);
-char* x265_param2string(x265_param *param);
+char* x265_param2string(x265_param *param, int padx, int pady);
 int x265_atoi(const char *str, bool& bError);
 double x265_atof(const char *str, bool& bError);
 int parseCpuName(const char *value, bool& bError);
x265_2.1.tar.gz/source/common/pixel.cpp -> x265_2.2.tar.gz/source/common/pixel.cpp
Changed
@@ -117,6 +117,52 @@ } } +template<int lx, int ly> +int ads_x4(int encDC[4], uint32_t *sums, int delta, uint16_t *costMvX, int16_t *mvs, int width, int thresh) +{ + int nmv = 0; + for (int16_t i = 0; i < width; i++, sums++) + { + int ads = abs(encDC[0] - long(sums[0])) + + abs(encDC[1] - long(sums[lx >> 1])) + + abs(encDC[2] - long(sums[delta])) + + abs(encDC[3] - long(sums[delta + (lx >> 1)])) + + costMvX[i]; + if (ads < thresh) + mvs[nmv++] = i; + } + return nmv; +} + +template<int lx, int ly> +int ads_x2(int encDC[2], uint32_t *sums, int delta, uint16_t *costMvX, int16_t *mvs, int width, int thresh) +{ + int nmv = 0; + for (int16_t i = 0; i < width; i++, sums++) + { + int ads = abs(encDC[0] - long(sums[0])) + + abs(encDC[1] - long(sums[delta])) + + costMvX[i]; + if (ads < thresh) + mvs[nmv++] = i; + } + return nmv; +} + +template<int lx, int ly> +int ads_x1(int encDC[1], uint32_t *sums, int, uint16_t *costMvX, int16_t *mvs, int width, int thresh) +{ + int nmv = 0; + for (int16_t i = 0; i < width; i++, sums++) + { + int ads = abs(encDC[0] - long(sums[0])) + + costMvX[i]; + if (ads < thresh) + mvs[nmv++] = i; + } + return nmv; +} + template<int lx, int ly, class T1, class T2> sse_t sse(const T1* pix1, intptr_t stride_pix1, const T2* pix2, intptr_t stride_pix2) { @@ -991,6 +1037,32 @@ LUMA_PU(64, 16); LUMA_PU(16, 64); + p.pu[LUMA_4x4].ads = ads_x1<4, 4>; + p.pu[LUMA_8x8].ads = ads_x1<8, 8>; + p.pu[LUMA_8x4].ads = ads_x2<8, 4>; + p.pu[LUMA_4x8].ads = ads_x2<4, 8>; + p.pu[LUMA_16x16].ads = ads_x4<16, 16>; + p.pu[LUMA_16x8].ads = ads_x2<16, 8>; + p.pu[LUMA_8x16].ads = ads_x2<8, 16>; + p.pu[LUMA_16x12].ads = ads_x1<16, 12>; + p.pu[LUMA_12x16].ads = ads_x1<12, 16>; + p.pu[LUMA_16x4].ads = ads_x1<16, 4>; + p.pu[LUMA_4x16].ads = ads_x1<4, 16>; + p.pu[LUMA_32x32].ads = ads_x4<32, 32>; + p.pu[LUMA_32x16].ads = ads_x2<32, 16>; + p.pu[LUMA_16x32].ads = ads_x2<16, 32>; + p.pu[LUMA_32x24].ads = ads_x4<32, 24>; + p.pu[LUMA_24x32].ads = ads_x4<24, 32>; + p.pu[LUMA_32x8].ads = ads_x4<32, 8>; + p.pu[LUMA_8x32].ads = ads_x4<8, 32>; + p.pu[LUMA_64x64].ads = ads_x4<64, 64>; + p.pu[LUMA_64x32].ads = ads_x2<64, 32>; + p.pu[LUMA_32x64].ads = ads_x2<32, 64>; + p.pu[LUMA_64x48].ads = ads_x4<64, 48>; + p.pu[LUMA_48x64].ads = ads_x4<48, 64>; + p.pu[LUMA_64x16].ads = ads_x4<64, 16>; + p.pu[LUMA_16x64].ads = ads_x4<16, 64>; + p.pu[LUMA_4x4].satd = satd_4x4; p.pu[LUMA_8x8].satd = satd8<8, 8>; p.pu[LUMA_8x4].satd = satd_8x4;
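The ads_x1/x2/x4 kernels added above implement the ADS pruning step of the new SEA search described in the cli documentation: a cheap lower bound on SAD, built from block sums ("DC" values) plus the MV signalling cost, filters candidate positions before any full SAD is computed. A simplified, self-contained sketch of the same idea, with illustrative names rather than the actual x265 call path:

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Lower bound on SAD for one candidate: |sum(src block) - sum(ref block)| + MV cost.
    static int adsCost(int encDC, uint32_t refSum, uint16_t mvCost)
    {
        return std::abs(encDC - int(refSum)) + mvCost;
    }

    // Keep only the candidates along a row whose bound beats the best SAD so far;
    // full SAD is then evaluated just for the survivors.
    static std::vector<int16_t> pruneRow(int encDC, const uint32_t* sums,
                                         const uint16_t* costMvX, int width, int thresh)
    {
        std::vector<int16_t> survivors;
        for (int16_t x = 0; x < width; x++)
            if (adsCost(encDC, sums[x], costMvX[x]) < thresh)
                survivors.push_back(x);
        return survivors;
    }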
x265_2.2.tar.gz/source/common/ppc/dct_altivec.cpp
Added
@@ -0,0 +1,819 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Roger Moussalli <rmoussal@us.ibm.com> + * Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "common.h" +#include "primitives.h" +#include "contexts.h" // costCoeffNxN_c +#include "threading.h" // CLZ +#include "ppccommon.h" + +using namespace X265_NS; + +static uint32_t quant_altivec(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff) +{ + + X265_CHECK(qBits >= 8, "qBits less than 8\n"); + + X265_CHECK((numCoeff % 16) == 0, "numCoeff must be multiple of 16\n"); + + int qBits8 = qBits - 8; + uint32_t numSig = 0; + + + int level[8] ; + int sign[8] ; + int tmplevel[8] ; + + const vector signed short v_zeros = {0, 0, 0, 0, 0, 0, 0, 0} ; + const vector signed short v_neg1 = {-1, -1, -1, -1, -1, -1, -1, -1} ; + const vector signed short v_pos1_ss = {1, 1, 1, 1, 1, 1, 1, 1} ; + const vector signed int v_pos1_sw = {1, 1, 1, 1} ; + + const vector signed int v_clip_high = {32767, 32767, 32767, 32767} ; + const vector signed int v_clip_low = {-32768, -32768, -32768, -32768} ; + + + vector signed short v_level_ss ; + vector signed int v_level_0, v_level_1 ; + vector signed int v_tmplevel_0, v_tmplevel_1 ; + vector signed short v_sign_ss ; + vector signed int v_sign_0, v_sign_1 ; + vector signed int v_quantCoeff_0, v_quantCoeff_1 ; + + vector signed int v_numSig = {0, 0, 0, 0} ; + + vector signed int v_add ; + v_add[0] = add ; + v_add = vec_splat(v_add, 0) ; + + vector unsigned int v_qBits ; + v_qBits[0] = qBits ; + v_qBits = vec_splat(v_qBits, 0) ; + + vector unsigned int v_qBits8 ; + v_qBits8[0] = qBits8 ; + v_qBits8 = vec_splat(v_qBits8, 0) ; + + + for (int blockpos_outer = 0; blockpos_outer < numCoeff; blockpos_outer+=16) + { + int blockpos = blockpos_outer ; + + // for(int ii=0; ii<8; ii++) { level[ii] = coef[blockpos+ii] ;} + v_level_ss = vec_xl(0, &coef[blockpos]) ; + v_level_0 = vec_unpackh(v_level_ss) ; + v_level_1 = vec_unpackl(v_level_ss) ; + + + // for(int ii=0; ii<8; ii++) { sign[ii] = (level[ii] < 0 ? 
-1 : 1) ;} + vector bool short v_level_cmplt0 ; + v_level_cmplt0 = vec_cmplt(v_level_ss, v_zeros) ; + v_sign_ss = vec_sel(v_pos1_ss, v_neg1, v_level_cmplt0) ; + v_sign_0 = vec_unpackh(v_sign_ss) ; + v_sign_1 = vec_unpackl(v_sign_ss) ; + + + + // for(int ii=0; ii<8; ii++) { tmplevel[ii] = abs(level[ii]) * quantCoeff[blockpos+ii] ;} + v_level_0 = vec_abs(v_level_0) ; + v_level_1 = vec_abs(v_level_1) ; + v_quantCoeff_0 = vec_xl(0, &quantCoeff[blockpos]) ; + v_quantCoeff_1 = vec_xl(16, &quantCoeff[blockpos]) ; + + asm ("vmuluwm %0,%1,%2" + : "=v" (v_tmplevel_0) + : "v" (v_level_0) , "v" (v_quantCoeff_0) + ) ; + + asm ("vmuluwm %0,%1,%2" + : "=v" (v_tmplevel_1) + : "v" (v_level_1) , "v" (v_quantCoeff_1) + ) ; + + + + // for(int ii=0; ii<8; ii++) { level[ii] = ((tmplevel[ii] + add) >> qBits) ;} + v_level_0 = vec_sra(vec_add(v_tmplevel_0, v_add), v_qBits) ; + v_level_1 = vec_sra(vec_add(v_tmplevel_1, v_add), v_qBits) ; + + // for(int ii=0; ii<8; ii++) { deltaU[blockpos+ii] = ((tmplevel[ii] - (level[ii] << qBits)) >> qBits8) ;} + vector signed int v_temp_0_sw, v_temp_1_sw ; + v_temp_0_sw = vec_sl(v_level_0, v_qBits) ; + v_temp_1_sw = vec_sl(v_level_1, v_qBits) ; + + v_temp_0_sw = vec_sub(v_tmplevel_0, v_temp_0_sw) ; + v_temp_1_sw = vec_sub(v_tmplevel_1, v_temp_1_sw) ; + + v_temp_0_sw = vec_sra(v_temp_0_sw, v_qBits8) ; + v_temp_1_sw = vec_sra(v_temp_1_sw, v_qBits8) ; + + vec_xst(v_temp_0_sw, 0, &deltaU[blockpos]) ; + vec_xst(v_temp_1_sw, 16, &deltaU[blockpos]) ; + + + // for(int ii=0; ii<8; ii++) { if(level[ii]) ++numSig ; } + vector bool int v_level_cmpeq0 ; + vector signed int v_level_inc ; + v_level_cmpeq0 = vec_cmpeq(v_level_0, (vector signed int)v_zeros) ; + v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ; + v_numSig = vec_add(v_numSig, v_level_inc) ; + + v_level_cmpeq0 = vec_cmpeq(v_level_1, (vector signed int)v_zeros) ; + v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ; + v_numSig = vec_add(v_numSig, v_level_inc) ; + + + // for(int ii=0; ii<8; ii++) { level[ii] *= sign[ii]; } + asm ("vmuluwm %0,%1,%2" + : "=v" (v_level_0) + : "v" (v_level_0) , "v" (v_sign_0) + ) ; + + asm ("vmuluwm %0,%1,%2" + : "=v" (v_level_1) + : "v" (v_level_1) , "v" (v_sign_1) + ) ; + + + + // for(int ii=0; ii<8; ii++) {qCoef[blockpos+ii] = (int16_t)x265_clip3(-32768, 32767, level[ii]);} + vector bool int v_level_cmp_clip_high, v_level_cmp_clip_low ; + + v_level_cmp_clip_high = vec_cmpgt(v_level_0, v_clip_high) ; + v_level_0 = vec_sel(v_level_0, v_clip_high, v_level_cmp_clip_high) ; + v_level_cmp_clip_low = vec_cmplt(v_level_0, v_clip_low) ; + v_level_0 = vec_sel(v_level_0, v_clip_low, v_level_cmp_clip_low) ; + + + v_level_cmp_clip_high = vec_cmpgt(v_level_1, v_clip_high) ; + v_level_1 = vec_sel(v_level_1, v_clip_high, v_level_cmp_clip_high) ; + v_level_cmp_clip_low = vec_cmplt(v_level_1, v_clip_low) ; + v_level_1 = vec_sel(v_level_1, v_clip_low, v_level_cmp_clip_low) ; + + v_level_ss = vec_pack(v_level_0, v_level_1) ; + + vec_xst(v_level_ss, 0, &qCoef[blockpos]) ; + + + + + // UNROLL ONCE MORE (which is ok since loops for multiple of 16 times, though that is NOT obvious to the compiler) + blockpos += 8 ; + + // for(int ii=0; ii<8; ii++) { level[ii] = coef[blockpos+ii] ;} + v_level_ss = vec_xl(0, &coef[blockpos]) ; + v_level_0 = vec_unpackh(v_level_ss) ; + v_level_1 = vec_unpackl(v_level_ss) ; + + + // for(int ii=0; ii<8; ii++) { sign[ii] = (level[ii] < 0 ? 
-1 : 1) ;} + v_level_cmplt0 = vec_cmplt(v_level_ss, v_zeros) ; + v_sign_ss = vec_sel(v_pos1_ss, v_neg1, v_level_cmplt0) ; + v_sign_0 = vec_unpackh(v_sign_ss) ; + v_sign_1 = vec_unpackl(v_sign_ss) ; + + + + // for(int ii=0; ii<8; ii++) { tmplevel[ii] = abs(level[ii]) * quantCoeff[blockpos+ii] ;} + v_level_0 = vec_abs(v_level_0) ; + v_level_1 = vec_abs(v_level_1) ; + v_quantCoeff_0 = vec_xl(0, &quantCoeff[blockpos]) ; + v_quantCoeff_1 = vec_xl(16, &quantCoeff[blockpos]) ; + + asm ("vmuluwm %0,%1,%2" + : "=v" (v_tmplevel_0) + : "v" (v_level_0) , "v" (v_quantCoeff_0) + ) ; + + asm ("vmuluwm %0,%1,%2" + : "=v" (v_tmplevel_1) + : "v" (v_level_1) , "v" (v_quantCoeff_1) + ) ; + + + + // for(int ii=0; ii<8; ii++) { level[ii] = ((tmplevel[ii] + add) >> qBits) ;} + v_level_0 = vec_sra(vec_add(v_tmplevel_0, v_add), v_qBits) ; + v_level_1 = vec_sra(vec_add(v_tmplevel_1, v_add), v_qBits) ; + + // for(int ii=0; ii<8; ii++) { deltaU[blockpos+ii] = ((tmplevel[ii] - (level[ii] << qBits)) >> qBits8) ;} + v_temp_0_sw = vec_sl(v_level_0, v_qBits) ; + v_temp_1_sw = vec_sl(v_level_1, v_qBits) ; + + v_temp_0_sw = vec_sub(v_tmplevel_0, v_temp_0_sw) ; + v_temp_1_sw = vec_sub(v_tmplevel_1, v_temp_1_sw) ; + + v_temp_0_sw = vec_sra(v_temp_0_sw, v_qBits8) ; + v_temp_1_sw = vec_sra(v_temp_1_sw, v_qBits8) ; + + vec_xst(v_temp_0_sw, 0, &deltaU[blockpos]) ; + vec_xst(v_temp_1_sw, 16, &deltaU[blockpos]) ; + + + // for(int ii=0; ii<8; ii++) { if(level[ii]) ++numSig ; } + v_level_cmpeq0 = vec_cmpeq(v_level_0, (vector signed int)v_zeros) ; + v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ; + v_numSig = vec_add(v_numSig, v_level_inc) ; + + v_level_cmpeq0 = vec_cmpeq(v_level_1, (vector signed int)v_zeros) ; + v_level_inc = vec_sel(v_pos1_sw, (vector signed int)v_zeros, v_level_cmpeq0) ; + v_numSig = vec_add(v_numSig, v_level_inc) ; + + + // for(int ii=0; ii<8; ii++) { level[ii] *= sign[ii]; } + asm ("vmuluwm %0,%1,%2" + : "=v" (v_level_0) + : "v" (v_level_0) , "v" (v_sign_0) + ) ; + + asm ("vmuluwm %0,%1,%2" + : "=v" (v_level_1) + : "v" (v_level_1) , "v" (v_sign_1) + ) ; + + + + // for(int ii=0; ii<8; ii++) {qCoef[blockpos+ii] = (int16_t)x265_clip3(-32768, 32767, level[ii]);} + v_level_cmp_clip_high = vec_cmpgt(v_level_0, v_clip_high) ; + v_level_0 = vec_sel(v_level_0, v_clip_high, v_level_cmp_clip_high) ; + v_level_cmp_clip_low = vec_cmplt(v_level_0, v_clip_low) ; + v_level_0 = vec_sel(v_level_0, v_clip_low, v_level_cmp_clip_low) ; + + + v_level_cmp_clip_high = vec_cmpgt(v_level_1, v_clip_high) ; + v_level_1 = vec_sel(v_level_1, v_clip_high, v_level_cmp_clip_high) ; + v_level_cmp_clip_low = vec_cmplt(v_level_1, v_clip_low) ; + v_level_1 = vec_sel(v_level_1, v_clip_low, v_level_cmp_clip_low) ; + + v_level_ss = vec_pack(v_level_0, v_level_1) ; + + vec_xst(v_level_ss, 0, &qCoef[blockpos]) ; + + + } + + v_numSig = vec_sums(v_numSig, (vector signed int)v_zeros) ; + + // return numSig; + return v_numSig[3] ; +} // end quant_altivec() + + +inline void denoiseDct_unroll8_altivec(int16_t* dctCoef, uint32_t* resSum, const uint16_t* offset, int numCoeff, int index_offset) +{ + vector short v_level_ss, v_sign_ss ; + vector int v_level_h_sw, v_level_l_sw ; + vector int v_level_h_processed_sw, v_level_l_processed_sw ; + vector int v_sign_h_sw, v_sign_l_sw ; + vector unsigned int v_resSum_h_uw, v_resSum_l_uw ; + vector unsigned short v_offset_us ; + vector unsigned int v_offset_h_uw, v_offset_l_uw ; + const vector unsigned short v_shamt_us = {15,15,15,15,15,15,15,15} ; + const vector unsigned int v_unpack_mask = 
{0x0FFFF, 0x0FFFF, 0x0FFFF, 0x0FFFF} ; + vector bool int vec_less_than_zero_h_bw, vec_less_than_zero_l_bw ; + LOAD_ZERO; + + // for(int jj=0; jj<8; jj++) v_level[jj]=dctCoef[ii*8 + jj] ; + v_level_ss = vec_xl(0, &dctCoef[index_offset]) ; + v_level_h_sw = vec_unpackh(v_level_ss) ; + v_level_l_sw = vec_unpackl(v_level_ss) ; + + // for(int jj=0; jj<8; jj++) v_sign[jj] = v_level[jj] >> 31 ; + v_sign_ss = vec_sra(v_level_ss, v_shamt_us) ; + v_sign_h_sw = vec_unpackh(v_sign_ss) ; + v_sign_l_sw = vec_unpackl(v_sign_ss) ; + + + + // for(int jj=0; jj<8; jj++) v_level[jj] = (v_level[jj] + v_sign[jj]) ^ v_sign[jj] ; + v_level_h_sw = vec_add(v_level_h_sw, v_sign_h_sw) ; + v_level_l_sw = vec_add(v_level_l_sw, v_sign_l_sw) ; + + v_level_h_sw = vec_xor(v_level_h_sw, v_sign_h_sw) ; + v_level_l_sw = vec_xor(v_level_l_sw, v_sign_l_sw) ; + + + + // for(int jj=0; jj<8; jj++) resSum[ii*8 + jj] += v_level[jj] ; + v_resSum_h_uw = vec_xl(0, &resSum[index_offset]) ; + v_resSum_l_uw = vec_xl(0, &resSum[index_offset + 4]) ; + + v_resSum_h_uw = vec_add(v_resSum_h_uw, (vector unsigned int)v_level_h_sw) ; + v_resSum_l_uw = vec_add(v_resSum_l_uw, (vector unsigned int)v_level_l_sw) ; + + vec_xst(v_resSum_h_uw, 0, &resSum[index_offset]) ; + vec_xst(v_resSum_l_uw, 0, &resSum[index_offset + 4]) ; + + + // for(int jj=0; jj<8; jj++) v_level[jj] -= offset[ii*8 + jj] ; + v_offset_us = vec_xl(0, &offset[index_offset]) ; + v_offset_h_uw = (vector unsigned int)vec_unpackh((vector signed short)v_offset_us) ; + v_offset_l_uw = (vector unsigned int)vec_unpackl((vector signed short)v_offset_us) ; + v_offset_h_uw = vec_and(v_offset_h_uw, v_unpack_mask) ; + v_offset_l_uw = vec_and(v_offset_l_uw, v_unpack_mask) ; + v_level_h_sw = vec_sub(v_level_h_sw, (vector signed int) v_offset_h_uw) ; + v_level_l_sw = vec_sub(v_level_l_sw, (vector signed int) v_offset_l_uw) ; + + + // for (int jj = 0; jj < 8; jj++) dctCoef[ii*8 + jj] = (int16_t)(v_level[jj] < 0 ? 
0 : (v_level[jj] ^ v_sign[jj]) - v_sign[jj]); + // (level ^ sign) - sign + v_level_h_processed_sw = vec_xor(v_level_h_sw, v_sign_h_sw) ; + v_level_l_processed_sw = vec_xor(v_level_l_sw, v_sign_l_sw) ; + v_level_h_processed_sw = vec_sub(v_level_h_processed_sw, v_sign_h_sw) ; + v_level_l_processed_sw = vec_sub(v_level_l_processed_sw, v_sign_l_sw) ; + + //vec_less_than_zero_h_bw = vec_cmplt(v_level_h_sw, (vector signed int){0, 0, 0, 0}) ; + //vec_less_than_zero_l_bw = vec_cmplt(v_level_l_sw, (vector signed int){0, 0, 0, 0}) ; + vec_less_than_zero_h_bw = vec_cmplt(v_level_h_sw, zero_s32v) ; + vec_less_than_zero_l_bw = vec_cmplt(v_level_l_sw, zero_s32v) ; + + v_level_h_sw = vec_sel(v_level_h_processed_sw, (vector signed int){0, 0, 0, 0}, vec_less_than_zero_h_bw) ; + v_level_l_sw = vec_sel(v_level_l_processed_sw, (vector signed int){0, 0, 0, 0}, vec_less_than_zero_l_bw) ; + + v_level_ss = vec_pack(v_level_h_sw, v_level_l_sw) ; + + vec_xst(v_level_ss, 0, &dctCoef[index_offset]) ; +} + + +void denoiseDct_altivec(int16_t* dctCoef, uint32_t* resSum, const uint16_t* offset, int numCoeff) +{ + int ii_offset ; + + // For each set of 256 + for(int ii=0; ii<(numCoeff/256); ii++) + { + #pragma unroll + for(int jj=0; jj<32; jj++) + { + denoiseDct_unroll8_altivec(dctCoef, resSum, offset, numCoeff, ii*256 + jj*8) ; + } + } + + ii_offset = ((numCoeff >> 8) << 8) ; + + // For each set of 64 + for(int ii=0; ii<((numCoeff%256) /64); ii++) + { + #pragma unroll + for(int jj=0; jj<8; jj++) + { + denoiseDct_unroll8_altivec(dctCoef, resSum, offset, numCoeff, ii_offset + ii*64 + jj*8) ; + } + } + + + ii_offset = ((numCoeff >> 6) << 6) ; + + // For each set of 8 + for(int ii=0; ii < ((numCoeff%64) /8); ii++) + { + denoiseDct_unroll8_altivec(dctCoef, resSum, offset, numCoeff, ii_offset + (ii*8)) ; + } + + + ii_offset = ((numCoeff >> 3) << 3) ; + + for (int ii = 0; ii < (numCoeff % 8); ii++) + { + int level = dctCoef[ii + ii_offset]; + int sign = level >> 31; + level = (level + sign) ^ sign; + resSum[ii+ii_offset] += level; + level -= offset[ii+ii_offset] ; + dctCoef[ii+ii_offset] = (int16_t)(level < 0 ? 
0 : (level ^ sign) - sign); + } + +} // end denoiseDct_altivec() + + + + +inline void transpose_matrix_8_altivec(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +{ + vector signed short v_src_0 ; + vector signed short v_src_1 ; + vector signed short v_src_2 ; + vector signed short v_src_3 ; + vector signed short v_src_4 ; + vector signed short v_src_5 ; + vector signed short v_src_6 ; + vector signed short v_src_7 ; + + vector signed short v_dst_32s_0 ; + vector signed short v_dst_32s_1 ; + vector signed short v_dst_32s_2 ; + vector signed short v_dst_32s_3 ; + vector signed short v_dst_32s_4 ; + vector signed short v_dst_32s_5 ; + vector signed short v_dst_32s_6 ; + vector signed short v_dst_32s_7 ; + + vector signed short v_dst_64s_0 ; + vector signed short v_dst_64s_1 ; + vector signed short v_dst_64s_2 ; + vector signed short v_dst_64s_3 ; + vector signed short v_dst_64s_4 ; + vector signed short v_dst_64s_5 ; + vector signed short v_dst_64s_6 ; + vector signed short v_dst_64s_7 ; + + vector signed short v_dst_128s_0 ; + vector signed short v_dst_128s_1 ; + vector signed short v_dst_128s_2 ; + vector signed short v_dst_128s_3 ; + vector signed short v_dst_128s_4 ; + vector signed short v_dst_128s_5 ; + vector signed short v_dst_128s_6 ; + vector signed short v_dst_128s_7 ; + + v_src_0 = vec_xl(0, src) ; + v_src_1 = vec_xl( (srcStride*2) , src) ; + v_src_2 = vec_xl( (srcStride*2) * 2, src) ; + v_src_3 = vec_xl( (srcStride*2) * 3, src) ; + v_src_4 = vec_xl( (srcStride*2) * 4, src) ; + v_src_5 = vec_xl( (srcStride*2) * 5, src) ; + v_src_6 = vec_xl( (srcStride*2) * 6, src) ; + v_src_7 = vec_xl( (srcStride*2) * 7, src) ; + + vector unsigned char v_permute_32s_high = {0x00, 0x01, 0x10, 0x11, 0x02, 0x03, 0x12, 0x13, 0x04, 0x05, 0x14, 0x15, 0x06, 0x07, 0x16, 0x17} ; + vector unsigned char v_permute_32s_low = {0x08, 0x09, 0x18, 0x19, 0x0A, 0x0B, 0x1A, 0x1B, 0x0C, 0x0D, 0x1C, 0x1D, 0x0E, 0x0F, 0x1E, 0x1F} ; + vector unsigned char v_permute_64s_high = {0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x04, 0x05, 0x06, 0x07, 0x14, 0x015, 0x16, 0x17} ; + vector unsigned char v_permute_64s_low = {0x08, 0x09, 0x0A, 0x0B, 0x18, 0x19, 0x1A, 0x1B, 0x0C, 0x0D, 0x0E, 0x0F, 0x1C, 0x1D, 0x1E, 0x1F} ; + vector unsigned char v_permute_128s_high = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x015, 0x16, 0x17} ; + vector unsigned char v_permute_128s_low = {0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F} ; + + v_dst_32s_0 = vec_perm(v_src_0, v_src_1, v_permute_32s_high) ; + v_dst_32s_1 = vec_perm(v_src_2, v_src_3, v_permute_32s_high) ; + v_dst_32s_2 = vec_perm(v_src_4, v_src_5, v_permute_32s_high) ; + v_dst_32s_3 = vec_perm(v_src_6, v_src_7, v_permute_32s_high) ; + v_dst_32s_4 = vec_perm(v_src_0, v_src_1, v_permute_32s_low) ; + v_dst_32s_5 = vec_perm(v_src_2, v_src_3, v_permute_32s_low) ; + v_dst_32s_6 = vec_perm(v_src_4, v_src_5, v_permute_32s_low) ; + v_dst_32s_7 = vec_perm(v_src_6, v_src_7, v_permute_32s_low) ; + + v_dst_64s_0 = vec_perm(v_dst_32s_0, v_dst_32s_1, v_permute_64s_high) ; + v_dst_64s_1 = vec_perm(v_dst_32s_2, v_dst_32s_3, v_permute_64s_high) ; + v_dst_64s_2 = vec_perm(v_dst_32s_0, v_dst_32s_1, v_permute_64s_low) ; + v_dst_64s_3 = vec_perm(v_dst_32s_2, v_dst_32s_3, v_permute_64s_low) ; + v_dst_64s_4 = vec_perm(v_dst_32s_4, v_dst_32s_5, v_permute_64s_high) ; + v_dst_64s_5 = vec_perm(v_dst_32s_6, v_dst_32s_7, v_permute_64s_high) ; + v_dst_64s_6 = vec_perm(v_dst_32s_4, v_dst_32s_5, v_permute_64s_low) 
; + v_dst_64s_7 = vec_perm(v_dst_32s_6, v_dst_32s_7, v_permute_64s_low) ; + + v_dst_128s_0 = vec_perm(v_dst_64s_0, v_dst_64s_1, v_permute_128s_high) ; + v_dst_128s_1 = vec_perm(v_dst_64s_0, v_dst_64s_1, v_permute_128s_low) ; + v_dst_128s_2 = vec_perm(v_dst_64s_2, v_dst_64s_3, v_permute_128s_high) ; + v_dst_128s_3 = vec_perm(v_dst_64s_2, v_dst_64s_3, v_permute_128s_low) ; + v_dst_128s_4 = vec_perm(v_dst_64s_4, v_dst_64s_5, v_permute_128s_high) ; + v_dst_128s_5 = vec_perm(v_dst_64s_4, v_dst_64s_5, v_permute_128s_low) ; + v_dst_128s_6 = vec_perm(v_dst_64s_6, v_dst_64s_7, v_permute_128s_high) ; + v_dst_128s_7 = vec_perm(v_dst_64s_6, v_dst_64s_7, v_permute_128s_low) ; + + + vec_xst(v_dst_128s_0, 0, dst) ; + vec_xst(v_dst_128s_1, (dstStride*2) , dst) ; + vec_xst(v_dst_128s_2, (dstStride*2) * 2, dst) ; + vec_xst(v_dst_128s_3, (dstStride*2) * 3, dst) ; + vec_xst(v_dst_128s_4, (dstStride*2) * 4, dst) ; + vec_xst(v_dst_128s_5, (dstStride*2) * 5, dst) ; + vec_xst(v_dst_128s_6, (dstStride*2) * 6, dst) ; + vec_xst(v_dst_128s_7, (dstStride*2) * 7, dst) ; + +} // end transpose_matrix_8_altivec() + + +inline void transpose_matrix_16_altivec(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +{ + transpose_matrix_8_altivec((int16_t *)src, srcStride, (int16_t *)dst, dstStride) ; + transpose_matrix_8_altivec((int16_t *)&src[8] , srcStride, (int16_t *)&dst[dstStride*8], dstStride) ; + transpose_matrix_8_altivec((int16_t *)&src[srcStride*8], srcStride, (int16_t *)&dst[8], dstStride) ; + transpose_matrix_8_altivec((int16_t *)&src[srcStride*8 + 8], srcStride, (int16_t *)&dst[dstStride*8 + 8], dstStride) ; +} // end transpose_matrix_16_altivec() + + +inline void transpose_matrix_32_altivec(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +{ + transpose_matrix_16_altivec((int16_t *)src, srcStride, (int16_t *)dst, dstStride) ; + transpose_matrix_16_altivec((int16_t *)&src[16] , srcStride, (int16_t *)&dst[dstStride*16], dstStride) ; + transpose_matrix_16_altivec((int16_t *)&src[srcStride*16], srcStride, (int16_t *)&dst[16], dstStride) ; + transpose_matrix_16_altivec((int16_t *)&src[srcStride*16 + 16], srcStride, (int16_t *)&dst[dstStride*16 + 16], dstStride) ; +} // end transpose_matrix_32_altivec() + + +inline static void partialButterfly32_transposedSrc_altivec(const int16_t* __restrict__ src, int16_t* __restrict__ dst, int shift) +{ + const int line = 32 ; + + int j, k; + int E[16][8], O[16][8]; + int EE[8][8], EO[8][8]; + int EEE[4][8], EEO[4][8]; + int EEEE[2][8], EEEO[2][8]; + int add = 1 << (shift - 1); + + for (j = 0; j < line/8; j++) + { + /* E and O*/ + for(int ii=0; ii<8; ii++) { E[0][ii] = src[(0*line) + ii] + src[((31 - 0)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[0][ii] = src[(0*line) + ii] - src[((31 - 0)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[1][ii] = src[(1*line) + ii] + src[((31 - 1)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[1][ii] = src[(1*line) + ii] - src[((31 - 1)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[2][ii] = src[(2*line) + ii] + src[((31 - 2)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[2][ii] = src[(2*line) + ii] - src[((31 - 2)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[3][ii] = src[(3*line) + ii] + src[((31 - 3)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[3][ii] = src[(3*line) + ii] - src[((31 - 3)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[4][ii] = src[(4*line) + ii] + src[((31 - 4)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[4][ii] = src[(4*line) + ii] - src[((31 - 4)*line) + ii] ; } + + 
for(int ii=0; ii<8; ii++) { E[5][ii] = src[(5*line) + ii] + src[((31 - 5)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[5][ii] = src[(5*line) + ii] - src[((31 - 5)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[6][ii] = src[(6*line) + ii] + src[((31 - 6)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[6][ii] = src[(6*line) + ii] - src[((31 - 6)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[7][ii] = src[(7*line) + ii] + src[((31 - 7)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[7][ii] = src[(7*line) + ii] - src[((31 - 7)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[8][ii] = src[(8*line) + ii] + src[((31 - 8)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[8][ii] = src[(8*line) + ii] - src[((31 - 8)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[9][ii] = src[(9*line) + ii] + src[((31 - 9)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[9][ii] = src[(9*line) + ii] - src[((31 - 9)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[10][ii] = src[(10*line) + ii] + src[((31 - 10)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[10][ii] = src[(10*line) + ii] - src[((31 - 10)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[11][ii] = src[(11*line) + ii] + src[((31 - 11)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[11][ii] = src[(11*line) + ii] - src[((31 - 11)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[12][ii] = src[(12*line) + ii] + src[((31 - 12)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[12][ii] = src[(12*line) + ii] - src[((31 - 12)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[13][ii] = src[(13*line) + ii] + src[((31 - 13)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[13][ii] = src[(13*line) + ii] - src[((31 - 13)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[14][ii] = src[(14*line) + ii] + src[((31 - 14)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[14][ii] = src[(14*line) + ii] - src[((31 - 14)*line) + ii] ; } + + for(int ii=0; ii<8; ii++) { E[15][ii] = src[(15*line) + ii] + src[((31 - 15)*line) + ii] ; } + for(int ii=0; ii<8; ii++) { O[15][ii] = src[(15*line) + ii] - src[((31 - 15)*line) + ii] ; } + + + /* EE and EO */ + for(int ii=0; ii<8; ii++) {EE[0][ii] = E[0][ii] + E[15 - 0][ii];} + for(int ii=0; ii<8; ii++) {EO[0][ii] = E[0][ii] - E[15 - 0][ii];} + + for(int ii=0; ii<8; ii++) {EE[1][ii] = E[1][ii] + E[15 - 1][ii];} + for(int ii=0; ii<8; ii++) {EO[1][ii] = E[1][ii] - E[15 - 1][ii];} + + for(int ii=0; ii<8; ii++) {EE[2][ii] = E[2][ii] + E[15 - 2][ii];} + for(int ii=0; ii<8; ii++) {EO[2][ii] = E[2][ii] - E[15 - 2][ii];} + + for(int ii=0; ii<8; ii++) {EE[3][ii] = E[3][ii] + E[15 - 3][ii];} + for(int ii=0; ii<8; ii++) {EO[3][ii] = E[3][ii] - E[15 - 3][ii];} + + for(int ii=0; ii<8; ii++) {EE[4][ii] = E[4][ii] + E[15 - 4][ii];} + for(int ii=0; ii<8; ii++) {EO[4][ii] = E[4][ii] - E[15 - 4][ii];} + + for(int ii=0; ii<8; ii++) {EE[5][ii] = E[5][ii] + E[15 - 5][ii];} + for(int ii=0; ii<8; ii++) {EO[5][ii] = E[5][ii] - E[15 - 5][ii];} + + for(int ii=0; ii<8; ii++) {EE[6][ii] = E[6][ii] + E[15 - 6][ii];} + for(int ii=0; ii<8; ii++) {EO[6][ii] = E[6][ii] - E[15 - 6][ii];} + + for(int ii=0; ii<8; ii++) {EE[7][ii] = E[7][ii] + E[15 - 7][ii];} + for(int ii=0; ii<8; ii++) {EO[7][ii] = E[7][ii] - E[15 - 7][ii];} + + + /* EEE and EEO */ + for(int ii=0; ii<8; ii++) {EEE[0][ii] = EE[0][ii] + EE[7 - 0][ii];} + for(int ii=0; ii<8; ii++) {EEO[0][ii] = EE[0][ii] - EE[7 - 0][ii];} + + for(int ii=0; ii<8; ii++) {EEE[1][ii] = EE[1][ii] + EE[7 - 1][ii];} + for(int ii=0; ii<8; ii++) {EEO[1][ii] = EE[1][ii] - EE[7 - 1][ii];} + + for(int ii=0; ii<8; ii++) {EEE[2][ii] = 
EE[2][ii] + EE[7 - 2][ii];} + for(int ii=0; ii<8; ii++) {EEO[2][ii] = EE[2][ii] - EE[7 - 2][ii];} + + for(int ii=0; ii<8; ii++) {EEE[3][ii] = EE[3][ii] + EE[7 - 3][ii];} + for(int ii=0; ii<8; ii++) {EEO[3][ii] = EE[3][ii] - EE[7 - 3][ii];} + + + + /* EEEE and EEEO */ + for(int ii=0; ii<8; ii++) {EEEE[0][ii] = EEE[0][ii] + EEE[3][ii];} + for(int ii=0; ii<8; ii++) {EEEO[0][ii] = EEE[0][ii] - EEE[3][ii];} + + for(int ii=0; ii<8; ii++) {EEEE[1][ii] = EEE[1][ii] + EEE[2][ii];} + for(int ii=0; ii<8; ii++) {EEEO[1][ii] = EEE[1][ii] - EEE[2][ii];} + + + /* writing to dst */ + for(int ii=0; ii<8; ii++) {dst[0 + ii] = (int16_t)((g_t32[0][0] * EEEE[0][ii] + g_t32[0][1] * EEEE[1][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(16 * line) + ii] = (int16_t)((g_t32[16][0] * EEEE[0][ii] + g_t32[16][1] * EEEE[1][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(8 * line ) + ii] = (int16_t)((g_t32[8][0] * EEEO[0][ii] + g_t32[8][1] * EEEO[1][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(24 * line) + ii] = (int16_t)((g_t32[24][0] * EEEO[0][ii] + g_t32[24][1] * EEEO[1][ii] + add) >> shift);} + + for(int ii=0; ii<8; ii++) {dst[(4 * line) + ii] = (int16_t)((g_t32[4][0] * EEO[0][ii] + g_t32[4][1] * EEO[1][ii] + g_t32[4][2] * EEO[2][ii] + g_t32[4][3] * EEO[3][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(12 * line) + ii] = (int16_t)((g_t32[12][0] * EEO[0][ii] + g_t32[12][1] * EEO[1][ii] + g_t32[12][2] * EEO[2][ii] + g_t32[12][3] * EEO[3][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(20 * line) + ii] = (int16_t)((g_t32[20][0] * EEO[0][ii] + g_t32[20][1] * EEO[1][ii] + g_t32[20][2] * EEO[2][ii] + g_t32[20][3] * EEO[3][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(28 * line) + ii] = (int16_t)((g_t32[28][0] * EEO[0][ii] + g_t32[28][1] * EEO[1][ii] + g_t32[28][2] * EEO[2][ii] + g_t32[28][3] * EEO[3][ii] + add) >> shift);} + + for(int ii=0; ii<8; ii++) {dst[(2 * line) + ii] = (int16_t)((g_t32[2][0] * EO[0][ii] + g_t32[2][1] * EO[1][ii] + g_t32[2][2] * EO[2][ii] + g_t32[2][3] * EO[3][ii] + g_t32[2][4] * EO[4][ii] + g_t32[2][5] * EO[5][ii] + g_t32[2][6] * EO[6][ii] + g_t32[2][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(6 * line) + ii] = (int16_t)((g_t32[6][0] * EO[0][ii] + g_t32[6][1] * EO[1][ii] + g_t32[6][2] * EO[2][ii] + g_t32[6][3] * EO[3][ii] + g_t32[6][4] * EO[4][ii] + g_t32[6][5] * EO[5][ii] + g_t32[6][6] * EO[6][ii] + g_t32[6][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(10 * line) + ii] = (int16_t)((g_t32[10][0] * EO[0][ii] + g_t32[10][1] * EO[1][ii] + g_t32[10][2] * EO[2][ii] + g_t32[10][3] * EO[3][ii] + g_t32[10][4] * EO[4][ii] + g_t32[10][5] * EO[5][ii] + g_t32[10][6] * EO[6][ii] + g_t32[10][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(14 * line) + ii] = (int16_t)((g_t32[14][0] * EO[0][ii] + g_t32[14][1] * EO[1][ii] + g_t32[14][2] * EO[2][ii] + g_t32[14][3] * EO[3][ii] + g_t32[14][4] * EO[4][ii] + g_t32[14][5] * EO[5][ii] + g_t32[14][6] * EO[6][ii] + g_t32[14][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(18 * line) + ii] = (int16_t)((g_t32[18][0] * EO[0][ii] + g_t32[18][1] * EO[1][ii] + g_t32[18][2] * EO[2][ii] + g_t32[18][3] * EO[3][ii] + g_t32[18][4] * EO[4][ii] + g_t32[18][5] * EO[5][ii] + g_t32[18][6] * EO[6][ii] + g_t32[18][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(22 * line) + ii] = (int16_t)((g_t32[22][0] * EO[0][ii] + g_t32[22][1] * EO[1][ii] + g_t32[22][2] * EO[2][ii] + g_t32[22][3] * EO[3][ii] + g_t32[22][4] * EO[4][ii] + 
g_t32[22][5] * EO[5][ii] + g_t32[22][6] * EO[6][ii] + g_t32[22][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(26 * line) + ii] = (int16_t)((g_t32[26][0] * EO[0][ii] + g_t32[26][1] * EO[1][ii] + g_t32[26][2] * EO[2][ii] + g_t32[26][3] * EO[3][ii] + g_t32[26][4] * EO[4][ii] + g_t32[26][5] * EO[5][ii] + g_t32[26][6] * EO[6][ii] + g_t32[26][7] * EO[7][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) {dst[(30 * line) + ii] = (int16_t)((g_t32[30][0] * EO[0][ii] + g_t32[30][1] * EO[1][ii] + g_t32[30][2] * EO[2][ii] + g_t32[30][3] * EO[3][ii] + g_t32[30][4] * EO[4][ii] + g_t32[30][5] * EO[5][ii] + g_t32[30][6] * EO[6][ii] + g_t32[30][7] * EO[7][ii] + add) >> shift);} + + + for(int ii=0; ii<8; ii++) { dst[(1 * line) + ii] = (int16_t)((g_t32[1][0] * O[0][ii] + g_t32[1][1] * O[1][ii] + g_t32[1][2] * O[2][ii] + g_t32[1][3] * O[3][ii] + g_t32[1][4] * O[4][ii] + g_t32[1][5] * O[5][ii] + g_t32[1][6] * O[6][ii] + g_t32[1][7] * O[7][ii] + g_t32[1][8] * O[8][ii] + g_t32[1][9] * O[9][ii] + g_t32[1][10] * O[10][ii] + g_t32[1][11] * O[11][ii] + g_t32[1][12] * O[12][ii] + g_t32[1][13] * O[13][ii] + g_t32[1][14] * O[14][ii] + g_t32[1][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(3 * line) + ii] = (int16_t)((g_t32[3][0] * O[0][ii] + g_t32[3][1] * O[1][ii] + g_t32[3][2] * O[2][ii] + g_t32[3][3] * O[3][ii] + g_t32[3][4] * O[4][ii] + g_t32[3][5] * O[5][ii] + g_t32[3][6] * O[6][ii] + g_t32[3][7] * O[7][ii] + g_t32[3][8] * O[8][ii] + g_t32[3][9] * O[9][ii] + g_t32[3][10] * O[10][ii] + g_t32[3][11] * O[11][ii] + g_t32[3][12] * O[12][ii] + g_t32[3][13] * O[13][ii] + g_t32[3][14] * O[14][ii] + g_t32[3][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(5 * line) + ii] = (int16_t)((g_t32[5][0] * O[0][ii] + g_t32[5][1] * O[1][ii] + g_t32[5][2] * O[2][ii] + g_t32[5][3] * O[3][ii] + g_t32[5][4] * O[4][ii] + g_t32[5][5] * O[5][ii] + g_t32[5][6] * O[6][ii] + g_t32[5][7] * O[7][ii] + g_t32[5][8] * O[8][ii] + g_t32[5][9] * O[9][ii] + g_t32[5][10] * O[10][ii] + g_t32[5][11] * O[11][ii] + g_t32[5][12] * O[12][ii] + g_t32[5][13] * O[13][ii] + g_t32[5][14] * O[14][ii] + g_t32[5][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(7 * line) + ii] = (int16_t)((g_t32[7][0] * O[0][ii] + g_t32[7][1] * O[1][ii] + g_t32[7][2] * O[2][ii] + g_t32[7][3] * O[3][ii] + g_t32[7][4] * O[4][ii] + g_t32[7][5] * O[5][ii] + g_t32[7][6] * O[6][ii] + g_t32[7][7] * O[7][ii] + g_t32[7][8] * O[8][ii] + g_t32[7][9] * O[9][ii] + g_t32[7][10] * O[10][ii] + g_t32[7][11] * O[11][ii] + g_t32[7][12] * O[12][ii] + g_t32[7][13] * O[13][ii] + g_t32[7][14] * O[14][ii] + g_t32[7][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(9 * line) + ii] = (int16_t)((g_t32[9][0] * O[0][ii] + g_t32[9][1] * O[1][ii] + g_t32[9][2] * O[2][ii] + g_t32[9][3] * O[3][ii] + g_t32[9][4] * O[4][ii] + g_t32[9][5] * O[5][ii] + g_t32[9][6] * O[6][ii] + g_t32[9][7] * O[7][ii] + g_t32[9][8] * O[8][ii] + g_t32[9][9] * O[9][ii] + g_t32[9][10] * O[10][ii] + g_t32[9][11] * O[11][ii] + g_t32[9][12] * O[12][ii] + g_t32[9][13] * O[13][ii] + g_t32[9][14] * O[14][ii] + g_t32[9][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(11 * line) + ii] = (int16_t)((g_t32[11][0] * O[0][ii] + g_t32[11][1] * O[1][ii] + g_t32[11][2] * O[2][ii] + g_t32[11][3] * O[3][ii] + g_t32[11][4] * O[4][ii] + g_t32[11][5] * O[5][ii] + g_t32[11][6] * O[6][ii] + g_t32[11][7] * O[7][ii] + g_t32[11][8] * O[8][ii] + g_t32[11][9] * O[9][ii] + g_t32[11][10] * O[10][ii] + g_t32[11][11] * O[11][ii] + 
g_t32[11][12] * O[12][ii] + g_t32[11][13] * O[13][ii] + g_t32[11][14] * O[14][ii] + g_t32[11][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(13 * line) + ii] = (int16_t)((g_t32[13][0] * O[0][ii] + g_t32[13][1] * O[1][ii] + g_t32[13][2] * O[2][ii] + g_t32[13][3] * O[3][ii] + g_t32[13][4] * O[4][ii] + g_t32[13][5] * O[5][ii] + g_t32[13][6] * O[6][ii] + g_t32[13][7] * O[7][ii] + g_t32[13][8] * O[8][ii] + g_t32[13][9] * O[9][ii] + g_t32[13][10] * O[10][ii] + g_t32[13][11] * O[11][ii] + g_t32[13][12] * O[12][ii] + g_t32[13][13] * O[13][ii] + g_t32[13][14] * O[14][ii] + g_t32[13][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(15 * line) + ii] = (int16_t)((g_t32[15][0] * O[0][ii] + g_t32[15][1] * O[1][ii] + g_t32[15][2] * O[2][ii] + g_t32[15][3] * O[3][ii] + g_t32[15][4] * O[4][ii] + g_t32[15][5] * O[5][ii] + g_t32[15][6] * O[6][ii] + g_t32[15][7] * O[7][ii] + g_t32[15][8] * O[8][ii] + g_t32[15][9] * O[9][ii] + g_t32[15][10] * O[10][ii] + g_t32[15][11] * O[11][ii] + g_t32[15][12] * O[12][ii] + g_t32[15][13] * O[13][ii] + g_t32[15][14] * O[14][ii] + g_t32[15][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(17 * line) + ii] = (int16_t)((g_t32[17][0] * O[0][ii] + g_t32[17][1] * O[1][ii] + g_t32[17][2] * O[2][ii] + g_t32[17][3] * O[3][ii] + g_t32[17][4] * O[4][ii] + g_t32[17][5] * O[5][ii] + g_t32[17][6] * O[6][ii] + g_t32[17][7] * O[7][ii] + g_t32[17][8] * O[8][ii] + g_t32[17][9] * O[9][ii] + g_t32[17][10] * O[10][ii] + g_t32[17][11] * O[11][ii] + g_t32[17][12] * O[12][ii] + g_t32[17][13] * O[13][ii] + g_t32[17][14] * O[14][ii] + g_t32[17][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(19 * line) + ii] = (int16_t)((g_t32[19][0] * O[0][ii] + g_t32[19][1] * O[1][ii] + g_t32[19][2] * O[2][ii] + g_t32[19][3] * O[3][ii] + g_t32[19][4] * O[4][ii] + g_t32[19][5] * O[5][ii] + g_t32[19][6] * O[6][ii] + g_t32[19][7] * O[7][ii] + g_t32[19][8] * O[8][ii] + g_t32[19][9] * O[9][ii] + g_t32[19][10] * O[10][ii] + g_t32[19][11] * O[11][ii] + g_t32[19][12] * O[12][ii] + g_t32[19][13] * O[13][ii] + g_t32[19][14] * O[14][ii] + g_t32[19][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(21 * line) + ii] = (int16_t)((g_t32[21][0] * O[0][ii] + g_t32[21][1] * O[1][ii] + g_t32[21][2] * O[2][ii] + g_t32[21][3] * O[3][ii] + g_t32[21][4] * O[4][ii] + g_t32[21][5] * O[5][ii] + g_t32[21][6] * O[6][ii] + g_t32[21][7] * O[7][ii] + g_t32[21][8] * O[8][ii] + g_t32[21][9] * O[9][ii] + g_t32[21][10] * O[10][ii] + g_t32[21][11] * O[11][ii] + g_t32[21][12] * O[12][ii] + g_t32[21][13] * O[13][ii] + g_t32[21][14] * O[14][ii] + g_t32[21][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(23 * line) + ii] = (int16_t)((g_t32[23][0] * O[0][ii] + g_t32[23][1] * O[1][ii] + g_t32[23][2] * O[2][ii] + g_t32[23][3] * O[3][ii] + g_t32[23][4] * O[4][ii] + g_t32[23][5] * O[5][ii] + g_t32[23][6] * O[6][ii] + g_t32[23][7] * O[7][ii] + g_t32[23][8] * O[8][ii] + g_t32[23][9] * O[9][ii] + g_t32[23][10] * O[10][ii] + g_t32[23][11] * O[11][ii] + g_t32[23][12] * O[12][ii] + g_t32[23][13] * O[13][ii] + g_t32[23][14] * O[14][ii] + g_t32[23][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(25 * line) + ii] = (int16_t)((g_t32[25][0] * O[0][ii] + g_t32[25][1] * O[1][ii] + g_t32[25][2] * O[2][ii] + g_t32[25][3] * O[3][ii] + g_t32[25][4] * O[4][ii] + g_t32[25][5] * O[5][ii] + g_t32[25][6] * O[6][ii] + g_t32[25][7] * O[7][ii] + g_t32[25][8] * O[8][ii] + g_t32[25][9] * O[9][ii] + g_t32[25][10] * O[10][ii] + 
g_t32[25][11] * O[11][ii] + g_t32[25][12] * O[12][ii] + g_t32[25][13] * O[13][ii] + g_t32[25][14] * O[14][ii] + g_t32[25][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(27 * line) + ii] = (int16_t)((g_t32[27][0] * O[0][ii] + g_t32[27][1] * O[1][ii] + g_t32[27][2] * O[2][ii] + g_t32[27][3] * O[3][ii] + g_t32[27][4] * O[4][ii] + g_t32[27][5] * O[5][ii] + g_t32[27][6] * O[6][ii] + g_t32[27][7] * O[7][ii] + g_t32[27][8] * O[8][ii] + g_t32[27][9] * O[9][ii] + g_t32[27][10] * O[10][ii] + g_t32[27][11] * O[11][ii] + g_t32[27][12] * O[12][ii] + g_t32[27][13] * O[13][ii] + g_t32[27][14] * O[14][ii] + g_t32[27][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(29 * line) + ii] = (int16_t)((g_t32[29][0] * O[0][ii] + g_t32[29][1] * O[1][ii] + g_t32[29][2] * O[2][ii] + g_t32[29][3] * O[3][ii] + g_t32[29][4] * O[4][ii] + g_t32[29][5] * O[5][ii] + g_t32[29][6] * O[6][ii] + g_t32[29][7] * O[7][ii] + g_t32[29][8] * O[8][ii] + g_t32[29][9] * O[9][ii] + g_t32[29][10] * O[10][ii] + g_t32[29][11] * O[11][ii] + g_t32[29][12] * O[12][ii] + g_t32[29][13] * O[13][ii] + g_t32[29][14] * O[14][ii] + g_t32[29][15] * O[15][ii] + add) >> shift);} + for(int ii=0; ii<8; ii++) { dst[(31 * line) + ii] = (int16_t)((g_t32[31][0] * O[0][ii] + g_t32[31][1] * O[1][ii] + g_t32[31][2] * O[2][ii] + g_t32[31][3] * O[3][ii] + g_t32[31][4] * O[4][ii] + g_t32[31][5] * O[5][ii] + g_t32[31][6] * O[6][ii] + g_t32[31][7] * O[7][ii] + g_t32[31][8] * O[8][ii] + g_t32[31][9] * O[9][ii] + g_t32[31][10] * O[10][ii] + g_t32[31][11] * O[11][ii] + g_t32[31][12] * O[12][ii] + g_t32[31][13] * O[13][ii] + g_t32[31][14] * O[14][ii] + g_t32[31][15] * O[15][ii] + add) >> shift);} + + src += 8 ; + dst += 8 ; + } +} // end partialButterfly32_transposedSrc_altivec() + + +inline static void partialButterfly16_transposedSrc_altivec(const int16_t* __restrict__ src, int16_t* __restrict__ dst, int shift) +{ + const int line = 16 ; + + int j, k; + int add = 1 << (shift - 1); + + int E[8][8], O[8][8] ; + int EE[4][8], EO[4][8] ; + int EEE[2][8], EEO[2][8] ; + + + for (j = 0; j < line/8; j++) + { + /* E and O */ + for(int ii=0; ii<8; ii++) { E[0][ii] = src[(0*line) + ii] + src[ ((15 - 0) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[0][ii] = src[(0*line) + ii] - src[ ((15 - 0) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[1][ii] = src[(1*line) + ii] + src[ ((15 - 1) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[1][ii] = src[(1*line) + ii] - src[ ((15 - 1) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[2][ii] = src[(2*line) + ii] + src[ ((15 - 2) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[2][ii] = src[(2*line) + ii] - src[ ((15 - 2) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[3][ii] = src[(3*line) + ii] + src[ ((15 - 3) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[3][ii] = src[(3*line) + ii] - src[ ((15 - 3) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[4][ii] = src[(4*line) + ii] + src[ ((15 - 4) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[4][ii] = src[(4*line) + ii] - src[ ((15 - 4) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[5][ii] = src[(5*line) + ii] + src[ ((15 - 5) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[5][ii] = src[(5*line) + ii] - src[ ((15 - 5) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[6][ii] = src[(6*line) + ii] + src[ ((15 - 6) * line) + ii] ;} + for(int ii=0; ii<8; ii++) { O[6][ii] = src[(6*line) + ii] - src[ ((15 - 6) * line) + ii] ;} + + for(int ii=0; ii<8; ii++) { E[7][ii] = src[(7*line) + ii] + src[ ((15 - 7) * line) + ii] ;} + for(int 
ii=0; ii<8; ii++) { O[7][ii] = src[(7*line) + ii] - src[ ((15 - 7) * line) + ii] ;} + + + /* EE and EO */ + for(int ii=0; ii<8; ii++) { EE[0][ii] = E[0][ii] + E[7-0][ii] ;} + for(int ii=0; ii<8; ii++) { EO[0][ii] = E[0][ii] - E[7-0][ii] ;} + + for(int ii=0; ii<8; ii++) { EE[1][ii] = E[1][ii] + E[7-1][ii] ;} + for(int ii=0; ii<8; ii++) { EO[1][ii] = E[1][ii] - E[7-1][ii] ;} + + for(int ii=0; ii<8; ii++) { EE[2][ii] = E[2][ii] + E[7-2][ii] ;} + for(int ii=0; ii<8; ii++) { EO[2][ii] = E[2][ii] - E[7-2][ii] ;} + + for(int ii=0; ii<8; ii++) { EE[3][ii] = E[3][ii] + E[7-3][ii] ;} + for(int ii=0; ii<8; ii++) { EO[3][ii] = E[3][ii] - E[7-3][ii] ;} + + + /* EEE and EEO */ + for(int ii=0; ii<8; ii++) { EEE[0][ii] = EE[0][ii] + EE[3][ii] ;} + for(int ii=0; ii<8; ii++) { EEO[0][ii] = EE[0][ii] - EE[3][ii] ;} + + for(int ii=0; ii<8; ii++) { EEE[1][ii] = EE[1][ii] + EE[2][ii] ;} + for(int ii=0; ii<8; ii++) { EEO[1][ii] = EE[1][ii] - EE[2][ii] ;} + + + /* Writing to dst */ + for(int ii=0; ii<8; ii++) { dst[ 0 + ii] = (int16_t)((g_t16[0][0] * EEE[0][ii] + g_t16[0][1] * EEE[1][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(8 * line) + ii] = (int16_t)((g_t16[8][0] * EEE[0][ii] + g_t16[8][1] * EEE[1][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(4 * line) + ii] = (int16_t)((g_t16[4][0] * EEO[0][ii] + g_t16[4][1] * EEO[1][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(12 * line) + ii] = (int16_t)((g_t16[12][0] * EEO[0][ii] + g_t16[12][1] * EEO[1][ii] + add) >> shift) ; } + + for(int ii=0; ii<8; ii++) { dst[(2 * line) + ii] = (int16_t)((g_t16[2][0] * EO[0][ii] + g_t16[2][1] * EO[1][ii] + g_t16[2][2] * EO[2][ii] + g_t16[2][3] * EO[3][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(6 * line) + ii] = (int16_t)((g_t16[6][0] * EO[0][ii] + g_t16[6][1] * EO[1][ii] + g_t16[6][2] * EO[2][ii] + g_t16[6][3] * EO[3][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(10 * line) + ii] = (int16_t)((g_t16[10][0] * EO[0][ii] + g_t16[10][1] * EO[1][ii] + g_t16[10][2] * EO[2][ii] + g_t16[10][3] * EO[3][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(14 * line) + ii] = (int16_t)((g_t16[14][0] * EO[0][ii] + g_t16[14][1] * EO[1][ii] + g_t16[14][2] * EO[2][ii] + g_t16[14][3] * EO[3][ii] + add) >> shift) ;} + + for(int ii=0; ii<8; ii++) { dst[(1 * line) + ii] = (int16_t)((g_t16[1][0] * O[0][ii] + g_t16[1][1] * O[1][ii] + g_t16[1][2] * O[2][ii] + g_t16[1][3] * O[3][ii] + g_t16[1][4] * O[4][ii] + g_t16[1][5] * O[5][ii] + g_t16[1][6] * O[6][ii] + g_t16[1][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(3 * line) + ii] = (int16_t)((g_t16[3][0] * O[0][ii] + g_t16[3][1] * O[1][ii] + g_t16[3][2] * O[2][ii] + g_t16[3][3] * O[3][ii] + g_t16[3][4] * O[4][ii] + g_t16[3][5] * O[5][ii] + g_t16[3][6] * O[6][ii] + g_t16[3][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(5 * line) + ii] = (int16_t)((g_t16[5][0] * O[0][ii] + g_t16[5][1] * O[1][ii] + g_t16[5][2] * O[2][ii] + g_t16[5][3] * O[3][ii] + g_t16[5][4] * O[4][ii] + g_t16[5][5] * O[5][ii] + g_t16[5][6] * O[6][ii] + g_t16[5][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(7 * line) + ii] = (int16_t)((g_t16[7][0] * O[0][ii] + g_t16[7][1] * O[1][ii] + g_t16[7][2] * O[2][ii] + g_t16[7][3] * O[3][ii] + g_t16[7][4] * O[4][ii] + g_t16[7][5] * O[5][ii] + g_t16[7][6] * O[6][ii] + g_t16[7][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(9 * line) + ii] = (int16_t)((g_t16[9][0] * O[0][ii] + g_t16[9][1] * O[1][ii] + g_t16[9][2] * O[2][ii] + g_t16[9][3] * 
O[3][ii] + g_t16[9][4] * O[4][ii] + g_t16[9][5] * O[5][ii] + g_t16[9][6] * O[6][ii] + g_t16[9][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(11 * line) + ii] = (int16_t)((g_t16[11][0] * O[0][ii] + g_t16[11][1] * O[1][ii] + g_t16[11][2] * O[2][ii] + g_t16[11][3] * O[3][ii] + g_t16[11][4] * O[4][ii] + g_t16[11][5] * O[5][ii] + g_t16[11][6] * O[6][ii] + g_t16[11][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(13 * line) + ii] = (int16_t)((g_t16[13][0] * O[0][ii] + g_t16[13][1] * O[1][ii] + g_t16[13][2] * O[2][ii] + g_t16[13][3] * O[3][ii] + g_t16[13][4] * O[4][ii] + g_t16[13][5] * O[5][ii] + g_t16[13][6] * O[6][ii] + g_t16[13][7] * O[7][ii] + add) >> shift) ;} + for(int ii=0; ii<8; ii++) { dst[(15 * line) + ii] = (int16_t)((g_t16[15][0] * O[0][ii] + g_t16[15][1] * O[1][ii] + g_t16[15][2] * O[2][ii] + g_t16[15][3] * O[3][ii] + g_t16[15][4] * O[4][ii] + g_t16[15][5] * O[5][ii] + g_t16[15][6] * O[6][ii] + g_t16[15][7] * O[7][ii] + add) >> shift) ;} + + + src += 8; + dst += 8 ; + + } +} // end partialButterfly16_transposedSrc_altivec() + + +static void dct16_altivec(const int16_t* src, int16_t* dst, intptr_t srcStride) +{ + const int shift_1st = 3 + X265_DEPTH - 8; + const int shift_2nd = 10; + + ALIGN_VAR_32(int16_t, coef[16 * 16]); + ALIGN_VAR_32(int16_t, block_transposed[16 * 16]); + ALIGN_VAR_32(int16_t, coef_transposed[16 * 16]); + + transpose_matrix_16_altivec((int16_t *)src, srcStride, (int16_t *)block_transposed, 16) ; + partialButterfly16_transposedSrc_altivec(block_transposed, coef, shift_1st) ; + + transpose_matrix_16_altivec((int16_t *)coef, 16, (int16_t *)coef_transposed, 16) ; + partialButterfly16_transposedSrc_altivec(coef_transposed, dst, shift_2nd); +} // end dct16_altivec() + + + + +static void dct32_altivec(const int16_t* src, int16_t* dst, intptr_t srcStride) +{ + const int shift_1st = 4 + X265_DEPTH - 8; + const int shift_2nd = 11; + + ALIGN_VAR_32(int16_t, coef[32 * 32]); + ALIGN_VAR_32(int16_t, block_transposed[32 * 32]); + ALIGN_VAR_32(int16_t, coef_transposed[32 * 32]); + + transpose_matrix_32_altivec((int16_t *)src, srcStride, (int16_t *)block_transposed, 32) ; + partialButterfly32_transposedSrc_altivec(block_transposed, coef, shift_1st) ; + + transpose_matrix_32_altivec((int16_t *)coef, 32, (int16_t *)coef_transposed, 32) ; + partialButterfly32_transposedSrc_altivec(coef_transposed, dst, shift_2nd); +} // end dct32_altivec() + + +namespace X265_NS { +// x265 private namespace + +void setupDCTPrimitives_altivec(EncoderPrimitives& p) +{ + p.quant = quant_altivec ; + + p.cu[BLOCK_16x16].dct = dct16_altivec ; + p.cu[BLOCK_32x32].dct = dct32_altivec ; + + p.denoiseDct = denoiseDct_altivec ; +} +}
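The file above registers AltiVec versions of quant, denoiseDct and the 16x16/32x32 forward DCTs; the DCT entry points first transpose the block (transpose_matrix_16/32_altivec) so that each partialButterfly*_transposedSrc_altivec pass can work on eight columns at a time. For orientation, here is a minimal scalar sketch of the per-coefficient quantization step that quant_altivec performs eight coefficients per iteration; it restates the inline "// for(int ii=0; ii<8; ii++)" reference comments in that function, and the standalone name quant_scalar_ref and this exact signature are illustrative only, not part of the patch.

    #include <cstdint>

    // Scalar reference for one quantization pass; quant_altivec above does
    // the same work eight coefficients at a time with vector intrinsics.
    static uint32_t quant_scalar_ref(const int16_t* coef, const int32_t* quantCoeff,
                                     int32_t* deltaU, int16_t* qCoef,
                                     int qBits, int add, int numCoeff)
    {
        const int qBits8 = qBits - 8;
        uint32_t numSig = 0;
        for (int blockpos = 0; blockpos < numCoeff; blockpos++)
        {
            int level = coef[blockpos];
            int sign  = (level < 0) ? -1 : 1;
            int tmplevel = (level < 0 ? -level : level) * quantCoeff[blockpos];
            level = (tmplevel + add) >> qBits;                       // quantized magnitude
            deltaU[blockpos] = (tmplevel - (level << qBits)) >> qBits8;
            if (level)
                ++numSig;                                            // count significant coefficients
            level *= sign;                                           // restore sign
            // clip to int16_t range, as in x265_clip3(-32768, 32767, level)
            if (level < -32768) level = -32768;
            if (level >  32767) level =  32767;
            qCoef[blockpos] = (int16_t)level;
        }
        return numSig;
    }

The vector routine accumulates the numSig counts in a vector register and reduces it with vec_sums only once at the end, which is why the scalar "if (level) ++numSig" appears as a compare-and-select in the AltiVec code.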
View file
x265_2.2.tar.gz/source/common/ppc/intrapred_altivec.cpp
Added
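The newly added intrapred_altivec.cpp below provides AltiVec specializations of intra_pred<width, dirMode> for the angular intra modes. Each specialization follows the interpolation spelled out in its inline comments, using vec_perm to gather ref[offset+x] and ref[offset+x+1] and vec_mule/vec_mulo for the weighted sum. As a reading aid, a scalar sketch of that formula is shown here; the standalone name, the 8-bit pixel type and the explicit offset/fraction arrays are assumptions for the sketch, not part of the patch.

    #include <cstdint>

    // Scalar form of the angular prediction the specializations below vectorize:
    //   dst[y*dstStride + x] = ((32 - f[y]) * ref[off[y] + x]
    //                           + f[y] * ref[off[y] + x + 1] + 16) >> 5
    // where f[y] is the per-row 1/32-pel fraction and off[y] the integer offset.
    static void intra_ang_scalar_ref(uint8_t* dst, intptr_t dstStride,
                                     const uint8_t* ref, const int* offset,
                                     const int* fraction, int width)
    {
        for (int y = 0; y < width; y++)
            for (int x = 0; x < width; x++)
                dst[y * dstStride + x] = (uint8_t)(((32 - fraction[y]) * ref[offset[y] + x]
                                                    + fraction[y] * ref[offset[y] + x + 1]
                                                    + 16) >> 5);
    }

Modes with a zero fraction (for example the width-specific mode 2 cases at the top of the file) degenerate to straight copies of the shifted reference row, which is why those specializations use only vec_perm/vec_xst with no multiplies.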
@@ -0,0 +1,30809 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Roger Moussalli <rmoussal@us.ibm.com> + * Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include <iostream> +#include <vector> +#include <assert.h> +#include <math.h> +#include <cmath> +#include <linux/types.h> +#include <stdlib.h> +#include <stdio.h> +#include <stdint.h> +#include <sys/time.h> +#include <string.h> + +#include "common.h" +#include "primitives.h" +#include "x265.h" +#include "ppccommon.h" + +//using namespace std ; +namespace X265_NS { + +/* INTRA Prediction - altivec implementation */ +template<int width, int dirMode> +void intra_pred(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter){}; + +template<> +void intra_pred<4, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + if(dstStride == 4) { + const vec_u8_t srcV = vec_xl(10, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask = {0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03,0x04, 0x02, 0x03,0x04,0x05, 0x03,0x04,0x05, 0x06}; + vec_u8_t vout = vec_perm(srcV, srcV, mask); + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_u8_t v0 = vec_xl(10, srcPix0); + vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst); + vec_u8_t v1 = vec_xl(11, srcPix0); + vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride)); + vec_u8_t v2 = vec_xl(12, srcPix0); + vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2)); + vec_u8_t v3 = vec_xl(13, srcPix0); + vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3)); + } + else{ + const vec_u8_t srcV = vec_xl(10, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srcV, vec_xl(0, dst), mask_0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srcV, vec_xl(dstStride, dst), mask_1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srcV, vec_xl(dstStride*2, dst), mask_2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srcV, vec_xl(dstStride*3, dst), mask_3); + vec_xst(v3, dstStride*3, dst); + } +#ifdef DEBUG + for (int y = 0; y < 4; 
y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + if(dstStride == 8) { + const vec_u8_t srcV1 = vec_xl(18, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03,0x04, 0x05, 0x06, 0x07, 0x08}; + const vec_u8_t mask_1 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + const vec_u8_t mask_2 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + const vec_u8_t mask_3 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e}; + vec_u8_t v0 = vec_perm(srcV1, srcV1, mask_0); + vec_u8_t v1 = vec_perm(srcV1, srcV1, mask_1); + vec_u8_t v2 = vec_perm(srcV1, srcV1, mask_2); + vec_u8_t v3 = vec_perm(srcV1, srcV1, mask_3); + vec_xst(v0, 0, dst); + vec_xst(v1, 16, dst); + vec_xst(v2, 32, dst); + vec_xst(v3, 48, dst); + } + else{ + //pixel *out = dst; + const vec_u8_t srcV1 = vec_xl(18, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_4 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_5 = {0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_6 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_7 = {0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srcV1, vec_xl(0, dst), mask_0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srcV1, vec_xl(dstStride, dst), mask_1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srcV1, vec_xl(dstStride*2, dst), mask_2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srcV1, vec_xl(dstStride*3, dst), mask_3); + vec_xst(v3, dstStride*3, dst); + vec_u8_t v4 = vec_perm(srcV1, vec_xl(dstStride*4, dst), mask_4); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(srcV1, vec_xl(dstStride*5, dst), mask_5); + vec_xst(v5, dstStride*5, dst); + vec_u8_t v6 = vec_perm(srcV1, vec_xl(dstStride*6, dst), mask_6); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(srcV1, vec_xl(dstStride*7, dst), mask_7); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + int i; + //int off = dstStride; + //const pixel *srcPix = srcPix0; + for(i=0; i<16; i++){ + vec_xst( vec_xl(34+i, srcPix0), i*dstStride, dst); /* first offset = width2+2 = width<<1 + 2*/ + } +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + 
for (int x = 0; x <16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 2>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + int i; + int off = dstStride; + //const pixel *srcPix = srcPix0; + for(i=0; i<32; i++){ + off = i*dstStride; + vec_xst( vec_xl(66+i, srcPix0), off, dst); /* first offset = width2+2 = width<<1 + 2*/ + vec_xst( vec_xl(82+i, srcPix0), off+16, dst); /* first offset = width2+2 = width<<1 + 2*/ + } +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x <32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +#define one_line(s0, s1, vf32, vf, vout) {\ +vmle0 = vec_mule(s0, vf32);\ +vmlo0 = vec_mulo(s0, vf32);\ +vmle1 = vec_mule(s1, vf);\ +vmlo1 = vec_mulo(s1, vf);\ +vsume = vec_add(vec_add(vmle0, vmle1), u16_16);\ +ve = vec_sra(vsume, u16_5);\ +vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);\ +vo = vec_sra(vsumo, u16_5);\ +vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));\ +} + +template<> +void intra_pred<4, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x1, 0x2, 0x3, 0x4, 0x2, 0x3, 0x4, 0x5, 0x3, 0x4, 0x5, 0x6}; + vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x2, 0x3, 0x4, 0x5, 0x3, 0x4, 0x5, 0x6, 0x4, 0x5, 0x6, 0x7}; + + vec_u8_t vfrac4 = (vec_u8_t){26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8}; + vec_u8_t vfrac4_32 = (vec_u8_t){6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24}; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 
0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7}; + vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8}; + vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9}; + vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa}; + vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb}; + vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc}; + vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd}; + vec_u8_t mask7={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe}; + //vec_u8_t mask8={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf}; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + + vec_u8_t vfrac8 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 26, 20, 14, 8, 2, 28, 22, 16}; + vec_u8_t vfrac8_32 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 6, 12, 18, 24, 30, 4, 10, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 
= vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32); + vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd}; +vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe}; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf}; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10}; +vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 
0xd, 0xe, 0xf, 0x10, 0x11}; +vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12}; +vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13}; +vec_u8_t mask7={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14}; +vec_u8_t mask8={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; +vec_u8_t mask9={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; +vec_u8_t mask10={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; +vec_u8_t mask11={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; +vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; +vec_u8_t mask13={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; +vec_u8_t mask14={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; +vec_u8_t mask15={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0); + +vec_u8_t vfrac16 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; +vec_u8_t vfrac16_32 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, 
vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 3>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + +vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +vec_u8_t mask16_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, }; +vec_u8_t mask16_1={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask16_2={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask16_3={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask16_4={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, }; +vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask16_5={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask16_6={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask7={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask16_7={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask8={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask16_8={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask9={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask16_9={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask10={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask16_10={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 
0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask11={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask16_11={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask16_12={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask13={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask16_13={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask14={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask16_14={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask15={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask16_15={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; + +/*vec_u8_t mask16={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +vec_u8_t mask16_16={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask17={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, }; +vec_u8_t mask16_17={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask18={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask16_18={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask19={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask16_19={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +vec_u8_t mask20={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask16_20={0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, }; +vec_u8_t mask21={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask16_21={0x2, 0x3, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask22={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask16_22={0x3, 0x4, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask23={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask16_23={0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask24={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask16_24={0x5, 0x6, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask25={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask16_25={0x6, 0x7, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask26={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask16_26={0x7, 0x8, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t 
mask27={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask16_27={0x8, 0x9, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask28={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask16_28={0x9, 0xa, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask29={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask16_29={0xa, 0xb, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask30={0xe, 0xf, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask16_30={0xb, 0xc, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask31={0xf, 0x10, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask16_31={0xc, 0xd, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, };*/ + +vec_u8_t maskadd1_31={0x0, 0x1, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +vec_u8_t maskadd1_16_31={0xd, 0xe, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + +/* + printf("source:\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+65]); + } + printf("\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+97]); + } + printf("\n\n"); +*/ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0); + vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1); + vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2); + vec_u8_t srv16_3 = vec_perm(sv1, sv2, mask16_3); + vec_u8_t srv16_4 = vec_perm(sv1, sv2, mask16_4); + vec_u8_t srv16_5 = vec_perm(sv1, sv2, mask16_5); + vec_u8_t srv16_6 = vec_perm(sv1, sv2, mask16_6); + vec_u8_t srv16_7 = vec_perm(sv1, sv2, mask16_7); + vec_u8_t srv16_8 = vec_perm(sv1, sv2, mask16_8); + vec_u8_t srv16_9 = vec_perm(sv1, sv2, mask16_9); + vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10); + vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11); + vec_u8_t srv16_12 = 
vec_perm(sv1, sv2, mask16_12); + vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13); + vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14); + vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15); + + vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31); + + + vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */ + vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1); + vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2); + vec_u8_t srv16_19 = vec_perm(sv2, sv3, mask16_3); + vec_u8_t srv16_20 = vec_perm(sv2, sv3, mask16_4); + vec_u8_t srv16_21 = vec_perm(sv2, sv3, mask16_5); + vec_u8_t srv16_22 = vec_perm(sv2, sv3, mask16_6); + vec_u8_t srv16_23 = vec_perm(sv2, sv3, mask16_7); + vec_u8_t srv16_24 = vec_perm(sv2, sv3, mask16_8); + vec_u8_t srv16_25 = vec_perm(sv2, sv3, mask16_9); + vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10); + vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11); + vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12); + vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13); + vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14); + vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15); + vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31); + + +vec_u8_t vfrac32_0 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; +vec_u8_t vfrac32_1 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; +vec_u8_t vfrac32_32_0 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 32}; +vec_u8_t vfrac32_32_1 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, 
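+/* Each 32-pixel row is assembled from two 16-lane halves: the vout fed by the
+ * srvN gathers covers x = 0..15 and the one fed by the srv16_N gathers covers
+ * x = 16..31. With ref = srcPix0 + 65 (where the sv0..sv3 loads start) and
+ * the offset[] / fraction[] tables quoted above, each pair of one_line()
+ * calls for a row y is the vector form of:
+ *   for (int x = 0; x < 32; x++)
+ *       dst[y * dstStride + x] = (pixel)(((32 - fraction[x]) * ref[offset[x] + y]
+ *                                + fraction[x] * ref[offset[x] + y + 1] + 16) >> 5);
+ * Rows 0..15 are computed and stored first, then the same vout_* registers
+ * are reused for rows 16..31.
+ */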
srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, 
vout_16); + one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5}; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5, 0x4, 0x5, 0x5, 0x6}; + +vec_u8_t vfrac4 = (vec_u8_t){21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20}; +vec_u8_t vfrac4_32 = (vec_u8_t){11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12}; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - 
fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, }; +vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, }; +vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, }; +vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, }; +vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, }; +vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, }; +vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, }; +//vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + + //mode 4, mode32 + //int 
offset_4[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + //int fraction_4[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0}; + +vec_u8_t vfrac8 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 21, 10, 31, 20, 9, 30, 19, 8, }; +vec_u8_t vfrac8_32 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 11, 22, 1, 12, 23, 2, 13, 24, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32); + vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, 
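+/* Each vout_K holds two 8-pixel rows. When dstStride != 8 the rows are not
+ * contiguous, so every row becomes a 16-byte read-modify-write: vec_xl()
+ * fetches the current destination bytes, vec_perm() with v_mask0/v_mask1
+ * splices the 8 new pixels in front of the 8 bytes already present at
+ * dst + 8, and vec_xst() writes the merged vector back - per row r it is a
+ * vector version of:
+ *   memcpy(dst + r * dstStride, row_r, 8);   // row_r = low/high half of vout_{r/2}
+ */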
vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, }; +vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, }; +vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, }; +vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, }; +vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, }; +vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, }; +vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, }; +vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, }; +vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, }; +vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, }; +vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, }; +vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, }; +vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, }; +vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x18, 0x19, }; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);; + +vec_u8_t vfrac16 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, 
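+/* Scalar reference for this 16x16 block (ref = srcPix0 + 33, the array the
+ * sv0/sv1 loads start from; offset_4[] / fraction_4[] as quoted in the 8x8
+ * path above):
+ *   for (int y = 0; y < 16; y++)
+ *       for (int x = 0; x < 16; x++)
+ *           dst[y * dstStride + x] =
+ *               (pixel)(((32 - fraction_4[x]) * ref[offset_4[x] + y]
+ *                        + fraction_4[x] * ref[offset_4[x] + y + 1] + 16) >> 5);
+ */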
ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 4>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + ... + dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5); + ... + dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + + .... 
+ y=16; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5); + dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5); + dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5); + ... + dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5); + ... + dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5); + + .... + y=31; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5); + dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5); + dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5); + ... + dst[3 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask16_0={0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask16_1={0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, }; +vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, }; +vec_u8_t mask16_2={0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, }; +vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, }; +vec_u8_t mask16_3={0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, }; +vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, }; +vec_u8_t mask16_4={0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, }; +vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, }; +vec_u8_t mask16_5={0x0, 0x0, 0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, }; +vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, }; +vec_u8_t mask16_6={0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, }; +vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, }; +vec_u8_t mask16_7={0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, }; +vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, }; +vec_u8_t mask16_8={0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, }; +vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, }; +vec_u8_t mask16_9={0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, }; +vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, }; +vec_u8_t mask16_10={0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 
0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, }; +vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, }; +vec_u8_t mask16_11={0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, }; +vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, }; +vec_u8_t mask16_12={0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, }; +vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, }; +vec_u8_t mask16_13={0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, }; +vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, }; +vec_u8_t mask16_14={0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, }; +vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x18, 0x19, }; +vec_u8_t mask16_15={0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, }; +/*vec_u8_t mask16={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask16_16={0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, }; +vec_u8_t mask17={0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask16_17={0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, }; +vec_u8_t mask18={0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, }; +vec_u8_t mask16_18={0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, }; +vec_u8_t mask19={0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, }; +vec_u8_t mask16_19={0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, }; +vec_u8_t mask20={0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, }; +vec_u8_t mask16_20={0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, 0x19, }; +vec_u8_t mask21={0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, }; +vec_u8_t mask16_21={0x0, 0x0, 0x1, 0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, }; +vec_u8_t mask22={0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, }; +vec_u8_t mask16_22={0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, }; +vec_u8_t mask23={0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, }; +vec_u8_t mask16_23={0x2, 0x2, 0x3, 0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, }; +vec_u8_t mask24={0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, }; +vec_u8_t mask16_24={0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, }; +vec_u8_t mask25={0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, }; +vec_u8_t mask16_25={0x4, 0x4, 0x5, 0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, }; +vec_u8_t mask26={0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, }; +vec_u8_t mask16_26={0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, }; +vec_u8_t mask27={0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 
0x12, 0x12, 0x13, 0x14, 0x14, 0x15, }; +vec_u8_t mask16_27={0x6, 0x6, 0x7, 0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, }; +vec_u8_t mask28={0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, }; +vec_u8_t mask16_28={0x7, 0x7, 0x8, 0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, }; +vec_u8_t mask29={0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, }; +vec_u8_t mask16_29={0x8, 0x8, 0x9, 0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, }; +vec_u8_t mask30={0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, 0x15, 0x16, 0x17, 0x17, 0x18, }; +vec_u8_t mask16_30={0x9, 0x9, 0xa, 0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, }; +vec_u8_t mask31={0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, 0x14, 0x15, 0x16, 0x16, 0x17, 0x18, 0x18, 0x19, }; +vec_u8_t mask16_31={0xa, 0xa, 0xb, 0xc, 0xc, 0xd, 0xe, 0xe, 0xf, 0x10, 0x10, 0x11, 0x12, 0x12, 0x13, 0x14, };*/ +vec_u8_t maskadd1_31={0x0, 0x1, 0x1, 0x2, 0x3, 0x3, 0x4, 0x5, 0x5, 0x6, 0x7, 0x7, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t maskadd1_16_31={0xb, 0xb, 0xc, 0xd, 0xd, 0xe, 0xf, 0xf, 0x10, 0x11, 0x11, 0x12, 0x13, 0x13, 0x14, 0x15, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + +/* + printf("source:\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+65]); + } + printf("\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+97]); + } + printf("\n\n"); +*/ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0); + vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1); + vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2); + vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3); + vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4); + vec_u8_t srv16_5 = vec_perm(sv1, sv2, mask16_5); + vec_u8_t srv16_6 = vec_perm(sv1, sv2, mask16_6); + vec_u8_t srv16_7 = vec_perm(sv1, sv2, mask16_7); + vec_u8_t srv16_8 = vec_perm(sv1, sv2, mask16_8); + vec_u8_t srv16_9 = vec_perm(sv1, sv2, mask16_9); + vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10); + vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11); + vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12); + vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13); + vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14); + vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15); + + vec_u8_t 
srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31); + + + vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */ + vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1); + vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2); + vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3); + vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4); + vec_u8_t srv16_21 = vec_perm(sv2, sv3, mask16_5); + vec_u8_t srv16_22 = vec_perm(sv2, sv3, mask16_6); + vec_u8_t srv16_23 = vec_perm(sv2, sv3, mask16_7); + vec_u8_t srv16_24 = vec_perm(sv2, sv3, mask16_8); + vec_u8_t srv16_25 = vec_perm(sv2, sv3, mask16_9); + vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10); + vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11); + vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12); + vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13); + vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14); + vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15); + vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31); + + +vec_u8_t vfrac32_0 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){27, 6, 17, 28, 7, 18, 29, 8, 19, 30, 9, 20, 31, 10, 21, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16_8, 
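+/* Unlike mode 3, fraction_4[] does not repeat after 16 entries, so the left
+ * half of each row uses vfrac32_0 = fraction_4[0..15] while the right half
+ * uses vfrac32_1 = fraction_4[16..31], with vfrac32_32_0 / vfrac32_32_1
+ * holding the matching 32 - fraction_4[] weights.
+ */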
vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, 
vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x2, 0x3, 0x3, 0x4, 0x3, 0x4, 0x4, 0x5, 0x4, 0x5, 0x5, 0x6, }; + +vec_u8_t vfrac4 = (vec_u8_t){17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, }; +vec_u8_t vfrac4_32 = (vec_u8_t){15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, }; + + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * 
ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, }; +vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, }; +vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, }; +vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, }; +vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, }; +//vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 17, 2, 19, 4, 21, 6, 23, 8, }; +vec_u8_t vfrac8_32 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 15, 30, 13, 28, 11, 26, 9, 24, }; + + /* dst[y * dstStride 
+ x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32); + vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 
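+/* Weight layout note: in every one of these tables the "_32" vector is the
+ * per-lane complement of its partner (vfrac..._32[i] == 32 - vfrac...[i]),
+ * i.e. the (32 - fraction) and fraction factors of the filter. Both pixel
+ * inputs are 8-bit, so the 16-bit sum (32 - f)*a + f*b + 16 never exceeds
+ * 32*255 + 16 and the ">> 5" result fits back into a byte, which is why the
+ * final vec_pack() is lossless.
+ */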
0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, }; +vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, }; +vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, }; +vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, }; +vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, }; +vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, }; +vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, }; +vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, }; +vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, }; +vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, }; +vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, }; +vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, }; +vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x16, 0x17, }; +//vec_u8_t mask16={0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x16, 0x17, 0x17, 0x18, }; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0); + +vec_u8_t vfrac16 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, 
vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 5>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + ... + dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5); + ... + dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + + .... + y=16; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5); + dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5); + dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5); + ... + dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5); + ... + dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5); + + .... 
+ y=31; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5); + dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5); + dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5); + ... + dst[3 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5); + } + */ +vec_u8_t mask0={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, }; +vec_u8_t mask16_0={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x11, }; +vec_u8_t mask1={0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, }; +vec_u8_t mask16_1={0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x12, }; +vec_u8_t mask2={0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask16_2={0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x13, }; +vec_u8_t mask3={0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask16_3={0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x14, }; +vec_u8_t mask4={0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, }; +vec_u8_t mask16_4={0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x15, }; +vec_u8_t mask5={0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, }; +vec_u8_t mask16_5={0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x16, }; +vec_u8_t mask6={0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, }; +vec_u8_t mask16_6={0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x17, }; +vec_u8_t mask7={0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, }; +vec_u8_t mask16_7={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x8, }; +vec_u8_t mask8={0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, }; +vec_u8_t mask16_8={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x9, }; +vec_u8_t mask9={0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, }; +vec_u8_t mask16_9={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0xa, }; +vec_u8_t mask10={0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, }; +vec_u8_t mask16_10={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xb, }; +vec_u8_t mask11={0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, }; +vec_u8_t mask16_11={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xc, }; +vec_u8_t mask12={0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, }; +vec_u8_t mask16_12={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xd, }; +vec_u8_t mask13={0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, }; +vec_u8_t mask16_13={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xe, }; +vec_u8_t mask14={0xe, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, }; +vec_u8_t mask16_14={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, 
0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xf, }; +vec_u8_t mask15={0xf, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, 0x16, 0x17, }; +vec_u8_t mask16_15={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_31={0x0, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, }; +vec_u8_t maskadd1_16_31={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, 0x11, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + +/* + printf("source:\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+65]); + } + printf("\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+97]); + } + printf("\n\n"); +*/ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0); + vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1); + vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2); + vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3); + vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4); + vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5); + vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6); + vec_u8_t srv16_7 = vec_perm(sv1, sv2, mask16_7); + vec_u8_t srv16_8 = vec_perm(sv1, sv2, mask16_8); + vec_u8_t srv16_9 = vec_perm(sv1, sv2, mask16_9); + vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10); + vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11); + vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12); + vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13); + vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14); + vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15); + + vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + 
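/* A scalar sketch of what each of these vectorised angular modes computes,
 * following the formula quoted in the comments above (assumptions: `pixel` is
 * the 8-bit type here, and offset[] / f[] stand for the mode-dependent per-lane
 * tables that the mask and vfrac vectors encode):
 *
 *   for (int y = 0; y < width; y++)
 *       for (int x = 0; x < width; x++)
 *           dst[y * dstStride + x] =
 *               (pixel)(((32 - f[y]) * ref[offset[y] + x]
 *                        + f[y] * ref[offset[y] + x + 1] + 16) >> 5);
 */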
vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31); + + + vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */ + vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1); + vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2); + vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3); + vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4); + vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5); + vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6); + vec_u8_t srv16_23 = vec_perm(sv2, sv3, mask16_7); + vec_u8_t srv16_24 = vec_perm(sv2, sv3, mask16_8); + vec_u8_t srv16_25 = vec_perm(sv2, sv3, mask16_9); + vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10); + vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11); + vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12); + vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13); + vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14); + vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15); + vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31); + + +vec_u8_t vfrac32_0 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){31, 14, 29, 12, 27, 10, 25, 8, 23, 6, 21, 4, 19, 2, 17, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, 
vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29); + + 
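/* one_line(s0, s1, vf32, vf, vout) is presumably the helper macro defined
 * earlier in this file; the 4x4 and 8x8 bodies above spell the same sequence
 * out by hand.  A sketch of the assumed expansion (vmle0..vo, u16_16 and u16_5
 * are the locals declared before the first call):
 *
 *   vmle0 = vec_mule(s0, vf32);   // (32 - f) * ref[off + x], even lanes
 *   vmlo0 = vec_mulo(s0, vf32);   //                          odd lanes
 *   vmle1 = vec_mule(s1, vf);     // f * ref[off + x + 1],    even lanes
 *   vmlo1 = vec_mulo(s1, vf);     //                          odd lanes
 *   vsume = vec_add(vec_add(vmle0, vmle1), u16_16);
 *   ve    = vec_sra(vsume, u16_5);                  // (sum + 16) >> 5
 *   vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);
 *   vo    = vec_sra(vsumo, u16_5);
 *   vout  = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); // re-interleave, narrow to bytes
 */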
one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, }; +vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, }; + + +vec_u8_t vfrac4 = (vec_u8_t){13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, }; + +vec_u8_t vfrac4_32 = (vec_u8_t){19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + 
vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, }; +vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, }; +vec_u8_t mask2={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, }; +vec_u8_t mask3={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, }; +vec_u8_t mask4={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, }; +vec_u8_t mask5={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, }; +vec_u8_t mask6={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, }; +vec_u8_t mask7={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, }; +//vec_u8_t mask8={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 13, 26, 7, 20, 1, 14, 27, 8, }; +vec_u8_t vfrac8_32 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 19, 6, 25, 12, 31, 18, 5, 24, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 
= vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32); + vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, }; +vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, }; +vec_u8_t mask2={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, }; +vec_u8_t mask3={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, }; +vec_u8_t mask4={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, }; +vec_u8_t mask5={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, }; +vec_u8_t mask6={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, }; 
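/* The maskN constants drive vec_perm: for each of the 16 result lanes,
 * vec_perm(a, b, m) picks the byte at index m[i] (0-31) of the 32-byte
 * concatenation a|b, so a mask whose lanes hold the mode's precomputed sample
 * indices turns the scalar gather ref[offset + x] into a single shuffle over
 * the two vec_xl loads.  A scalar model of that selection (assuming the
 * big-endian lane order used here):
 *
 *   static inline void perm_bytes(const uint8_t a[16], const uint8_t b[16],
 *                                 const uint8_t m[16], uint8_t out[16])
 *   {
 *       for (int i = 0; i < 16; i++)
 *           out[i] = (m[i] & 0x10) ? b[m[i] & 0x0f] : a[m[i] & 0x0f];
 *   }
 */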
+vec_u8_t mask7={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, }; +vec_u8_t mask8={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, }; +vec_u8_t mask9={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, }; +vec_u8_t mask10={0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, }; +vec_u8_t mask11={0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, }; +vec_u8_t mask12={0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, }; +vec_u8_t mask13={0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, }; +vec_u8_t mask14={0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, }; +vec_u8_t mask15={0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, }; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0); + +vec_u8_t vfrac16 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, 
dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 6>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + ... + dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5); + ... + dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + + .... + y=16; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5); + dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5); + dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5); + ... + dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5); + ... + dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5); + + .... + y=31; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5); + dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5); + dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5); + ... 
+ dst[3 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5); + } + */ +vec_u8_t mask0={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, }; +vec_u8_t mask16_0={0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, }; +vec_u8_t mask1={0x1, 0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, }; +vec_u8_t mask16_1={0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xe, }; +vec_u8_t mask2={0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, }; +vec_u8_t mask16_2={0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xf, }; +vec_u8_t mask3={0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, }; +vec_u8_t mask16_3={0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0x10, }; +vec_u8_t mask4={0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, }; +vec_u8_t mask16_4={0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x11, }; +vec_u8_t mask5={0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, }; +vec_u8_t mask16_5={0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x12, }; +vec_u8_t mask6={0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, }; +vec_u8_t mask16_6={0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x13, }; +vec_u8_t mask7={0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, }; +vec_u8_t mask16_7={0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x14, }; +vec_u8_t mask8={0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, }; +vec_u8_t mask16_8={0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, 0x15, }; +vec_u8_t mask9={0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, }; +vec_u8_t mask16_9={0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, 0x16, }; +vec_u8_t mask10={0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, }; +vec_u8_t mask16_10={0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x7, }; +vec_u8_t mask11={0xb, 0xb, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, }; +vec_u8_t mask16_11={0x1, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x8, }; +vec_u8_t mask12={0xc, 0xc, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, }; +vec_u8_t mask16_12={0x2, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x9, }; +vec_u8_t mask13={0xd, 0xd, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, }; +vec_u8_t mask16_13={0x3, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0xa, }; +vec_u8_t mask14={0xe, 0xe, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, 0x13, 0x14, 0x14, }; +vec_u8_t mask16_14={0x4, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xb, }; +vec_u8_t mask15={0xf, 0xf, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, 0x14, 0x15, 0x15, }; +vec_u8_t mask16_15={0x5, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xb, 0xb, 0xc, }; +vec_u8_t maskadd1_31={0x0, 0x0, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x4, 0x4, 
0x4, 0x5, 0x5, 0x6, 0x6, }; +vec_u8_t maskadd1_16_31={0x6, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xc, 0xc, 0xd, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + +/* + printf("source:\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+65]); + } + printf("\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+97]); + } + printf("\n\n"); +*/ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0); + vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1); + vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2); + vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3); + vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4); + vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5); + vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6); + vec_u8_t srv16_7 = vec_perm(sv0, sv1, mask16_7); + vec_u8_t srv16_8 = vec_perm(sv0, sv1, mask16_8); + vec_u8_t srv16_9 = vec_perm(sv0, sv1, mask16_9); + vec_u8_t srv16_10 = vec_perm(sv1, sv2, mask16_10); + vec_u8_t srv16_11 = vec_perm(sv1, sv2, mask16_11); + vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12); + vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13); + vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14); + vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15); + + vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31); + + + vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */ + vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1); + vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2); + vec_u8_t srv16_19 = 
vec_perm(sv1, sv2, mask16_3); + vec_u8_t srv16_20 = vec_perm(sv1, sv2, mask16_4); + vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5); + vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6); + vec_u8_t srv16_23 = vec_perm(sv1, sv2, mask16_7); + vec_u8_t srv16_24 = vec_perm(sv1, sv2, mask16_8); + vec_u8_t srv16_25 = vec_perm(sv1, sv2, mask16_9); + vec_u8_t srv16_26 = vec_perm(sv2, sv3, mask16_10); + vec_u8_t srv16_27 = vec_perm(sv2, sv3, mask16_11); + vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12); + vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13); + vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14); + vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15); + vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){3, 22, 9, 28, 15, 2, 21, 8, 27, 14, 1, 20, 7, 26, 13, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, 
srv16_16, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, 
dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, }; + +vec_u8_t vfrac4 = (vec_u8_t){9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, }; + +vec_u8_t vfrac4_32 = (vec_u8_t){23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, }; + + + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 
0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, }; +//vec_u8_t mask8={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 9, 18, 27, 4, 13, 22, 31, 8, }; +vec_u8_t vfrac8_32 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 23, 14, 5, 28, 19, 10, 1, 24, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, 
vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32); + vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xb, 
0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, }; +vec_u8_t mask11={0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x12, 0x12, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x12, 0x13, 0x13, }; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0); + +vec_u8_t vfrac16 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + 
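/* Note on this store run: for the 16-wide block each vout_k produced by one_line()
   above is one complete output row, and the unaligned vec_xst() places row k at byte
   offset k * dstStride (8-bit pixel build assumed).  A scalar sketch of one such row,
   with illustrative names f[x] = ((x + 1) * 9) & 31 and off[x] = ((x + 1) * 9) >> 5
   that are not in the patch:
       for (int x = 0; x < 16; x++)
           dst[k * dstStride + x] =
               (pixel)(((32 - f[x]) * ref[off[x] + k] + f[x] * ref[off[x] + k + 1] + 16) >> 5);
   where ref stands for the neighbour run loaded at srcPix0 + 33 above. */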
vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 7>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + ... + dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5); + ... + dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + + .... + y=16; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5); + dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5); + dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5); + ... + dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5); + ... + dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5); + + .... + y=31; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5); + dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5); + dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5); + ... 
+ dst[3 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5); + } + */ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, }; +vec_u8_t mask16_0={0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, }; +vec_u8_t mask16_1={0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, }; +vec_u8_t mask16_2={0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, }; +vec_u8_t mask16_3={0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, }; +vec_u8_t mask16_4={0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, }; +vec_u8_t mask16_5={0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, }; +vec_u8_t mask16_6={0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, }; +vec_u8_t mask16_7={0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, }; +vec_u8_t mask16_8={0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, }; +vec_u8_t mask16_9={0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, }; +vec_u8_t mask16_10={0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x13, }; +vec_u8_t mask11={0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, }; +vec_u8_t mask16_11={0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x12, 0x13, 0x13, 0x13, 0x14, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, }; +vec_u8_t mask16_12={0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, }; +vec_u8_t mask16_13={0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x12, 0x12, }; +vec_u8_t mask16_14={0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x12, 0x12, 0x12, 0x12, 0x13, 0x13, }; +vec_u8_t mask16_15={0x3, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x8, }; +vec_u8_t maskadd1_31={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, }; +vec_u8_t 
maskadd1_16_31={0x4, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x9, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + +/* + printf("source:\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+65]); + } + printf("\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+97]); + } + printf("\n\n"); +*/ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0); + vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1); + vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2); + vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3); + vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4); + vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5); + vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6); + vec_u8_t srv16_7 = vec_perm(sv0, sv1, mask16_7); + vec_u8_t srv16_8 = vec_perm(sv0, sv1, mask16_8); + vec_u8_t srv16_9 = vec_perm(sv0, sv1, mask16_9); + vec_u8_t srv16_10 = vec_perm(sv0, sv1, mask16_10); + vec_u8_t srv16_11 = vec_perm(sv0, sv1, mask16_11); + vec_u8_t srv16_12 = vec_perm(sv1, sv2, mask16_12); + vec_u8_t srv16_13 = vec_perm(sv1, sv2, mask16_13); + vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14); + vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15); + + vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31); + + + vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */ + vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1); + vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2); + vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3); + vec_u8_t 
srv16_20 = vec_perm(sv1, sv2, mask16_4); + vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5); + vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6); + vec_u8_t srv16_23 = vec_perm(sv1, sv2, mask16_7); + vec_u8_t srv16_24 = vec_perm(sv1, sv2, mask16_8); + vec_u8_t srv16_25 = vec_perm(sv1, sv2, mask16_9); + vec_u8_t srv16_26 = vec_perm(sv1, sv2, mask16_10); + vec_u8_t srv16_27 = vec_perm(sv1, sv2, mask16_11); + vec_u8_t srv16_28 = vec_perm(sv2, sv3, mask16_12); + vec_u8_t srv16_29 = vec_perm(sv2, sv3, mask16_13); + vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14); + vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15); + vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31); + + +vec_u8_t vfrac32_0 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){7, 30, 21, 12, 3, 26, 17, 8, 31, 22, 13, 4, 27, 18, 9, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31); + 
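/* Each 32-wide row above is produced as two 16-lane halves: vout_{2r} holds columns
   0-15 (weights vfrac32_0 / vfrac32_32_0) and vout_{2r+1} holds columns 16-31
   (weights vfrac32_1 / vfrac32_32_1), with the srv16_* permutes supplying the larger
   reference offsets of the right half.  The constants continue the same angular rule
   as the 16-wide case (assuming intraPredAngle = 9 for mode 7): fraction
   ((x + 1) * 9) & 31 and whole-sample offset ((x + 1) * 9) >> 5, evaluated here for
   x = 16..31. */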
+ vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); 
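/* Rows 16-31 reuse the same vout_0..vout_31 registers (operands srv16..srv32 and
   srv16_16..srv16_32); as in the first half, each row r takes two stores, at byte
   offsets r * dstStride and r * dstStride + 16 for its left and right 16 columns. */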
+ vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, }; + +vec_u8_t vfrac4 = (vec_u8_t){5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, }; +vec_u8_t vfrac4_32 = (vec_u8_t){27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 
0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, }; +//vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 5, 10, 15, 20, 25, 30, 3, 8, }; +vec_u8_t vfrac8_32 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 27, 22, 17, 12, 7, 2, 29, 24, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = 
vec_mule(srv4, vfrac8_32); + vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 
0xc, }; +vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, }; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);; + +vec_u8_t vfrac16 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); 
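/* The constant tables of this <16, 8> specialisation come from intraPredAngle = 5
   (mode 8): vfrac16[x] = ((x + 1) * 5) & 31, vfrac16_32[x] = 32 - vfrac16[x], and
   mask_y[x] = (((x + 1) * 5) >> 5) + y, which stays inside the two 16-byte neighbour
   loads sv0/sv1.  A sketch that would regenerate them (frac/off are illustrative
   names, not part of the patch):
       for (int x = 0; x < 16; x++) {
           frac[x] = ((x + 1) * 5) & 31;   // 5, 10, 15, ... 1, 6, 11, 16
           off[x]  = ((x + 1) * 5) >> 5;   // stays 0..2 across the 16 columns
       }
*/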
+ +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 8>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + ... + dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5); + ... + dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + + .... + y=16; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5); + dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5); + dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5); + ... + dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5); + ... + dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5); + + .... + y=31; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5); + dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5); + dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5); + ... 
+ dst[3 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5); + } + */ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask16_0={0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask16_1={0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask16_2={0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask16_3={0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask16_4={0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask16_5={0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask16_6={0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask16_7={0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask16_8={0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask16_9={0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask16_10={0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, }; +vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask16_11={0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask16_12={0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, }; +vec_u8_t mask16_13={0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x12, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, }; +vec_u8_t mask16_14={0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x11, 0x11, 0x11, 0x11, }; +vec_u8_t mask16_15={0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, }; +vec_u8_t maskadd1_31={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t maskadd1_16_31={0x2, 0x2, 0x2, 
0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + //vec_u8_t sv4 =vec_xl(129, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + +/* + printf("source:\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+65]); + } + printf("\n"); + for(int i=0; i<32; i++){ + printf("%d ", srcPix0[i+97]); + } + printf("\n\n"); +*/ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16_0 = vec_perm(sv0, sv1, mask16_0); + vec_u8_t srv16_1 = vec_perm(sv0, sv1,mask16_1); + vec_u8_t srv16_2 = vec_perm(sv0, sv1, mask16_2); + vec_u8_t srv16_3 = vec_perm(sv0, sv1, mask16_3); + vec_u8_t srv16_4 = vec_perm(sv0, sv1, mask16_4); + vec_u8_t srv16_5 = vec_perm(sv0, sv1, mask16_5); + vec_u8_t srv16_6 = vec_perm(sv0, sv1, mask16_6); + vec_u8_t srv16_7 = vec_perm(sv0, sv1, mask16_7); + vec_u8_t srv16_8 = vec_perm(sv0, sv1, mask16_8); + vec_u8_t srv16_9 = vec_perm(sv0, sv1, mask16_9); + vec_u8_t srv16_10 = vec_perm(sv0, sv1, mask16_10); + vec_u8_t srv16_11 = vec_perm(sv0, sv1, mask16_11); + vec_u8_t srv16_12 = vec_perm(sv0, sv1, mask16_12); + vec_u8_t srv16_13 = vec_perm(sv0, sv1, mask16_13); + vec_u8_t srv16_14 = vec_perm(sv1, sv2, mask16_14); + vec_u8_t srv16_15 = vec_perm(sv1, sv2, mask16_15); + + vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, maskadd1_31); + + + vec_u8_t srv16_16= vec_perm(sv1, sv2, mask16_0); /* mask16_16 == mask16_0 */ + vec_u8_t srv16_17= vec_perm(sv1, sv2, mask16_1); + vec_u8_t srv16_18 = vec_perm(sv1, sv2, mask16_2); + vec_u8_t srv16_19 = vec_perm(sv1, sv2, mask16_3); + vec_u8_t srv16_20 = vec_perm(sv1, sv2, 
mask16_4); + vec_u8_t srv16_21 = vec_perm(sv1, sv2, mask16_5); + vec_u8_t srv16_22 = vec_perm(sv1, sv2, mask16_6); + vec_u8_t srv16_23 = vec_perm(sv1, sv2, mask16_7); + vec_u8_t srv16_24 = vec_perm(sv1, sv2, mask16_8); + vec_u8_t srv16_25 = vec_perm(sv1, sv2, mask16_9); + vec_u8_t srv16_26 = vec_perm(sv1, sv2, mask16_10); + vec_u8_t srv16_27 = vec_perm(sv1, sv2, mask16_11); + vec_u8_t srv16_28 = vec_perm(sv1, sv2, mask16_12); + vec_u8_t srv16_29 = vec_perm(sv1, sv2, mask16_13); + vec_u8_t srv16_30 = vec_perm(sv2, sv3, mask16_14); + vec_u8_t srv16_31 = vec_perm(sv2, sv3, mask16_15); + vec_u8_t srv16_32 = vec_perm(sv2, sv3, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){11, 6, 1, 28, 23, 18, 13, 8, 3, 30, 25, 20, 15, 10, 5, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16_1, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16_2, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16_3, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16_4, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16_5, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16_6, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16_7, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16_8, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16_9, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16_10, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16_11, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16_12, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16_13, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16_14, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16_15, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16_16, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + 
vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16_17, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16_18, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16_19, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16_20, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16_21, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16_22, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16_23, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16_24, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16_25, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16_26, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16_27, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16_28, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16_29, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16_30, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16_31, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16_32, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, 
dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, }; + +vec_u8_t vfrac4 = (vec_u8_t){2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, }; + +vec_u8_t vfrac4_32 = (vec_u8_t){30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, }; + + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 
0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, }; +//vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 2, 4, 6, 8, 10, 12, 14, 16, }; +vec_u8_t vfrac8_32 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 30, 28, 26, 24, 22, 20, 18, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32); + vmlo0 = vec_mulo(srv2, vfrac8_32); + vmle1 = vec_mule(srv3, vfrac8); + vmlo1 = vec_mulo(srv3, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32); + 
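/* Editorial note, not part of the original patch: vec_mule/vec_mulo return the 16-bit
   products of the even- and odd-indexed bytes; after the +16 rounding bias and the >>5
   shift, vec_mergeh/vec_mergel re-interleave the two halves and vec_pack narrows the
   result back to 8-bit pixels. */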
vmlo0 = vec_mulo(srv4, vfrac8_32); + vmle1 = vec_mule(srv5, vfrac8); + vmlo1 = vec_mulo(srv5, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32); + vmlo0 = vec_mulo(srv6, vfrac8_32); + vmle1 = vec_mule(srv7, vfrac8); + vmlo1 = vec_mulo(srv7, vfrac8); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, }; +vec_u8_t 
mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, }; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + vec_u8_t srv00 = vec_perm(sv1, sv1, mask0);; + +vec_u8_t vfrac16 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 0, }; +vec_u8_t vfrac16_32 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv2, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv3, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv4, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv5, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv6, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv7, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv8, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv9, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srva, vfrac16_32, vfrac16, vout_9); + one_line(srva, srvb, vfrac16_32, vfrac16, vout_10); + one_line(srvb, srvc, vfrac16_32, vfrac16, vout_11); + one_line(srvc, srvd, vfrac16_32, vfrac16, vout_12); + one_line(srvd, srve, vfrac16_32, vfrac16, vout_13); + one_line(srve, srvf, vfrac16_32, vfrac16, vout_14); + one_line(srvf, srv00, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 
0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 9>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[0 * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[1 * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[2 * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + ... + dst[16 * dstStride + 0] = (pixel)((f32[16]* ref[off16 + 0] + f[16] * ref[off16 + 1] + 16) >> 5); + ... + dst[31 * dstStride + 0] = (pixel)((f32[31]* ref[off31 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[0 * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[1 * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[2 * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[3 * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[0 * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[1 * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[2 * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[3 * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + + .... + y=16; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 16] = (pixel)((f32[0]* ref[off0 + 16] + f[0] * ref[off0 + 16] + 16) >> 5); + dst[1 * dstStride + 16] = (pixel)((f32[1]* ref[off1 + 16] + f[1] * ref[off1 + 16] + 16) >> 5); + dst[2 * dstStride + 16] = (pixel)((f32[2]* ref[off2 + 16] + f[2] * ref[off2 + 16] + 16) >> 5); + ... + dst[16 * dstStride + 16] = (pixel)((f32[16]* ref[off16 + 16] + f[16] * ref[off16 + 16] + 16) >> 5); + ... + dst[31 * dstStride + 16] = (pixel)((f32[31]* ref[off31 + 16] + f[31] * ref[off31 + 16] + 16) >> 5); + + .... + y=31; off3 = offset[3]; x=0-3; + dst[0 * dstStride + 31] = (pixel)((f32[0]* ref[off0 + 31] + f[0] * ref[off0 + 31] + 16) >> 5); + dst[1 * dstStride + 31] = (pixel)((f32[1]* ref[off1 + 31] + f[1] * ref[off1 + 31] + 16) >> 5); + dst[2 * dstStride + 31] = (pixel)((f32[2]* ref[off2 + 31] + f[2] * ref[off2 + 31] + 16) >> 5); + ... 
+ dst[3 * dstStride + 31] = (pixel)((f32[31]* ref[off31 + 31] + f[31] * ref[off31 + 31] + 16) >> 5); + } + */ + +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0xa, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xb, }; +vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xc, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xd, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xe, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xf, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srv10 = vec_perm(sv0, sv1, mask10); + vec_u8_t srv11 = vec_perm(sv0, sv1, mask11); + vec_u8_t srv12 = vec_perm(sv0, sv1, mask12); + vec_u8_t srv13 = vec_perm(sv0, sv1, mask13); + vec_u8_t srv14 = vec_perm(sv0, sv1, mask14); + vec_u8_t srv15 = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv16 = vec_perm(sv1, sv2, mask0); /* mask16 == mask0 */ + vec_u8_t srv17 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv18 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv19 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv21 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv22 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv23 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv24 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv25 = vec_perm(sv1, sv2, mask9); + vec_u8_t srv26 = vec_perm(sv1, sv2, mask10); + vec_u8_t srv27 = vec_perm(sv1, sv2, 
mask11); + vec_u8_t srv28 = vec_perm(sv1, sv2, mask12); + vec_u8_t srv29 = vec_perm(sv1, sv2, mask13); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask14); + vec_u8_t srv31 = vec_perm(sv1, sv2, mask15); + vec_u8_t srv32 = vec_perm(sv2, sv3, mask0); + vec_u8_t srv33 = vec_perm(sv2, sv3, mask1); + +vec_u8_t vfrac32_0 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 0, }; +vec_u8_t vfrac32_1 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 32, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv1, srv2, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv2, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv2, srv3, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv3, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv3, srv4, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv4, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv4, srv5, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv5, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv5, srv6, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv6, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv6, srv7, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv7, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv7, srv8, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv8, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv8, srv9, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv9, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv9, srv10, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv10, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv10, srv11, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv11, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv11, srv12, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv12, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv12, srv13, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv13, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv13, srv14, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv14, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv14, srv15, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv15, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv15, srv16, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv16, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16, srv17, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + 
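/* Editorial note, not part of the original patch: each 32-pixel output row is written
   with two 16-byte vec_xst stores, at byte offsets row*dstStride and row*dstStride+16
   from dst. */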
vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv17, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv17, srv18, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv18, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv18, srv19, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv19, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv19, srv20, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv20, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv20, srv21, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv21, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv21, srv22, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv22, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv22, srv23, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv23, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv23, srv24, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv24, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv24, srv25, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv25, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv25, srv26, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv26, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv26, srv27, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv27, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv27, srv28, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv28, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv28, srv29, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv29, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv29, srv30, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv30, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv30, srv31, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv31, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv31, srv32, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv32, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv32, srv33, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + 
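/* Editorial note, not part of the original patch: the vout_* registers are reused for
   the second half of the block; this batch of stores covers output rows 16 through 31. */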
vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +#ifdef WORDS_BIGENDIAN + vec_u8_t u8_to_s16_w4x4_mask1 = {0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t u8_to_s16_w4x4_mask9 = {0x00, 0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t u8_to_s16_w8x8_mask1 = {0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00, 0x18}; + vec_u8_t u8_to_s16_w8x8_maskh = {0x00, 0x10, 0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17}; + vec_u8_t u8_to_s16_w8x8_maskl = {0x00, 0x18, 0x00, 0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x1d, 0x00, 0x1e, 0x00, 0x1f}; + vec_u8_t u8_to_s16_b0_mask = {0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10}; + vec_u8_t u8_to_s16_b1_mask = {0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11}; + vec_u8_t u8_to_s16_b9_mask = {0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10}; +#else + vec_u8_t u8_to_s16_w4x4_mask1 = {0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t u8_to_s16_w4x4_mask9 = {0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t u8_to_s16_w8x8_mask1 = {0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00, 0x18, 0x00}; + vec_u8_t u8_to_s16_w8x8_maskh = {0x10, 0x00, 0x11, 0x00, 0x12, 0x00, 0x13, 0x00, 0x14, 0x00, 0x15, 0x00, 0x16, 0x00, 0x17, 0x00}; + vec_u8_t u8_to_s16_w8x8_maskl = {0x18, 0x00, 0x19, 0x00, 0x1a, 0x00, 0x1b, 0x00, 0x1c, 0x00, 0x1d, 0x00, 0x1e, 0x00, 0x1f, 0x00}; + vec_u8_t u8_to_s16_b0_mask = {0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00, 0x10, 0x00}; + vec_u8_t u8_to_s16_b1_mask = {0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00, 0x11, 0x00}; + vec_u8_t u8_to_s16_b9_mask = {0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09, 0x10, 0x09}; +#endif +vec_s16_t min_s16v = (vec_s16_t){255, 255, 255, 255, 255, 255, 255, 255}; +vec_u16_t one_u16v = (vec_u16_t)vec_splat_u16(1); + +template<> +void intra_pred<4, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(9, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_u8_t v_filter_u8, v_mask0, v_mask; + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w4x4_mask1)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + 
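/* Editorial note, not part of the original patch: this is the HEVC edge filter for the
   horizontal mode, left[0] + ((top[x] - topLeft) >> 1) applied to the first row; the
   following statement clamps the sum to the 8-bit pixel range before it is packed and
   stored. */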
vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + v_mask0 = (vec_u8_t){0x10, 0x11, 0x12, 0x13, 0x01, 0x01, 0x01, 0x01, 0x02, 0x02, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03}; + v_mask = (vec_u8_t){0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + } + else{ + v_mask0 = (vec_u8_t){0x00, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x01, 0x02, 0x02, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03}; + v_mask = (vec_u8_t){0x00, 0x00, 0x00, 0x00, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + v_filter_u8 = srv; + } + + + if(dstStride == 4) { + vec_u8_t v0 = vec_perm(srv, v_filter_u8, v_mask0); + vec_xst(v0, 0, dst); + } + else if(dstStride%16 == 0){ + vec_u8_t v0 = vec_perm(srv, v_filter_u8, v_mask0); + vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst); + vec_u8_t v1 = vec_sld(v0, v0, 12); + vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride)); + vec_u8_t v2 = vec_sld(v0, v0, 8); + vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2)); + vec_u8_t v3 = vec_sld(v0, v0, 4); + vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3)); + } + else{ + vec_u8_t v_mask1 = {0x01, 0x01, 0x01, 0x01, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x02, 0x02, 0x02, 0x02, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x03, 0x03, 0x03, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(v_filter_u8, vec_xl(0, dst), v_mask); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srv, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srv, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srv, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(17, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if(dstStride == 8) { + vec_u8_t v_mask0 = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01}; + vec_u8_t v_mask1 = {0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03}; + vec_u8_t v_mask2 = {0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05}; + vec_u8_t v_mask3 = {0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07}; + vec_u8_t v0 = vec_perm(srv, srv, v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srv, srv, v_mask1); + vec_xst(v1, 16, dst); + vec_u8_t v2 = vec_perm(srv, srv, v_mask2); + vec_xst(v2, 32, dst); + vec_u8_t v3 = vec_perm(srv, srv, v_mask3); + vec_xst(v3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 
0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask4 = {0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask5 = {0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask6 = {0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask7 = {0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srv, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srv, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srv, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srv, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + vec_u8_t v4 = vec_perm(srv, vec_xl(dstStride*4, dst), v_mask4); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(srv, vec_xl(dstStride*5, dst), v_mask5); + vec_xst(v5, dstStride*5, dst); + vec_u8_t v6 = vec_perm(srv, vec_xl(dstStride*6, dst), v_mask6); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(srv, vec_xl(dstStride*7, dst), v_mask7); + vec_xst(v7, dstStride*7, dst); + } + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_mask1)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_xst( vec_perm(v_filter_u8, vec_xl(0, dst), v_mask0), 0, dst ); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(33, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if(dstStride == 16) { + vec_xst(vec_splat(srv, 0), 0, dst); + vec_xst(vec_splat(srv, 1), 16, dst); + vec_xst(vec_splat(srv, 2), 32, dst); + vec_xst(vec_splat(srv, 3), 48, dst); + vec_xst(vec_splat(srv, 4), 64, dst); + vec_xst(vec_splat(srv, 5), 80, dst); + vec_xst(vec_splat(srv, 6), 96, dst); + vec_xst(vec_splat(srv, 7), 112, dst); + vec_xst(vec_splat(srv, 8), 128, dst); + vec_xst(vec_splat(srv, 9), 144, dst); + vec_xst(vec_splat(srv, 10), 160, dst); + vec_xst(vec_splat(srv, 11), 176, dst); + vec_xst(vec_splat(srv, 12), 192, dst); + vec_xst(vec_splat(srv, 13), 208, dst); + vec_xst(vec_splat(srv, 14), 224, dst); + vec_xst(vec_splat(srv, 15), 240, dst); + } + else{ + vec_xst(vec_splat(srv, 0), 0, dst); + vec_xst(vec_splat(srv, 1), 1*dstStride, dst); + vec_xst(vec_splat(srv, 2), 2*dstStride, dst); + vec_xst(vec_splat(srv, 3), 3*dstStride, dst); + vec_xst(vec_splat(srv, 4), 4*dstStride, dst); + vec_xst(vec_splat(srv, 5), 5*dstStride, dst); + vec_xst(vec_splat(srv, 6), 6*dstStride, dst); + vec_xst(vec_splat(srv, 7), 7*dstStride, dst); + vec_xst(vec_splat(srv, 8), 8*dstStride, dst); + vec_xst(vec_splat(srv, 9), 
9*dstStride, dst); + vec_xst(vec_splat(srv, 10), 10*dstStride, dst); + vec_xst(vec_splat(srv, 11), 11*dstStride, dst); + vec_xst(vec_splat(srv, 12), 12*dstStride, dst); + vec_xst(vec_splat(srv, 13), 13*dstStride, dst); + vec_xst(vec_splat(srv, 14), 14*dstStride, dst); + vec_xst(vec_splat(srv, 15), 15*dstStride, dst); + } + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_u8_t srcv1 = vec_xl(1, srcPix0); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + vec_xst( v_filter_u8, 0, dst ); + } +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<32, 10>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(65, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_u8_t srv1 =vec_xl(81, srcPix0); + vec_u8_t vout; + int offset = 0; + + #define v_pred32(vi, vo, i){\ + vo = vec_splat(vi, i);\ + vec_xst(vo, offset, dst);\ + vec_xst(vo, 16+offset, dst);\ + offset += dstStride;\ + } + + v_pred32(srv, vout, 0); + v_pred32(srv, vout, 1); + v_pred32(srv, vout, 2); + v_pred32(srv, vout, 3); + v_pred32(srv, vout, 4); + v_pred32(srv, vout, 5); + v_pred32(srv, vout, 6); + v_pred32(srv, vout, 7); + v_pred32(srv, vout, 8); + v_pred32(srv, vout, 9); + v_pred32(srv, vout, 10); + v_pred32(srv, vout, 11); + v_pred32(srv, vout, 12); + v_pred32(srv, vout, 13); + v_pred32(srv, vout, 14); + v_pred32(srv, vout, 15); + + v_pred32(srv1, vout, 0); + v_pred32(srv1, vout, 1); + v_pred32(srv1, vout, 2); + v_pred32(srv1, vout, 3); + v_pred32(srv1, vout, 4); + v_pred32(srv1, vout, 5); + v_pred32(srv1, vout, 6); + v_pred32(srv1, vout, 7); + v_pred32(srv1, vout, 8); + v_pred32(srv1, vout, 9); + v_pred32(srv1, vout, 10); + v_pred32(srv1, vout, 11); + v_pred32(srv1, vout, 12); + v_pred32(srv1, vout, 13); + v_pred32(srv1, vout, 14); + v_pred32(srv1, vout, 15); + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_u8_t srcv1 = vec_xl(1, srcPix0); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, 
vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + vec_xst( v_filter_u8, 0, dst ); + + vec_u8_t srcv2 = vec_xl(17, srcPix0); + vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh)); + vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl)); + vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v ); + vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v ); + vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16); + vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16); + vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum)); + vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum)); + vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16); + vec_xst( v2_filter_u8, 16, dst ); + + } +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, }; + vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, }; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, 30, 28, 26, 24, }; + vec_u8_t vfrac4_32 = (vec_u8_t){2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, 2, 4, 6, 8, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 
0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, }; +//vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, }; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); +vec_u8_t vfrac8 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 30, 28, 26, 24, 22, 20, 18, 16, }; +vec_u8_t vfrac8_32 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 2, 4, 6, 8, 10, 12, 14, 16, }; + + one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); + one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); + one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); + one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); 
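/* Editorial note, not part of the original patch: for an 8-pixel-wide block each 16-byte
   store is a read-modify-write: vec_xl loads the destination row, and vec_perm keeps the
   8 new pixels plus the 8 bytes that already follow them, so memory beyond the row is
   left untouched. */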
+ vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, }; +vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, }; +vec_u8_t maskadd1_15={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(48, srcPix0); + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = 
vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; + vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15); + +vec_u8_t vfrac16 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0, }; +vec_u8_t vfrac16_32 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, }; + + one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 11>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask1={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask2={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask3={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask4={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 
0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask5={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask6={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask7={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask8={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask9={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask10={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask11={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask12={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask13={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask14={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, }; +vec_u8_t mask15={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, }; + +vec_u8_t mask16_0={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, }; +/*vec_u8_t mask16_1={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask16_2={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask16_3={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask16_4={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask16_5={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask16_6={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask16_7={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask16_8={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask16_9={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask16_10={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask16_11={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask16_12={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask16_13={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask16_14={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask16_15={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, }; +*/ +vec_u8_t maskadd1_31={0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, 0x21, }; +vec_u8_t maskadd1_16_31={0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + +vec_u8_t refmask_32_0={0x10, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}; +vec_u8_t refmask_32_1={0x0, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + + vec_u8_t s0 = vec_perm( 
vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(79, srcPix0); + vec_u8_t s2 = vec_xl(95, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s0, mask0); + vec_u8_t srv1 = vec_perm(s0, s0, mask1); + vec_u8_t srv2 = vec_perm(s0, s0, mask2); + vec_u8_t srv3 = vec_perm(s0, s0, mask3); + vec_u8_t srv4 = vec_perm(s0, s0, mask4); + vec_u8_t srv5 = vec_perm(s0, s0, mask5); + vec_u8_t srv6 = vec_perm(s0, s0, mask6); + vec_u8_t srv7 = vec_perm(s0, s0, mask7); + vec_u8_t srv8 = vec_perm(s0, s0, mask8); + vec_u8_t srv9 = vec_perm(s0, s0, mask9); + vec_u8_t srv10 = vec_perm(s0, s0, mask10); + vec_u8_t srv11 = vec_perm(s0, s0, mask11); + vec_u8_t srv12= vec_perm(s0, s0, mask12); + vec_u8_t srv13 = vec_perm(s0, s0, mask13); + vec_u8_t srv14 = vec_perm(s0, s0, mask14); + vec_u8_t srv15 = vec_perm(s1, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0); + vec_u8_t srv16_1 = srv0; + vec_u8_t srv16_2 = srv1; + vec_u8_t srv16_3 = srv2; + vec_u8_t srv16_4 = srv3; + vec_u8_t srv16_5 = srv4; + vec_u8_t srv16_6 = srv5; + vec_u8_t srv16_7 = srv6; + vec_u8_t srv16_8 = srv7; + vec_u8_t srv16_9 = srv8; + vec_u8_t srv16_10 = srv9; + vec_u8_t srv16_11 = srv10; + vec_u8_t srv16_12= srv11; + vec_u8_t srv16_13 = srv12; + vec_u8_t srv16_14 = srv13; + vec_u8_t srv16_15 =srv14; + +/* + vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5); + vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s0, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s0, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s0, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s0, mask16_13); + vec_u8_t srv16_14 = vec_perm(s0, s0, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s0, mask16_15); +*/ + vec_u8_t srv16 = vec_perm(s1, s1, mask0); + vec_u8_t srv17 = vec_perm(s1, s1, mask1); + vec_u8_t srv18 = vec_perm(s1, s1, mask2); + vec_u8_t srv19 = vec_perm(s1, s1, mask3); + vec_u8_t srv20 = vec_perm(s1, s1, mask4); + vec_u8_t srv21 = vec_perm(s1, s1, mask5); + vec_u8_t srv22 = vec_perm(s1, s1, mask6); + vec_u8_t srv23 = vec_perm(s1, s1, mask7); + vec_u8_t srv24 = vec_perm(s1, s1, mask8); + vec_u8_t srv25 = vec_perm(s1, s1, mask9); + vec_u8_t srv26 = vec_perm(s1, s1, mask10); + vec_u8_t srv27 = vec_perm(s1, s1, mask11); + vec_u8_t srv28 = vec_perm(s1, s1, mask12); + vec_u8_t srv29 = vec_perm(s1, s1, mask13); + vec_u8_t srv30 = vec_perm(s1, s1, mask14); + vec_u8_t srv31 = vec_perm(s2, s2, mask15); + +/* + vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3); + vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s1, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s1, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s1, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s1, mask16_13); + vec_u8_t 
srv16_30 = vec_perm(s1, s1, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s1, mask16_15); +*/ + vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0); + vec_u8_t srv16_17 = srv16; + vec_u8_t srv16_18 = srv17; + vec_u8_t srv16_19 = srv18; + vec_u8_t srv16_20 = srv19; + vec_u8_t srv16_21 = srv20; + vec_u8_t srv16_22 = srv21; + vec_u8_t srv16_23 = srv22; + vec_u8_t srv16_24 = srv23; + vec_u8_t srv16_25 = srv24; + vec_u8_t srv16_26 = srv25; + vec_u8_t srv16_27 = srv26; + vec_u8_t srv16_28 = srv27; + vec_u8_t srv16_29 = srv28; + vec_u8_t srv16_30 = srv29; + vec_u8_t srv16_31 = srv30; + + vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31); + + vec_u8_t srv16add1_16 = srv16_17; + vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0, }; +vec_u8_t vfrac32_1 = (vec_u8_t){30, 28, 26, 24, 22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, }; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + 
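+    /* Each one_line() invocation presumably expands to the mule/mulo, +16, >>5,
+     * pack sequence written out explicitly in the 4x4 variants below: a 16-lane
+     * evaluation of
+     *   dst[y * dstStride + x] = (pixel)(((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5)
+     * where the srvN vector supplies ref[off + x], the matching srvNadd1 vector
+     * supplies ref[off + x + 1], and the vfrac32 vectors hold f and 32 - f. Two
+     * calls per row cover the 32-pixel width (f, ref and off are names taken from
+     * the reference comments, not identifiers in this patch). */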
one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, 
vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + 
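+    /* These two splat constants are the rounding offset (+16) and the shift amount
+     * (>> 5) of the angular interpolation quoted later in this function:
+     *   dst[y * dstStride + x] = (pixel)(((32 - f) * ref[x] + f * ref[x + 1] + 16) >> 5) */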
vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, }; + vec_u8_t mask1={0x1, 0x1, 0x1, 0x1, 0x2, 0x2, 0x2, 0x2, 0x3, 0x3, 0x3, 0x3, 0x4, 0x4, 0x4, 0x4, }; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, 27, 22, 17, 12, }; + vec_u8_t vfrac4_32 = (vec_u8_t){5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, 5, 10, 15, 20, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask1={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask2={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, }; +vec_u8_t mask3={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask4={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, }; +vec_u8_t mask5={0x6, 0x6, 0x6, 
0x6, 0x6, 0x6, 0x5, 0x5, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask6={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask7={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + +vec_u8_t vfrac8 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 27, 22, 17, 12, 7, 2, 29, 24, }; +vec_u8_t vfrac8_32 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 5, 10, 15, 20, 25, 30, 3, 8, }; + + one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); + one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); + one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); + one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, }; +vec_u8_t mask1={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask2={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask3={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask4={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask5={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 
0x5, 0x5, }; +vec_u8_t mask6={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask7={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask8={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask9={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask10={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask11={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask12={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask13={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask14={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask15={0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, }; + +vec_u8_t maskadd1_15={0x12, 0x12, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(33, srcPix0); + vec_u8_t refmask_16={0xd, 0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(46, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s0, mask0); + vec_u8_t srv1 = vec_perm(s0, s0, mask1); + vec_u8_t srv2 = vec_perm(s0, s0, mask2); + vec_u8_t srv3 = vec_perm(s0, s0, mask3); + vec_u8_t srv4 = vec_perm(s0, s0, mask4); + vec_u8_t srv5 =vec_perm(s0, s0, mask5); + vec_u8_t srv6 = vec_perm(s0, s0, mask6); + vec_u8_t srv7 = vec_perm(s0, s0, mask7); + vec_u8_t srv8 = vec_perm(s0, s0, mask8); + vec_u8_t srv9 = vec_perm(s0, s0, mask9); + vec_u8_t srv10 = vec_perm(s0, s0, mask10); + vec_u8_t srv11 = vec_perm(s0, s0, mask11); + vec_u8_t srv12= vec_perm(s0, s0, mask12); + vec_u8_t srv13 = vec_perm(s0, s0, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; + vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s1, s1, maskadd1_15); + +vec_u8_t vfrac16 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, 
vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<32, 12>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask1={0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask2={0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask3={0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask4={0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask5={0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask6={0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask7={0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask8={0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask9={0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask10={0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask11={0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask12={0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask13={0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, }; +vec_u8_t mask14={0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, }; +vec_u8_t mask15={0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, }; + +vec_u8_t mask16_0={0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, }; +vec_u8_t mask16_1={0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask16_2={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask16_3={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 
0x3, 0x3, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask16_4={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask16_5={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask16_6={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask16_7={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask16_8={0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask16_9={0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask16_10={0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask16_11={0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask16_12={0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask16_13={0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask16_14={0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask16_15={0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, }; + +vec_u8_t maskadd1_31={0x4, 0x4, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t maskadd1_16_31={0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1a, 0x13, 0xd, 0x6, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(76, srcPix0); + vec_u8_t s2 = vec_xl(92, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s0, mask0); + vec_u8_t srv1 = vec_perm(s0, s0, mask1); + vec_u8_t srv2 = vec_perm(s0, s0, mask2); + vec_u8_t srv3 = vec_perm(s0, s0, mask3); + vec_u8_t srv4 = vec_perm(s0, s0, mask4); + vec_u8_t srv5 = vec_perm(s0, s0, mask5); + vec_u8_t srv6 = vec_perm(s0, s0, mask6); + vec_u8_t srv7 = vec_perm(s0, s0, mask7); + vec_u8_t srv8 = vec_perm(s0, s0, mask8); + vec_u8_t srv9 = vec_perm(s0, s0, mask9); + vec_u8_t srv10 = vec_perm(s0, s0, mask10); + vec_u8_t srv11 = vec_perm(s0, s0, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s1, s1, mask14); + vec_u8_t srv15 = vec_perm(s1, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5); + vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s0, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s0, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s0, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s0, mask16_13); + vec_u8_t srv16_14 = 
vec_perm(s0, s1, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15); + + vec_u8_t srv16 = vec_perm(s1, s1, mask0); + vec_u8_t srv17 = vec_perm(s1, s1, mask1); + vec_u8_t srv18 = vec_perm(s1, s1, mask2); + vec_u8_t srv19 = vec_perm(s1, s1, mask3); + vec_u8_t srv20 = vec_perm(s1, s1, mask4); + vec_u8_t srv21 = vec_perm(s1, s1, mask5); + vec_u8_t srv22 = vec_perm(s1, s1, mask6); + vec_u8_t srv23 = vec_perm(s1, s1, mask7); + vec_u8_t srv24 = vec_perm(s1, s1, mask8); + vec_u8_t srv25 = vec_perm(s1, s1, mask9); + vec_u8_t srv26 = vec_perm(s1, s1, mask10); + vec_u8_t srv27 = vec_perm(s1, s1, mask11); + vec_u8_t srv28 = vec_perm(s1, s2, mask12); + vec_u8_t srv29 = vec_perm(s1, s2, mask13); + vec_u8_t srv30 = vec_perm(s2, s2, mask14); + vec_u8_t srv31 = vec_perm(s2, s2, mask15); + + vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3); + vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s1, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s1, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s1, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s1, mask16_13); + vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15); + + vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31); + + vec_u8_t srv16add1_16 = srv16_17; + vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = 
srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){27, 22, 17, 12, 7, 2, 29, 24, 19, 14, 9, 4, 31, 26, 21, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){11, 6, 1, 28, 23, 18, 13, 8, 3, 30, 25, 20, 15, 10, 5, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 32, }; + + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, 
dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, 
dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x1, 0x1, 0x1, 0x0, 0x2, 0x2, 0x2, 0x1, 0x3, 0x3, 0x3, 0x2, 0x4, 0x4, 0x4, 0x3, }; + vec_u8_t mask1={0x2, 0x2, 0x2, 0x1, 0x3, 0x3, 0x3, 0x2, 0x4, 0x4, 0x4, 0x3, 0x5, 0x5, 0x5, 0x4, }; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + +vec_u8_t vfrac4 = (vec_u8_t){23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, 23, 14, 5, 28, }; +vec_u8_t vfrac4_32 = (vec_u8_t){9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, 9, 18, 27, 4, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t 
v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, }; + vec_u8_t mask1={0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, }; + vec_u8_t mask2={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, }; + vec_u8_t mask3={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, }; + vec_u8_t mask4={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, }; + vec_u8_t mask5={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, }; + vec_u8_t mask6={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, }; + vec_u8_t mask7={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, }; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + vec_u8_t vfrac8 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 23, 14, 5, 28, 19, 10, 1, 24, }; + vec_u8_t vfrac8_32 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 9, 18, 27, 4, 13, 22, 31, 8, }; + + one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); + one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); + one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); + one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), 
v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, }; +vec_u8_t mask1={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask2={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask3={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, }; +vec_u8_t mask4={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask5={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, }; +vec_u8_t mask6={0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask7={0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask8={0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, }; +vec_u8_t mask9={0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, }; +vec_u8_t mask10={0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, }; +vec_u8_t mask11={0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, }; +vec_u8_t mask12={0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, }; +vec_u8_t mask13={0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, }; +vec_u8_t mask14={0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, }; +vec_u8_t mask15={0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, }; +vec_u8_t maskadd1_15={0x14, 0x14, 0x14, 0x13, 0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(44, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; 
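+    /* The *_add1 vectors are ref[off + x + 1] for each row; since every maskN+1 is
+     * maskN advanced by one reference index, that value is already available as the
+     * next row's srv vector, so plain aliasing avoids another round of vec_perm. */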
+ vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15); + +vec_u8_t vfrac16 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 13>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask1={0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, }; +vec_u8_t mask2={0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask3={0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask4={0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, }; +vec_u8_t mask5={0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, }; +vec_u8_t mask6={0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, }; +vec_u8_t mask7={0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, }; +vec_u8_t mask8={0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 
0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, }; +vec_u8_t mask9={0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, }; +vec_u8_t mask10={0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, }; +vec_u8_t mask11={0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, }; +vec_u8_t mask12={0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, }; +vec_u8_t mask13={0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask14={0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask15={0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, }; + +vec_u8_t mask16_0={0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, }; +vec_u8_t mask16_1={0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, }; +vec_u8_t mask16_2={0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x2, }; +vec_u8_t mask16_3={0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, 0x3, 0x3, 0x3, 0x3, }; +vec_u8_t mask16_4={0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x4, }; +vec_u8_t mask16_5={0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, }; +vec_u8_t mask16_6={0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x6, }; +vec_u8_t mask16_7={0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, }; +vec_u8_t mask16_8={0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, 0x8, 0x8, 0x8, 0x8, }; +vec_u8_t mask16_9={0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, 0x9, 0x9, 0x9, 0x9, }; +vec_u8_t mask16_10={0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, 0xa, 0xa, 0xa, 0xa, }; +vec_u8_t mask16_11={0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, 0xb, 0xb, 0xb, 0xb, }; +vec_u8_t mask16_12={0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, 0xc, 0xc, 0xc, 0xc, }; +vec_u8_t mask16_13={0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, 0xd, 0xd, 0xd, 0xd, }; +vec_u8_t mask16_14={0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xe, 0xe, 0xe, }; +vec_u8_t mask16_15={0x13, 0x12, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0x10, 0x10, 0xf, 0xf, 0xf, 0xf, }; + +vec_u8_t maskadd1_31={0x8, 0x8, 0x8, 0x7, 0x7, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t maskadd1_16_31={0x4, 0x3, 0x3, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(72, srcPix0); + vec_u8_t s2 = vec_xl(88, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s0, mask0); + vec_u8_t srv1 = vec_perm(s0, s0, mask1); + vec_u8_t srv2 = vec_perm(s0, s0, mask2); + vec_u8_t srv3 = vec_perm(s0, s0, mask3); + vec_u8_t srv4 = vec_perm(s0, s0, mask4); + vec_u8_t srv5 = vec_perm(s0, 
s0, mask5); + vec_u8_t srv6 = vec_perm(s0, s0, mask6); + vec_u8_t srv7 = vec_perm(s0, s0, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s1, s1, mask12); + vec_u8_t srv13 = vec_perm(s1, s1, mask13); + vec_u8_t srv14 = vec_perm(s1, s1, mask14); + vec_u8_t srv15 = vec_perm(s1, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5); + vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s0, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s0, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s1, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13); + vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15); + + vec_u8_t srv16 = vec_perm(s1, s1, mask0); + vec_u8_t srv17 = vec_perm(s1, s1, mask1); + vec_u8_t srv18 = vec_perm(s1, s1, mask2); + vec_u8_t srv19 = vec_perm(s1, s1, mask3); + vec_u8_t srv20 = vec_perm(s1, s1, mask4); + vec_u8_t srv21 = vec_perm(s1, s1, mask5); + vec_u8_t srv22 = vec_perm(s1, s1, mask6); + vec_u8_t srv23 = vec_perm(s1, s1, mask7); + vec_u8_t srv24 = vec_perm(s1, s2, mask8); + vec_u8_t srv25 = vec_perm(s1, s2, mask9); + vec_u8_t srv26 = vec_perm(s1, s2, mask10); + vec_u8_t srv27 = vec_perm(s1, s2, mask11); + vec_u8_t srv28 = vec_perm(s2, s2, mask12); + vec_u8_t srv29 = vec_perm(s2, s2, mask13); + vec_u8_t srv30 = vec_perm(s2, s2, mask14); + vec_u8_t srv31 = vec_perm(s2, s2, mask15); + + vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3); + vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s1, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s1, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13); + vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15); + + vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t 
srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31); + + vec_u8_t srv16add1_16 = srv16_17; + vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){23, 14, 5, 28, 19, 10, 1, 24, 15, 6, 29, 20, 11, 2, 25, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){7, 30, 21, 12, 3, 26, 17, 8, 31, 22, 13, 4, 27, 18, 9, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 32, }; + + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, 
vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, 
vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + + +template<> +void intra_pred<4, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x1, 0x1, 0x0, 0x0, 0x2, 0x2, 0x1, 0x1, 0x3, 0x3, 0x2, 0x2, 0x4, 0x4, 0x3, 0x3, }; + vec_u8_t mask1={0x2, 0x2, 0x1, 0x1, 0x3, 0x3, 0x2, 0x2, 0x4, 0x4, 0x3, 0x3, 0x5, 0x5, 0x4, 0x4, }; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, 19, 6, 25, 12, }; + vec_u8_t vfrac4_32 = (vec_u8_t){13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, 13, 26, 7, 20, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + 
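/* Editor's note - explanatory comment, not part of the upstream patch.
 * vec_mule/vec_mulo widen the u8 multiplies into u16 lanes, so the weighted sum
 * is evaluated separately for the even and odd byte lanes. Per pixel this is the
 * usual two-tap angular interpolation; a hypothetical scalar sketch (names
 * invented for illustration) would be roughly:
 *
 *   dst[y * dstStride + x] =
 *       (pixel)(((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5);
 *
 * where frac and off are the per-lane fraction and reference offset encoded in
 * vfrac4/vfrac4_32 and the permute masks above. The even results (ve) and odd
 * results (vo) are re-interleaved below with vec_mergeh/vec_mergel and narrowed
 * back to bytes by vec_pack.
 */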
vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x1, 0x0, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, }; +vec_u8_t mask1={0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, }; +vec_u8_t mask2={0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, }; +vec_u8_t mask3={0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, }; +vec_u8_t mask4={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, }; +vec_u8_t mask5={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, }; +vec_u8_t mask6={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, }; +vec_u8_t mask7={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, }; +//vec_u8_t mask8={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + vec_u8_t vfrac8 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 19, 6, 25, 12, 31, 18, 5, 24, }; + vec_u8_t vfrac8_32 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 
13, 26, 7, 20, 1, 14, 27, 8, }; + + one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); + one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); + one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); + one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, }; +vec_u8_t mask1={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask2={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask3={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, }; +vec_u8_t mask4={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask5={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, }; +vec_u8_t mask6={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask7={0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask8={0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, }; +vec_u8_t mask9={0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, }; +vec_u8_t mask10={0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, }; +vec_u8_t mask11={0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, }; +vec_u8_t mask12={0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, }; +vec_u8_t mask13={0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, }; +vec_u8_t mask14={0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, }; +vec_u8_t mask15={0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, }; +vec_u8_t maskadd1_15={0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t 
u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + //vec_u8_t s1 = vec_xl(40, srcPix0); + vec_u8_t s1 = vec_xl(42, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; + vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15); + +vec_u8_t vfrac16 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, 
dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 14>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask1={0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask2={0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, }; +vec_u8_t mask3={0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, }; +vec_u8_t mask4={0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, }; +vec_u8_t mask5={0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, }; +vec_u8_t mask6={0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, }; +vec_u8_t mask7={0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, }; +vec_u8_t mask8={0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, }; +vec_u8_t mask9={0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, }; +vec_u8_t mask10={0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, }; +vec_u8_t mask11={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask12={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask13={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, }; +vec_u8_t mask14={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask15={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, }; + +vec_u8_t mask16_0={0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, 0x0, }; +vec_u8_t mask16_1={0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x1, }; +vec_u8_t mask16_2={0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, }; +vec_u8_t mask16_3={0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x3, }; +vec_u8_t mask16_4={0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x4, }; +vec_u8_t mask16_5={0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x5, }; +vec_u8_t mask16_6={0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x6, }; +vec_u8_t mask16_7={0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x7, }; +vec_u8_t mask16_8={0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, }; +vec_u8_t mask16_9={0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x9, }; +vec_u8_t mask16_10={0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, }; +vec_u8_t mask16_11={0x11, 0x10, 0x10, 0xf, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xb, }; +vec_u8_t mask16_12={0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xc, }; +vec_u8_t mask16_13={0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 
0xf, 0xe, 0xe, 0xd, 0xd, 0xd, }; +vec_u8_t mask16_14={0x14, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xe, }; +vec_u8_t mask16_15={0x15, 0x14, 0x14, 0x13, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xf, }; + +vec_u8_t maskadd1_31={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t maskadd1_16_31={0x6, 0x5, 0x5, 0x4, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, 0x0, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x10, 0x11, 0x12}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(68, srcPix0); + vec_u8_t s2 = vec_xl(84, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s0, mask0); + vec_u8_t srv1 = vec_perm(s0, s0, mask1); + vec_u8_t srv2 = vec_perm(s0, s0, mask2); + vec_u8_t srv3 = vec_perm(s0, s0, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s1, s1, mask10); + vec_u8_t srv11 = vec_perm(s1, s1, mask11); + vec_u8_t srv12= vec_perm(s1, s1, mask12); + vec_u8_t srv13 = vec_perm(s1, s1, mask13); + vec_u8_t srv14 = vec_perm(s1, s1, mask14); + vec_u8_t srv15 = vec_perm(s1, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5); + vec_u8_t srv16_6 = vec_perm(s0, s0, mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s0, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s1, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13); + vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15); + + vec_u8_t srv16 = vec_perm(s1, s1, mask0); + vec_u8_t srv17 = vec_perm(s1, s1, mask1); + vec_u8_t srv18 = vec_perm(s1, s1, mask2); + vec_u8_t srv19 = vec_perm(s1, s1, mask3); + vec_u8_t srv20 = vec_perm(s1, s2, mask4); + vec_u8_t srv21 = vec_perm(s1, s2, mask5); + vec_u8_t srv22 = vec_perm(s1, s2, mask6); + vec_u8_t srv23 = vec_perm(s1, s2, mask7); + vec_u8_t srv24 = vec_perm(s1, s2, mask8); + vec_u8_t srv25 = vec_perm(s1, s2, mask9); + vec_u8_t srv26 = vec_perm(s2, s2, mask10); + vec_u8_t srv27 = vec_perm(s2, s2, mask11); + vec_u8_t srv28 = vec_perm(s2, s2, mask12); + vec_u8_t srv29 = vec_perm(s2, s2, mask13); + vec_u8_t srv30 = vec_perm(s2, s2, mask14); + vec_u8_t srv31 = vec_perm(s2, s2, mask15); + + vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3); + vec_u8_t srv16_20 = 
vec_perm(s1, s1, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s1, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13); + vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15); + + vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s2, maskadd1_31); + + vec_u8_t srv16add1_16 = srv16_17; + vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){19, 6, 25, 12, 31, 18, 5, 24, 11, 30, 17, 4, 23, 10, 29, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){3, 22, 9, 28, 15, 2, 21, 8, 27, 14, 1, 20, 7, 26, 13, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 32, }; + + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + 
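/* Editor's note - explanatory comment, not part of the upstream patch.
 * one_line() is a macro defined earlier in this file (its definition is outside
 * this hunk); judging from the fully expanded 4x4 path above, it is assumed to
 * perform the same interpolation step for one 16-byte result, roughly:
 *
 *   vmle0 = vec_mule(ref, frac32);   vmlo0 = vec_mulo(ref, frac32);
 *   vmle1 = vec_mule(refp1, frac);   vmlo1 = vec_mulo(refp1, frac);
 *   ve = vec_sra(vec_add(vec_add(vmle0, vmle1), u16_16), u16_5);
 *   vo = vec_sra(vec_add(vec_add(vmlo0, vmlo1), u16_16), u16_5);
 *   vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));
 *
 * Each 32-pixel row is therefore built from two such results: vout_<2k> covers
 * columns 0-15 (weights vfrac32_0/vfrac32_32_0) and vout_<2k+1> covers columns
 * 16-31 (weights vfrac32_1/vfrac32_32_1), matching the paired vec_xst stores
 * further down.
 */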
one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, 
vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + + +template<> +void intra_pred<4, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; 
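/* Editor's note - explanatory comment, not part of the upstream patch.
 * Mode 15 is one of the negative-angle (horizontal-ish) HEVC modes; assuming the
 * standard angle table entry intraPredAngle = -17, the constants below follow
 * directly from it. For column x:
 *
 *   frac[x] = ((x + 1) * (-17)) & 31   ->  15, 30, 13, 28  for x = 0..3
 *
 * which is exactly vfrac4, while vfrac4_32 holds the complementary weights
 * 32 - frac[x]. refmask_4 gathers the left-column samples projected onto the
 * main reference via the inverse angle, so the rest of the routine is the same
 * two-tap interpolation used by the other angular kernels.
 */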
+ vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x2, 0x1, 0x1, 0x0, 0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, }; + vec_u8_t mask1={0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, 0x6, 0x5, 0x5, 0x4, }; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, 15, 30, 13, 28, }; + vec_u8_t vfrac4_32 = (vec_u8_t){17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, 17, 2, 19, 4, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, }; +vec_u8_t mask1={0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, }; +vec_u8_t mask2={0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, }; +vec_u8_t mask3={0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, }; +vec_u8_t mask4={0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, }; +vec_u8_t mask5={0x9, 0x8, 0x8, 
0x7, 0x7, 0x6, 0x6, 0x5, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, }; +vec_u8_t mask6={0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, }; +vec_u8_t mask7={0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, }; +//vec_u8_t mask8={0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + vec_u8_t vfrac8 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 15, 30, 13, 28, 11, 26, 9, 24, }; + vec_u8_t vfrac8_32 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 17, 2, 19, 4, 21, 6, 23, 8, }; + + one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); + one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); + one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); + one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, }; +vec_u8_t mask1={0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, }; +vec_u8_t mask2={0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, }; +vec_u8_t mask3={0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, }; +vec_u8_t mask4={0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 
0x5, 0x4, }; +vec_u8_t mask5={0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, }; +vec_u8_t mask6={0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, }; +vec_u8_t mask7={0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, }; +vec_u8_t mask8={0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, }; +vec_u8_t mask9={0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, }; +vec_u8_t mask10={0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, }; +vec_u8_t mask11={0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, }; +vec_u8_t mask12={0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, }; +vec_u8_t mask13={0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, }; +vec_u8_t mask14={0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, }; +vec_u8_t mask15={0x17, 0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, }; +vec_u8_t maskadd1_15={0x18, 0x17, 0x17, 0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(40, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; + vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15); + +vec_u8_t vfrac16 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, }; +vec_u8_t vfrac16_32 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, 
vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 15>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, }; +vec_u8_t mask1={0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, }; +vec_u8_t mask2={0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, }; +vec_u8_t mask3={0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, }; +vec_u8_t mask4={0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, }; +vec_u8_t mask5={0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, }; +vec_u8_t mask6={0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, }; +vec_u8_t mask7={0x17, 0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, }; +vec_u8_t mask8={0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, }; +vec_u8_t mask9={0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, }; +vec_u8_t mask10={0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, }; +vec_u8_t mask11={0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, }; +vec_u8_t mask12={0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, }; +vec_u8_t mask13={0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, }; +vec_u8_t mask14={0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, }; +vec_u8_t mask15={0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, }; + +vec_u8_t mask16_0={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 
0x1, 0x1, 0x0, 0x0, }; +vec_u8_t mask16_1={0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask16_2={0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask16_3={0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, }; +vec_u8_t mask16_4={0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask16_5={0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, 0x5, 0x5, }; +vec_u8_t mask16_6={0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask16_7={0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask16_8={0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, 0x8, }; +vec_u8_t mask16_9={0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, }; +vec_u8_t mask16_10={0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, }; +vec_u8_t mask16_11={0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, }; +vec_u8_t mask16_12={0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, }; +vec_u8_t mask16_13={0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, }; +vec_u8_t mask16_14={0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, 0xe, 0xe, }; +vec_u8_t mask16_15={0x16, 0x16, 0x15, 0x15, 0x14, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11, 0x11, 0x10, 0x10, 0xf, 0xf, }; + +vec_u8_t maskadd1_31={0x10, 0xf, 0xf, 0xe, 0xe, 0xd, 0xd, 0xc, 0xc, 0xb, 0xb, 0xa, 0xa, 0x9, 0x9, 0x8, }; +vec_u8_t maskadd1_16_31={0x7, 0x7, 0x6, 0x6, 0x5, 0x5, 0x4, 0x4, 0x3, 0x3, 0x2, 0x2, 0x1, 0x1, 0x0, 0x0, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2}; + vec_u8_t refmask_32_1={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(80, srcPix0); + vec_u8_t s3 = vec_xl(96, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s1, s1, mask8); + vec_u8_t srv9 = vec_perm(s1, s1, mask9); + vec_u8_t srv10 = vec_perm(s1, s1, mask10); + vec_u8_t srv11 = vec_perm(s1, s1, mask11); + vec_u8_t srv12= vec_perm(s1, s1, mask12); + vec_u8_t srv13 = vec_perm(s1, s1, mask13); + vec_u8_t srv14 = vec_perm(s1, s1, mask14); + vec_u8_t srv15 = vec_perm(s1, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s0, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s0, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s0, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s0, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s0, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s0, mask16_5); + vec_u8_t srv16_6 = vec_perm(s0, s0, 
mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s0, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s0, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s1, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s1, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13); + vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15); + + vec_u8_t srv16 = vec_perm(s1, s2, mask0); + vec_u8_t srv17 = vec_perm(s1, s2, mask1); + vec_u8_t srv18 = vec_perm(s1, s2, mask2); + vec_u8_t srv19 = vec_perm(s1, s2, mask3); + vec_u8_t srv20 = vec_perm(s1, s2, mask4); + vec_u8_t srv21 = vec_perm(s1, s2, mask5); + vec_u8_t srv22 = vec_perm(s1, s2, mask6); + vec_u8_t srv23 = vec_perm(s1, s2, mask7); + vec_u8_t srv24 = vec_perm(s2, s2, mask8); + vec_u8_t srv25 = vec_perm(s2, s2, mask9); + vec_u8_t srv26 = vec_perm(s2, s2, mask10); + vec_u8_t srv27 = vec_perm(s2, s2, mask11); + vec_u8_t srv28 = vec_perm(s2, s2, mask12); + vec_u8_t srv29 = vec_perm(s2, s2, mask13); + vec_u8_t srv30 = vec_perm(s2, s2, mask14); + vec_u8_t srv31 = vec_perm(s2, s2, mask15); + + vec_u8_t srv16_16 = vec_perm(s1, s1, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s1, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s1, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s1, mask16_3); + vec_u8_t srv16_20 = vec_perm(s1, s1, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s1, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s1, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s1, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s1, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13); + vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15); + + vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s3, maskadd1_31); + + vec_u8_t srv16add1_16 = srv16_17; + 
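/* Editor's note - explanatory comment, not part of the upstream patch.
 * The srv*add1 values are plain aliases rather than fresh gathers: each permute
 * mask in this mode appears to advance the reference index by exactly one
 * sample relative to the previous one, so the "ref + 1" operand needed by the
 * interpolation is already available as the next srv vector. Only the final
 * row's two halves require dedicated permutes (maskadd1_31 and maskadd1_16_31),
 * which roughly halves the number of vec_perm operations for the 32x32 block.
 */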
vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){15, 30, 13, 28, 11, 26, 9, 24, 7, 22, 5, 20, 3, 18, 1, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){31, 14, 29, 12, 27, 10, 25, 8, 23, 6, 21, 4, 19, 2, 17, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 32, }; + + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); 
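/* Descriptive note on the surrounding stores: each 32-pixel output row of this
   block is written as two 16-byte vec_xst stores -- columns 0-15 at byte offset
   row*dstStride and columns 16-31 at row*dstStride + 16 -- which is why every
   row consumes a pair of vout_N vectors. */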
+ vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); 
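/* Descriptive note: the second batch of one_line() steps above interpolates
   rows 16-31 of the 32x32 block; the fraction vectors (vfrac32_*) are unchanged
   and only the reference vectors (srv16..srv31 / srv16_16..srv16_31) advance,
   and the surrounding vec_xst calls store those rows at byte offsets
   dstStride*16 through dstStride*31 (plus 16 for the right-hand 16 columns). */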
+ vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x2, 0x1, 0x1, 0x0, 0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, }; + vec_u8_t mask1={0x3, 0x2, 0x2, 0x1, 0x4, 0x3, 0x3, 0x2, 0x5, 0x4, 0x4, 0x3, 0x6, 0x5, 0x5, 0x4, }; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12, 11, 22, 1, 12, }; + vec_u8_t vfrac4_32 = (vec_u8_t){21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20, 21, 10, 31, 20, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 
0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, 0x0, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, }; +vec_u8_t mask1={0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, }; +vec_u8_t mask2={0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, }; +vec_u8_t mask3={0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, }; +vec_u8_t mask4={0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, }; +vec_u8_t mask5={0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, }; +vec_u8_t mask6={0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, }; +vec_u8_t mask7={0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, }; +//vec_u8_t mask8={0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + vec_u8_t vfrac8 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 11, 22, 1, 12, 23, 2, 13, 24, }; + vec_u8_t vfrac8_32 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 21, 10, 31, 20, 9, 30, 19, 8, }; + +one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); +one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); +one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); +one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + 
vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, }; +vec_u8_t mask1={0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, }; +vec_u8_t mask2={0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, }; +vec_u8_t mask3={0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, }; +vec_u8_t mask4={0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, }; +vec_u8_t mask5={0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, }; +vec_u8_t mask6={0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, }; +vec_u8_t mask7={0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, }; +vec_u8_t mask8={0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, }; +vec_u8_t mask9={0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, }; +vec_u8_t mask10={0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, }; +vec_u8_t mask11={0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, }; +vec_u8_t mask12={0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, }; +vec_u8_t mask13={0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, }; +vec_u8_t mask14={0x18, 0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, }; +vec_u8_t mask15={0x19, 0x18, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, }; +vec_u8_t maskadd1_15={0x1a, 0x19, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(38, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); 
+ vec_u8_t srv5 =vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; + vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15); + + vec_u8_t vfrac16 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, }; + vec_u8_t vfrac16_32 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 16>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, }; +vec_u8_t mask1={0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, }; +vec_u8_t 
mask2={0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, }; +vec_u8_t mask3={0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, }; +vec_u8_t mask4={0x18, 0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, }; +vec_u8_t mask5={0x19, 0x18, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, }; +vec_u8_t mask6={0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, }; +vec_u8_t mask7={0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, }; +vec_u8_t mask8={0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, }; +vec_u8_t mask9={0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, }; +vec_u8_t mask10={0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, }; +vec_u8_t mask11={0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, }; +vec_u8_t mask12={0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, }; +vec_u8_t mask13={0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, }; +vec_u8_t mask14={0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, }; +vec_u8_t mask15={0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, }; + +vec_u8_t mask16_0={0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, 0x0, }; +vec_u8_t mask16_1={0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x2, 0x1, 0x1, }; +vec_u8_t mask16_2={0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x2, }; +vec_u8_t mask16_3={0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x4, 0x3, 0x3, }; +vec_u8_t mask16_4={0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x4, }; +vec_u8_t mask16_5={0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x6, 0x5, 0x5, }; +vec_u8_t mask16_6={0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x6, }; +vec_u8_t mask16_7={0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x8, 0x7, 0x7, }; +vec_u8_t mask16_8={0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x9, 0x8, 0x8, }; +vec_u8_t mask16_9={0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0xa, 0x9, 0x9, }; +vec_u8_t mask16_10={0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, 0xa, }; +vec_u8_t mask16_11={0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xc, 0xb, 0xb, }; +vec_u8_t mask16_12={0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xc, }; +vec_u8_t mask16_13={0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xe, 0xd, 0xd, }; +vec_u8_t mask16_14={0x17, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xe, }; +vec_u8_t mask16_15={0x18, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0x10, 0xf, 0xf, }; + +vec_u8_t maskadd1_31={0x14, 0x13, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xb, 0xa, }; +vec_u8_t maskadd1_16_31={0x9, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x1, 0x0, 0x0, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left0=vec_xl(0, 
srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8}; + vec_u8_t refmask_32_1={0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(76, srcPix0); + vec_u8_t s3 = vec_xl(92, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s1, s1, mask6); + vec_u8_t srv7 = vec_perm(s1, s1, mask7); + vec_u8_t srv8 = vec_perm(s1, s1, mask8); + vec_u8_t srv9 = vec_perm(s1, s1, mask9); + vec_u8_t srv10 = vec_perm(s1, s1, mask10); + vec_u8_t srv11 = vec_perm(s1, s1, mask11); + vec_u8_t srv12= vec_perm(s1, s2, mask12); + vec_u8_t srv13 = vec_perm(s1, s2, mask13); + vec_u8_t srv14 = vec_perm(s1, s2, mask14); + vec_u8_t srv15 = vec_perm(s1, s2, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s1, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s1, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s1, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s1, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s1, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s1, mask16_5); + vec_u8_t srv16_6 = vec_perm(s0, s1, mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s1, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s1, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s1, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s1, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13); + vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15); + + vec_u8_t srv16 = vec_perm(s1, s2, mask0); + vec_u8_t srv17 = vec_perm(s1, s2, mask1); + vec_u8_t srv18 = vec_perm(s1, s2, mask2); + vec_u8_t srv19 = vec_perm(s1, s2, mask3); + vec_u8_t srv20 = vec_perm(s1, s2, mask4); + vec_u8_t srv21 = vec_perm(s1, s2, mask5); + vec_u8_t srv22 = vec_perm(s2, s2, mask6); + vec_u8_t srv23 = vec_perm(s2, s2, mask7); + vec_u8_t srv24 = vec_perm(s2, s2, mask8); + vec_u8_t srv25 = vec_perm(s2, s2, mask9); + vec_u8_t srv26 = vec_perm(s2, s2, mask10); + vec_u8_t srv27 = vec_perm(s2, s2, mask11); + vec_u8_t srv28 = vec_perm(s2, s3, mask12); + vec_u8_t srv29 = vec_perm(s2, s3, mask13); + vec_u8_t srv30 = vec_perm(s2, s3, mask14); + vec_u8_t srv31 = vec_perm(s2, s3, mask15); + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s2, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s2, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s2, mask16_3); + vec_u8_t srv16_20 = vec_perm(s1, s2, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s2, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s2, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s2, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s2, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13); + vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15); + + 
vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s3, maskadd1_31); + + vec_u8_t srv16add1_16 = srv16_17; + vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){11, 22, 1, 12, 23, 2, 13, 24, 3, 14, 25, 4, 15, 26, 5, 16, }; +vec_u8_t vfrac32_1 = (vec_u8_t){27, 6, 17, 28, 7, 18, 29, 8, 19, 30, 9, 20, 31, 10, 21, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 32, }; + + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + 
one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, 
vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + //vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; + //vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + vec_u8_t mask0={0x3, 0x2, 0x1, 0x0, 0x4, 0x3, 0x2, 0x1, 0x5, 0x4, 0x3, 0x2, 0x6, 0x5, 0x4, 0x3}; + vec_u8_t mask1={0x4, 0x3, 0x2, 0x1, 0x5, 0x4, 0x3, 0x2, 0x6, 0x5, 0x4, 0x3, 0x7, 0x6, 0x5, 0x4}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 
0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + //vec_u8_t vfrac4 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24}; + //vec_u8_t vfrac4_32 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8}; + vec_u8_t vfrac4 = (vec_u8_t){6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24, 6, 12, 18, 24, }; + vec_u8_t vfrac4_32 = (vec_u8_t){26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8, 26, 20, 14, 8, }; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x6, 0x5, 0x4, 0x3, 0x2, 0x2, 0x1, 0x0, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, }; + vec_u8_t mask1={0x7, 0x6, 0x5, 0x4, 0x3, 0x3, 0x2, 0x1, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, }; + vec_u8_t mask2={0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, }; + vec_u8_t mask3={0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, }; + vec_u8_t mask4={0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, }; + vec_u8_t mask5={0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, }; + vec_u8_t mask6={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, }; + vec_u8_t mask7={0xd, 
0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, }; + //vec_u8_t mask8={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00}; + + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + +vec_u8_t vfrac8 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 6, 12, 18, 24, 30, 4, 10, 16, }; +vec_u8_t vfrac8_32 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 26, 20, 14, 8, 2, 28, 22, 16, }; + +one_line(srv0, srv1, vfrac8_32, vfrac8, vout_0); +one_line(srv2, srv3, vfrac8_32, vfrac8, vout_1); +one_line(srv4, srv5, vfrac8_32, vfrac8, vout_2); +one_line(srv6, srv7, vfrac8_32, vfrac8, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, }; +vec_u8_t mask1={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x2, 0x1, 0x1, }; +vec_u8_t mask2={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x3, 0x2, 0x2, }; +vec_u8_t mask3={0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, }; +vec_u8_t mask4={0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, }; +vec_u8_t mask5={0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 
0x8, 0x7, 0x6, 0x5, 0x5, }; +vec_u8_t mask6={0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, }; +vec_u8_t mask7={0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, }; +vec_u8_t mask8={0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, }; +vec_u8_t mask9={0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, }; +vec_u8_t mask10={0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, }; +vec_u8_t mask11={0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, }; +vec_u8_t mask12={0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, }; +vec_u8_t mask13={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, }; +vec_u8_t mask14={0x1a, 0x19, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, }; +vec_u8_t mask15={0x1b, 0x1a, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, }; +vec_u8_t maskadd1_15={0x1c, 0x1b, 0x1a, 0x19, 0x18, 0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(36, srcPix0); + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = srv1; + vec_u8_t srv1_add1 = srv2; + vec_u8_t srv2_add1 = srv3; + vec_u8_t srv3_add1 = srv4; + vec_u8_t srv4_add1 = srv5; + vec_u8_t srv5_add1 = srv6; + vec_u8_t srv6_add1 = srv7; + vec_u8_t srv7_add1 = srv8; + vec_u8_t srv8_add1 = srv9; + vec_u8_t srv9_add1 = srv10; + vec_u8_t srv10_add1 = srv11; + vec_u8_t srv11_add1 = srv12; + vec_u8_t srv12_add1= srv13; + vec_u8_t srv13_add1 = srv14; + vec_u8_t srv14_add1 = srv15; + vec_u8_t srv15_add1 = vec_perm(s0, s1, maskadd1_15); + +vec_u8_t vfrac16 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, }; +vec_u8_t vfrac16_32 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 32, }; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32, vfrac16, vout_0); + one_line(srv1, srv1_add1, vfrac16_32, 
vfrac16, vout_1); + one_line(srv2, srv2_add1, vfrac16_32, vfrac16, vout_2); + one_line(srv3, srv3_add1, vfrac16_32, vfrac16, vout_3); + one_line(srv4, srv4_add1, vfrac16_32, vfrac16, vout_4); + one_line(srv5, srv5_add1, vfrac16_32, vfrac16, vout_5); + one_line(srv6, srv6_add1, vfrac16_32, vfrac16, vout_6); + one_line(srv7, srv7_add1, vfrac16_32, vfrac16, vout_7); + one_line(srv8, srv8_add1, vfrac16_32, vfrac16, vout_8); + one_line(srv9, srv9_add1, vfrac16_32, vfrac16, vout_9); + one_line(srv10, srv10_add1, vfrac16_32, vfrac16, vout_10); + one_line(srv11, srv11_add1, vfrac16_32, vfrac16, vout_11); + one_line(srv12, srv12_add1, vfrac16_32, vfrac16, vout_12); + one_line(srv13, srv13_add1, vfrac16_32, vfrac16, vout_13); + one_line(srv14, srv14_add1, vfrac16_32, vfrac16, vout_14); + one_line(srv15, srv15_add1, vfrac16_32, vfrac16, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 17>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, }; +vec_u8_t mask1={0x1a, 0x19, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, }; +vec_u8_t mask2={0x1b, 0x1a, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, }; +vec_u8_t mask3={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, }; +vec_u8_t mask4={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x2, 0x1, 0x1, }; +vec_u8_t mask5={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x3, 0x2, 0x2, }; +vec_u8_t mask6={0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, }; +vec_u8_t mask7={0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, }; +vec_u8_t mask8={0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, }; +vec_u8_t mask9={0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, }; +vec_u8_t mask10={0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, }; +vec_u8_t mask11={0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, }; +vec_u8_t mask12={0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, }; +vec_u8_t mask13={0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, }; +vec_u8_t mask14={0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, }; +vec_u8_t mask15={0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, }; + +vec_u8_t mask16_0={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, }; +vec_u8_t 
mask16_1={0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, 0x4, 0x3, 0x2, 0x1, 0x1, }; +vec_u8_t mask16_2={0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, 0x5, 0x4, 0x3, 0x2, 0x2, }; +vec_u8_t mask16_3={0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, 0x6, 0x5, 0x4, 0x3, 0x3, }; +vec_u8_t mask16_4={0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, }; +vec_u8_t mask16_5={0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, 0x8, 0x7, 0x6, 0x5, 0x5, }; +vec_u8_t mask16_6={0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, 0x9, 0x8, 0x7, 0x6, 0x6, }; +vec_u8_t mask16_7={0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, 0xa, 0x9, 0x8, 0x7, 0x7, }; +vec_u8_t mask16_8={0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, 0xb, 0xa, 0x9, 0x8, 0x8, }; +vec_u8_t mask16_9={0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, 0xc, 0xb, 0xa, 0x9, 0x9, }; +vec_u8_t mask16_10={0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, 0xd, 0xc, 0xb, 0xa, 0xa, }; +vec_u8_t mask16_11={0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, 0xe, 0xd, 0xc, 0xb, 0xb, }; +vec_u8_t mask16_12={0x18, 0x17, 0x16, 0x15, 0x14, 0x14, 0x13, 0x12, 0x11, 0x10, 0x10, 0xf, 0xe, 0xd, 0xc, 0xc, }; +vec_u8_t mask16_13={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, }; +vec_u8_t mask16_14={0x1a, 0x19, 0x18, 0x17, 0x16, 0x16, 0x15, 0x14, 0x13, 0x12, 0x12, 0x11, 0x10, 0xf, 0xe, 0xe, }; +vec_u8_t mask16_15={0x1b, 0x1a, 0x19, 0x18, 0x17, 0x17, 0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x11, 0x10, 0xf, 0xf, }; + +vec_u8_t maskadd1_31={0x19, 0x18, 0x17, 0x16, 0x15, 0x15, 0x14, 0x13, 0x12, 0x11, 0x11, 0x10, 0xf, 0xe, 0xd, 0xd, }; +vec_u8_t maskadd1_16_31={0xc, 0xb, 0xa, 0x9, 0x8, 0x8, 0x7, 0x6, 0x5, 0x4, 0x4, 0x3, 0x2, 0x1, 0x0, 0x0, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc }; + vec_u8_t refmask_32_1={0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(71, srcPix0); + vec_u8_t s3 = vec_xl(87, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s1, s1, mask3); + vec_u8_t srv4 = vec_perm(s1, s1, mask4); + vec_u8_t srv5 = vec_perm(s1, s1, mask5); + vec_u8_t srv6 = vec_perm(s1, s1, mask6); + vec_u8_t srv7 = vec_perm(s1, s2, mask7); + vec_u8_t srv8 = vec_perm(s1, s2, mask8); + vec_u8_t srv9 = vec_perm(s1, s2, mask9); + vec_u8_t srv10 = vec_perm(s1, s2, mask10); + vec_u8_t srv11 = vec_perm(s1, s2, mask11); + vec_u8_t srv12= vec_perm(s1, s2, mask12); + vec_u8_t srv13 = vec_perm(s1, s2, mask13); + vec_u8_t srv14 = vec_perm(s1, s2, mask14); + vec_u8_t srv15 = vec_perm(s1, s2, mask15); + + vec_u8_t srv16_0 = vec_perm(s0, s1, mask16_0); + vec_u8_t srv16_1 = vec_perm(s0, s1, mask16_1); + vec_u8_t srv16_2 = vec_perm(s0, s1, mask16_2); + vec_u8_t srv16_3 = vec_perm(s0, s1, mask16_3); + vec_u8_t srv16_4 = vec_perm(s0, s1, mask16_4); + vec_u8_t srv16_5 = vec_perm(s0, s1, mask16_5); + vec_u8_t srv16_6 = 
vec_perm(s0, s1, mask16_6); + vec_u8_t srv16_7 = vec_perm(s0, s1, mask16_7); + vec_u8_t srv16_8 = vec_perm(s0, s1, mask16_8); + vec_u8_t srv16_9 = vec_perm(s0, s1, mask16_9); + vec_u8_t srv16_10 = vec_perm(s0, s1, mask16_10); + vec_u8_t srv16_11 = vec_perm(s0, s1, mask16_11); + vec_u8_t srv16_12= vec_perm(s0, s1, mask16_12); + vec_u8_t srv16_13 = vec_perm(s0, s1, mask16_13); + vec_u8_t srv16_14 = vec_perm(s0, s1, mask16_14); + vec_u8_t srv16_15 = vec_perm(s0, s1, mask16_15); + + vec_u8_t srv16 = vec_perm(s1, s2, mask0); + vec_u8_t srv17 = vec_perm(s1, s2, mask1); + vec_u8_t srv18 = vec_perm(s1, s2, mask2); + vec_u8_t srv19 = vec_perm(s2, s2, mask3); + vec_u8_t srv20 = vec_perm(s2, s2, mask4); + vec_u8_t srv21 = vec_perm(s2, s2, mask5); + vec_u8_t srv22 = vec_perm(s2, s2, mask6); + vec_u8_t srv23 = vec_perm(s2, s3, mask7); + vec_u8_t srv24 = vec_perm(s2, s3, mask8); + vec_u8_t srv25 = vec_perm(s2, s3, mask9); + vec_u8_t srv26 = vec_perm(s2, s3, mask10); + vec_u8_t srv27 = vec_perm(s2, s3, mask11); + vec_u8_t srv28 = vec_perm(s2, s3, mask12); + vec_u8_t srv29 = vec_perm(s2, s3, mask13); + vec_u8_t srv30 = vec_perm(s2, s3, mask14); + vec_u8_t srv31 = vec_perm(s2, s3, mask15); + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16_0); + vec_u8_t srv16_17 = vec_perm(s1, s2, mask16_1); + vec_u8_t srv16_18 = vec_perm(s1, s2, mask16_2); + vec_u8_t srv16_19 = vec_perm(s1, s2, mask16_3); + vec_u8_t srv16_20 = vec_perm(s1, s2, mask16_4); + vec_u8_t srv16_21 = vec_perm(s1, s2, mask16_5); + vec_u8_t srv16_22 = vec_perm(s1, s2, mask16_6); + vec_u8_t srv16_23 = vec_perm(s1, s2, mask16_7); + vec_u8_t srv16_24 = vec_perm(s1, s2, mask16_8); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask16_9); + vec_u8_t srv16_26 = vec_perm(s1, s2, mask16_10); + vec_u8_t srv16_27 = vec_perm(s1, s2, mask16_11); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask16_12); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask16_13); + vec_u8_t srv16_30 = vec_perm(s1, s2, mask16_14); + vec_u8_t srv16_31 = vec_perm(s1, s2, mask16_15); + + vec_u8_t srv0add1 = srv1; + vec_u8_t srv1add1 = srv2; + vec_u8_t srv2add1 = srv3; + vec_u8_t srv3add1 = srv4; + vec_u8_t srv4add1 = srv5; + vec_u8_t srv5add1 = srv6; + vec_u8_t srv6add1 = srv7; + vec_u8_t srv7add1 = srv8; + vec_u8_t srv8add1 = srv9; + vec_u8_t srv9add1 = srv10; + vec_u8_t srv10add1 = srv11; + vec_u8_t srv11add1 = srv12; + vec_u8_t srv12add1= srv13; + vec_u8_t srv13add1 = srv14; + vec_u8_t srv14add1 = srv15; + vec_u8_t srv15add1 = srv16; + + vec_u8_t srv16add1_0 = srv16_1; + vec_u8_t srv16add1_1 = srv16_2; + vec_u8_t srv16add1_2 = srv16_3; + vec_u8_t srv16add1_3 = srv16_4; + vec_u8_t srv16add1_4 = srv16_5; + vec_u8_t srv16add1_5 = srv16_6; + vec_u8_t srv16add1_6 = srv16_7; + vec_u8_t srv16add1_7 = srv16_8; + vec_u8_t srv16add1_8 = srv16_9; + vec_u8_t srv16add1_9 = srv16_10; + vec_u8_t srv16add1_10 = srv16_11; + vec_u8_t srv16add1_11 = srv16_12; + vec_u8_t srv16add1_12= srv16_13; + vec_u8_t srv16add1_13 = srv16_14; + vec_u8_t srv16add1_14 = srv16_15; + vec_u8_t srv16add1_15 = srv16_16; + + vec_u8_t srv16add1 = srv17; + vec_u8_t srv17add1 = srv18; + vec_u8_t srv18add1 = srv19; + vec_u8_t srv19add1 = srv20; + vec_u8_t srv20add1 = srv21; + vec_u8_t srv21add1 = srv22; + vec_u8_t srv22add1 = srv23; + vec_u8_t srv23add1 = srv24; + vec_u8_t srv24add1 = srv25; + vec_u8_t srv25add1 = srv26; + vec_u8_t srv26add1 = srv27; + vec_u8_t srv27add1 = srv28; + vec_u8_t srv28add1 = srv29; + vec_u8_t srv29add1 = srv30; + vec_u8_t srv30add1 = srv31; + vec_u8_t srv31add1 = vec_perm(s2, s3, maskadd1_31); + + vec_u8_t srv16add1_16 = 
srv16_17; + vec_u8_t srv16add1_17 = srv16_18; + vec_u8_t srv16add1_18 = srv16_19; + vec_u8_t srv16add1_19 = srv16_20; + vec_u8_t srv16add1_20 = srv16_21; + vec_u8_t srv16add1_21 = srv16_22; + vec_u8_t srv16add1_22 = srv16_23; + vec_u8_t srv16add1_23 = srv16_24; + vec_u8_t srv16add1_24 = srv16_25; + vec_u8_t srv16add1_25 = srv16_26; + vec_u8_t srv16add1_26 = srv16_27; + vec_u8_t srv16add1_27 = srv16_28; + vec_u8_t srv16add1_28 = srv16_29; + vec_u8_t srv16add1_29 = srv16_30; + vec_u8_t srv16add1_30 = srv16_31; + vec_u8_t srv16add1_31 = vec_perm(s2, s2, maskadd1_16_31); + +vec_u8_t vfrac32_0 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, }; +vec_u8_t vfrac32_1 = (vec_u8_t){6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, }; +vec_u8_t vfrac32_32_0 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 32, }; +vec_u8_t vfrac32_32_1 = (vec_u8_t){26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 32, }; + + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv1, srv1add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_1, srv16add1_1, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv2, srv2add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_2, srv16add1_2, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv3, srv3add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_3, srv16add1_3, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv4, srv4add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_4, srv16add1_4, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv5, srv5add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_5, srv16add1_5, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv6, srv6add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_6, srv16add1_6, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv7, srv7add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_7, srv16add1_7, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv8, srv8add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_8, srv16add1_8, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv9, srv9add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_9, srv16add1_9, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv10, srv10add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_10, srv16add1_10, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv11, srv11add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_11, srv16add1_11, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv12, srv12add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_12, srv16add1_12, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv13, srv13add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_13, srv16add1_13, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv14, srv14add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_14, srv16add1_14, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv15, srv15add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_15, srv16add1_15, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 
dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac32_32_0, vfrac32_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac32_32_1, vfrac32_1, vout_1); + + one_line(srv17, srv17add1, vfrac32_32_0, vfrac32_0, vout_2); + one_line(srv16_17, srv16add1_17, vfrac32_32_1, vfrac32_1, vout_3); + + one_line(srv18, srv18add1, vfrac32_32_0, vfrac32_0, vout_4); + one_line(srv16_18, srv16add1_18, vfrac32_32_1, vfrac32_1, vout_5); + + one_line(srv19, srv19add1, vfrac32_32_0, vfrac32_0, vout_6); + one_line(srv16_19, srv16add1_19, vfrac32_32_1, vfrac32_1, vout_7); + + one_line(srv20, srv20add1, vfrac32_32_0, vfrac32_0, vout_8); + one_line(srv16_20, srv16add1_20, vfrac32_32_1, vfrac32_1, vout_9); + + one_line(srv21, srv21add1, vfrac32_32_0, vfrac32_0, vout_10); + one_line(srv16_21, srv16add1_21, vfrac32_32_1, vfrac32_1, vout_11); + + one_line(srv22, srv22add1, vfrac32_32_0, vfrac32_0, vout_12); + one_line(srv16_22, srv16add1_22, vfrac32_32_1, vfrac32_1, vout_13); + + one_line(srv23, srv23add1, vfrac32_32_0, vfrac32_0, vout_14); + one_line(srv16_23, srv16add1_23, vfrac32_32_1, vfrac32_1, vout_15); + + one_line(srv24, srv24add1, vfrac32_32_0, vfrac32_0, vout_16); + one_line(srv16_24, srv16add1_24, vfrac32_32_1, vfrac32_1, vout_17); + + one_line(srv25, srv25add1, vfrac32_32_0, vfrac32_0, vout_18); + one_line(srv16_25, srv16add1_25, vfrac32_32_1, vfrac32_1, vout_19); + + one_line(srv26, srv26add1, vfrac32_32_0, vfrac32_0, vout_20); + one_line(srv16_26, srv16add1_26, vfrac32_32_1, vfrac32_1, vout_21); + + one_line(srv27, srv27add1, vfrac32_32_0, vfrac32_0, vout_22); + one_line(srv16_27, srv16add1_27, vfrac32_32_1, vfrac32_1, vout_23); + + one_line(srv28, srv28add1, vfrac32_32_0, vfrac32_0, vout_24); + one_line(srv16_28, srv16add1_28, vfrac32_32_1, vfrac32_1, vout_25); + + one_line(srv29, srv29add1, vfrac32_32_0, vfrac32_0, vout_26); + one_line(srv16_29, srv16add1_29, vfrac32_32_1, vfrac32_1, vout_27); + + one_line(srv30, srv30add1, vfrac32_32_0, vfrac32_0, vout_28); + one_line(srv16_30, srv16add1_30, vfrac32_32_1, vfrac32_1, vout_29); + + one_line(srv31, srv31add1, vfrac32_32_0, vfrac32_0, vout_30); + one_line(srv16_31, srv16add1_31, vfrac32_32_1, vfrac32_1, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, 
dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + //vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; + //vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + + + vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_4={0x3, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + //vec_u8_t srv1 = vec_perm(srv, srv, mask1); + //vec_u8_t vfrac4 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + //vec_u8_t vfrac4_32 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + if(dstStride==4){ + vec_xst(srv0, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)srv0, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(srv0, srv0, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(srv0, srv0, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(srv0, srv0, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srv0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srv0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srv0, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srv0, 
vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +//vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, }; +vec_u8_t mask2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +//vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, }; +vec_u8_t mask4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +//vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +//vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + //vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + //vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + //vec_u8_t vout_0, vout_1, vout_2, vout_3; + //vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x3, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + //vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + //vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + //vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + //vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + if(dstStride==8){ + vec_xst(srv0, 0, dst); + vec_xst(srv2, 16, dst); + vec_xst(srv4, 32, dst); + vec_xst(srv6, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srv0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srv0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(srv2, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srv2, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(srv4, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(srv4, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(srv6, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(srv6, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int 
bFilter) +{ +vec_u8_t mask0={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask1={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask2={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask3={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask4={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask6={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask7={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask8={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xe, 0xd, 0xc, 0xb, 0xa, 0x9, 0x8, 0x7, 0x6, 0x5, 0x4, 0x3, 0x2, 0x1, 0x10}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(1, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = s0; + + + vec_xst(srv0, 0, dst); + vec_xst(srv1, dstStride, dst); + vec_xst(srv2, dstStride*2, dst); + vec_xst(srv3, dstStride*3, dst); + vec_xst(srv4, dstStride*4, dst); + vec_xst(srv5, dstStride*5, dst); + vec_xst(srv6, dstStride*6, dst); + vec_xst(srv7, dstStride*7, dst); + vec_xst(srv8, dstStride*8, dst); + vec_xst(srv9, dstStride*9, dst); + vec_xst(srv10, dstStride*10, dst); + vec_xst(srv11, dstStride*11, dst); + vec_xst(srv12, dstStride*12, dst); + vec_xst(srv13, dstStride*13, dst); + vec_xst(srv14, dstStride*14, dst); + vec_xst(srv15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} 
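// Editor's note (not part of the upstream patch): every one_line() call in these
// AltiVec intra_pred specializations evaluates the same HEVC angular-prediction
// expression that the inline comments quote:
//     dst[y * dstStride + x] = ((32 - frac[y]) * ref[off[y] + x]
//                               + frac[y] * ref[off[y] + x + 1] + 16) >> 5
// The vectorized form splits the 16 source bytes into even/odd 16-bit lanes with
// vec_mule/vec_mulo, adds the rounding constant 16, shifts right by 5, and
// re-interleaves the results with vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)),
// as the expanded code in intra_pred<4, 19> below shows; one_line itself is a helper
// presumably defined earlier in this file, outside this hunk. A minimal scalar
// sketch of one such 16-pixel row follows; the helper name and parameters here are
// hypothetical and only illustrate the arithmetic, they do not exist in x265.

#include <stdint.h>

static inline void one_row_scalar(uint8_t* dstRow,       /* dst + y * dstStride   */
                                  const uint8_t* refRow, /* ref + off[y]          */
                                  int frac,              /* frac[y], range 0..31  */
                                  int width)             /* 16 per one_line() call */
{
    for (int x = 0; x < width; x++)
        dstRow[x] = (uint8_t)(((32 - frac) * refRow[x] + frac * refRow[x + 1] + 16) >> 5);
}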
+ +template<> +void intra_pred<32, 18>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask1={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask2={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask3={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask4={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask6={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask7={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask8={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; + + //vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + //vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t refmask_32_0 = {0x1f, 0x1e, 0x1d, 0x1c, 0x1b, 0x1a, 0x19, 0x18, 0x17, 0x16, 0x15, 0x14, 0x13, 0x12, 0x11, 0x10}; + vec_u8_t refmask_32_1 = {0xf, 0xe, 0xd, 0xc, 0xb, 0xa, 0x9, 0x8, 0x7, 0x6, 0x5, 0x4, 0x3, 0x2, 0x1, 0x10}; + + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(1, srcPix0); + vec_u8_t s3 = vec_xl(17, srcPix0); + + vec_u8_t srv0 = vec_perm(s1, s2, mask0); + vec_u8_t srv1 = vec_perm(s1, s2, mask1); + vec_u8_t srv2 = vec_perm(s1, s2, mask2); + vec_u8_t srv3 = vec_perm(s1, s2, mask3); + vec_u8_t srv4 = vec_perm(s1, s2, mask4); + vec_u8_t srv5 = vec_perm(s1, s2, mask5); + vec_u8_t srv6 = vec_perm(s1, s2, mask6); + vec_u8_t srv7 = vec_perm(s1, s2, mask7); + vec_u8_t srv8 = vec_perm(s1, s2, mask8); + vec_u8_t srv9 = vec_perm(s1, s2, mask9); + vec_u8_t srv10 = vec_perm(s1, s2, mask10); + vec_u8_t srv11 = vec_perm(s1, s2, mask11); + vec_u8_t srv12= vec_perm(s1, s2, mask12); + vec_u8_t srv13 = vec_perm(s1, s2, mask13); + vec_u8_t srv14 = vec_perm(s1, s2, mask14); + vec_u8_t srv15 = s1; + + vec_u8_t srv16_0 = vec_perm(s2, s3, mask0); + vec_u8_t srv16_1 = vec_perm(s2, s3, mask1); + vec_u8_t srv16_2 = vec_perm(s2, s3, mask2); + vec_u8_t srv16_3 = vec_perm(s2, s3, mask3); + vec_u8_t srv16_4 = vec_perm(s2, s3, mask4); + vec_u8_t srv16_5 = vec_perm(s2, s3, mask5); + vec_u8_t srv16_6 = vec_perm(s2, s3, mask6); + vec_u8_t srv16_7 = vec_perm(s2, s3, mask7); + vec_u8_t srv16_8 = vec_perm(s2, s3, mask8); + vec_u8_t srv16_9 = vec_perm(s2, s3, mask9); + vec_u8_t 
srv16_10 = vec_perm(s2, s3, mask10); + vec_u8_t srv16_11 = vec_perm(s2, s3, mask11); + vec_u8_t srv16_12= vec_perm(s2, s3, mask12); + vec_u8_t srv16_13 = vec_perm(s2, s3, mask13); + vec_u8_t srv16_14 = vec_perm(s2, s3, mask14); + vec_u8_t srv16_15 = s2; + + //0(1,2),1,1,3,4,4,6(1),7(0,1),7,9,10,10,12,13,13,15,16,16,18,19,19,21,22,22,24,25,25,27,28,28,30,30 + + vec_u8_t srv16 = vec_perm(s0, s1, mask0); + vec_u8_t srv17 = vec_perm(s0, s1, mask1); + vec_u8_t srv18 = vec_perm(s0, s1, mask2); + vec_u8_t srv19 = vec_perm(s0, s1, mask3); + vec_u8_t srv20 = vec_perm(s0, s1, mask4); + vec_u8_t srv21 = vec_perm(s0, s1, mask5); + vec_u8_t srv22 = vec_perm(s0, s1, mask6); + vec_u8_t srv23 = vec_perm(s0, s1, mask7); + vec_u8_t srv24 = vec_perm(s0, s1, mask8); + vec_u8_t srv25 = vec_perm(s0, s1, mask9); + vec_u8_t srv26 = vec_perm(s0, s1, mask10); + vec_u8_t srv27 = vec_perm(s0, s1, mask11); + vec_u8_t srv28 = vec_perm(s0, s1, mask12); + vec_u8_t srv29 = vec_perm(s0, s1, mask13); + vec_u8_t srv30 = vec_perm(s0, s1, mask14); + vec_u8_t srv31 = s0; + + vec_xst(srv0, 0, dst); + vec_xst(srv16_0, 16, dst); + vec_xst(srv1, dstStride, dst); + vec_xst(srv16_1, dstStride+16, dst); + vec_xst(srv2, dstStride*2, dst); + vec_xst(srv16_2, dstStride*2+16, dst); + vec_xst(srv3, dstStride*3, dst); + vec_xst(srv16_3, dstStride*3+16, dst); + vec_xst(srv4, dstStride*4, dst); + vec_xst(srv16_4, dstStride*4+16, dst); + vec_xst(srv5, dstStride*5, dst); + vec_xst(srv16_5, dstStride*5+16, dst); + vec_xst(srv6, dstStride*6, dst); + vec_xst(srv16_6, dstStride*6+16, dst); + vec_xst(srv7, dstStride*7, dst); + vec_xst(srv16_7, dstStride*7+16, dst); + vec_xst(srv8, dstStride*8, dst); + vec_xst(srv16_8, dstStride*8+16, dst); + vec_xst(srv9, dstStride*9, dst); + vec_xst(srv16_9, dstStride*9+16, dst); + vec_xst(srv10, dstStride*10, dst); + vec_xst(srv16_10, dstStride*10+16, dst); + vec_xst(srv11, dstStride*11, dst); + vec_xst(srv16_11, dstStride*11+16, dst); + vec_xst(srv12, dstStride*12, dst); + vec_xst(srv16_12, dstStride*12+16, dst); + vec_xst(srv13, dstStride*13, dst); + vec_xst(srv16_13, dstStride*13+16, dst); + vec_xst(srv14, dstStride*14, dst); + vec_xst(srv16_14, dstStride*14+16, dst); + vec_xst(srv15, dstStride*15, dst); + vec_xst(srv16_15, dstStride*15+16, dst); + + vec_xst(srv16, dstStride*16, dst); + vec_xst(srv0, dstStride*16+16, dst); + vec_xst(srv17, dstStride*17, dst); + vec_xst(srv1, dstStride*17+16, dst); + vec_xst(srv18, dstStride*18, dst); + vec_xst(srv2, dstStride*18+16, dst); + vec_xst(srv19, dstStride*19, dst); + vec_xst(srv3, dstStride*19+16, dst); + vec_xst(srv20, dstStride*20, dst); + vec_xst(srv4, dstStride*20+16, dst); + vec_xst(srv21, dstStride*21, dst); + vec_xst(srv5, dstStride*21+16, dst); + vec_xst(srv22, dstStride*22, dst); + vec_xst(srv6, dstStride*22+16, dst); + vec_xst(srv23, dstStride*23, dst); + vec_xst(srv7, dstStride*23+16, dst); + vec_xst(srv24, dstStride*24, dst); + vec_xst(srv8, dstStride*24+16, dst); + vec_xst(srv25, dstStride*25, dst); + vec_xst(srv9, dstStride*25+16, dst); + vec_xst(srv26, dstStride*26, dst); + vec_xst(srv10, dstStride*26+16, dst); + vec_xst(srv27, dstStride*27, dst); + vec_xst(srv11, dstStride*27+16, dst); + vec_xst(srv28, dstStride*28, dst); + vec_xst(srv12, dstStride*28+16, dst); + vec_xst(srv29, dstStride*29, dst); + vec_xst(srv13, dstStride*29+16, dst); + vec_xst(srv30, dstStride*30, dst); + vec_xst(srv14, dstStride*30+16, dst); + vec_xst(srv31, dstStride*31, dst); + vec_xst(srv15, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x 
= 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + +vec_u8_t vfrac4 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24}; +vec_u8_t vfrac4_32 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = 
{0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, }; +vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; +vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + + /* fraction[0-7] */ +vec_u8_t vfrac8_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-7] */ +vec_u8_t vfrac8_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 
28, 28}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask1={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 
0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16 ={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(4, srcPix0); + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =srv4; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0; + vec_u8_t srv2_add1 = srv1; + vec_u8_t srv3_add1 = srv2; + vec_u8_t srv4_add1 = srv3; + vec_u8_t srv5_add1 = srv3; + vec_u8_t srv6_add1 = srv4; + vec_u8_t srv7_add1 = srv6; + vec_u8_t srv8_add1 = srv7; + vec_u8_t srv9_add1 = srv8; + vec_u8_t srv10_add1 = srv8; + vec_u8_t srv11_add1 = srv9; + vec_u8_t srv12_add1= srv11; + vec_u8_t srv13_add1 = srv12; + vec_u8_t srv14_add1 = srv13; + vec_u8_t srv15_add1 = srv13; + + + /* fraction[0-15] */ +vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32- fraction[0-15] */ +vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 
14}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 19>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 
0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask11={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask12={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask13={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask14={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +//vec_u8_t mask15={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask16={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; + +vec_u8_t mask17={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask18={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask19={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask20={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask21={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask22={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask23={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask24={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask25={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask26={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask27={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask28={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t 
u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t refmask_32_0 ={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc}; + vec_u8_t refmask_32_1 = {0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(7, srcPix0); + vec_u8_t s3 = vec_xl(16+7, srcPix0); + + vec_u8_t srv0 = vec_perm(s1, s2, mask0); + vec_u8_t srv1 = vec_perm(s1, s2, mask1); + vec_u8_t srv2 = vec_perm(s1, s2, mask2); + vec_u8_t srv3 = vec_perm(s1, s2, mask3); + vec_u8_t srv4 = vec_perm(s1, s2, mask4); + vec_u8_t srv5 =srv4; + vec_u8_t srv6 = vec_perm(s1, s2, mask6); + vec_u8_t srv7 = vec_perm(s1, s2, mask7); + vec_u8_t srv8 = vec_perm(s1, s2, mask8); + vec_u8_t srv9 = vec_perm(s1, s2, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = s1; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv16_0 = vec_perm(s2, s3, mask0); + vec_u8_t srv16_1 = vec_perm(s2, s3, mask1); + vec_u8_t srv16_2 = vec_perm(s2, s3, mask2); + vec_u8_t srv16_3 = vec_perm(s2, s3, mask3); + vec_u8_t srv16_4 = vec_perm(s2, s3, mask4); + vec_u8_t srv16_5 =srv16_4; + vec_u8_t srv16_6 = vec_perm(s2, s3, mask6); + vec_u8_t srv16_7 = vec_perm(s2, s3, mask7); + vec_u8_t srv16_8 = vec_perm(s2, s3, mask8); + vec_u8_t srv16_9 = vec_perm(s2, s3, mask9); + vec_u8_t srv16_10 = srv16_9; + vec_u8_t srv16_11 = s2; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = vec_perm(s1, s2, mask13); + vec_u8_t srv16_14 = vec_perm(s1, s2, mask14); + vec_u8_t srv16_15 = srv16_14; + //0,1,2,3,4,4,6,7,8,9,9(1,2),11(1),12(0,1),13,14,14,15,16,17,18,19,20,20,22,23,24,25,25,27,28,29,30(0),30, + + vec_u8_t srv16 = vec_perm(s0, s1, mask16); + vec_u8_t srv17 = vec_perm(s0, s1, mask17); + vec_u8_t srv18 = vec_perm(s0, s1, mask18); + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = vec_perm(s0, s1, mask20); + vec_u8_t srv21 = srv20; + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = vec_perm(s0, s1, mask23); + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = vec_perm(s0, s1, mask25); + vec_u8_t srv26 = srv25; + vec_u8_t srv27 = vec_perm(s0, s1, mask27); + vec_u8_t srv28 = vec_perm(s0, s1, mask28); + vec_u8_t srv29 = vec_perm(s0, s1, mask29); + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16); + vec_u8_t srv16_17 = vec_perm(s1, s2, mask17); + vec_u8_t srv16_18 = vec_perm(s1, s2, mask18); + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = vec_perm(s1, s2, mask20); + vec_u8_t srv16_21 = srv16_20; + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = vec_perm(s1, s2, mask23); + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask25); + vec_u8_t srv16_26 = srv16_25; + vec_u8_t srv16_27 = vec_perm(s1, s2, mask27); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask28); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask29); + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv1add1 = srv0; + vec_u8_t srv2add1 = srv1; + vec_u8_t srv3add1 = srv2; + vec_u8_t srv4add1 = srv3; + vec_u8_t srv5add1 = srv3; + vec_u8_t srv6add1 = 
srv4; + vec_u8_t srv7add1 = srv6; + vec_u8_t srv8add1 = srv7; + vec_u8_t srv9add1 = srv8; + vec_u8_t srv10add1 = srv8; + vec_u8_t srv11add1 = srv9; + vec_u8_t srv12add1= srv11; + vec_u8_t srv13add1 = srv12; + vec_u8_t srv14add1 = srv13; + vec_u8_t srv15add1 = srv13; + + //0(1,2),1,2,3,3.4,6,7,8,8,9,11(1),12(0,1),13,13,14,16, 17, 18,19,19,20,22,26,24,24,25,27,28,29,29, + + vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0); + vec_u8_t srv16add1_1 = srv16_0; + vec_u8_t srv16add1_2 = srv16_1; + vec_u8_t srv16add1_3 = srv16_2; + vec_u8_t srv16add1_4 = srv16_3; + vec_u8_t srv16add1_5 = srv16_3; + vec_u8_t srv16add1_6 = srv16_4; + vec_u8_t srv16add1_7 = srv16_6; + vec_u8_t srv16add1_8 = srv16_7; + vec_u8_t srv16add1_9 = srv16_8; + vec_u8_t srv16add1_10 = srv16_8; + vec_u8_t srv16add1_11 = srv16_9; + vec_u8_t srv16add1_12= srv16_11; + vec_u8_t srv16add1_13 = srv16_12; + vec_u8_t srv16add1_14 = srv16_13; + vec_u8_t srv16add1_15 = srv16_13; + + vec_u8_t srv16add1 = srv14; + vec_u8_t srv17add1 = srv16; + vec_u8_t srv18add1 = srv17; + vec_u8_t srv19add1 = srv18; + vec_u8_t srv20add1 = srv19; + vec_u8_t srv21add1 = srv19; + vec_u8_t srv22add1 = srv20; + vec_u8_t srv23add1 = srv22; + vec_u8_t srv24add1 = srv23; + vec_u8_t srv25add1 = srv24; + vec_u8_t srv26add1 = srv24; + vec_u8_t srv27add1 = srv25; + vec_u8_t srv28add1 = srv27; + vec_u8_t srv29add1 = srv28; + vec_u8_t srv30add1 = srv29; + vec_u8_t srv31add1 = srv29; + + vec_u8_t srv16add1_16 = srv16_14; + vec_u8_t srv16add1_17 = srv16_16; + vec_u8_t srv16add1_18 = srv16_17; + vec_u8_t srv16add1_19 = srv16_18; + vec_u8_t srv16add1_20 = srv16_19; + vec_u8_t srv16add1_21 = srv16_19; + vec_u8_t srv16add1_22 = srv16_20; + vec_u8_t srv16add1_23 = srv16_22; + vec_u8_t srv16add1_24 = srv16_23; + vec_u8_t srv16add1_25 = srv16_24; + vec_u8_t srv16add1_26 = srv16_24; + vec_u8_t srv16add1_27 = srv16_25; + vec_u8_t srv16add1_28 = srv16_27; + vec_u8_t srv16add1_29 = srv16_28; + vec_u8_t srv16add1_30 = srv16_29; + vec_u8_t srv16add1_31 = srv16_29; + +vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 
26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, 
vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_28, 
srv16add1_28, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t 
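/* Editor's sketch (not part of the patch; relies on the surrounding file's
 * `pixel` typedef): every vector path in this hunk implements the angular
 * interpolation spelled out in the inline comments,
 *   dst[y*dstStride + x] = ((32 - fract[y]) * ref[offset[y] + x]
 *                           + fract[y] * ref[offset[y] + x + 1] + 16) >> 5,
 * with per-row offset/fraction tables. A plain scalar reference form
 * (the `offset`/`fract` parameter names are illustrative) would be: */
static inline void angular_pred_scalar_ref(pixel* dst, intptr_t dstStride,
                                           const pixel* ref, const int* offset,
                                           const int* fract, int width)
{
    for (int y = 0; y < width; y++)
        for (int x = 0; x < width; x++)
            dst[y * dstStride + x] = (pixel)(((32 - fract[y]) * ref[offset[y] + x]
                                              + fract[y] * ref[offset[y] + x + 1] + 16) >> 5);
}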
srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_4={0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + +vec_u8_t vfrac4 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12}; +vec_u8_t vfrac4_32 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, }; +vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 
0x8, 0x9, 0xa, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24}; + +vec_u8_t vfrac8_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 20>(pixel* dst, 
intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask1={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask2={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask3={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t maskadd1_0={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +/*vec_u8_t maskadd1_1={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t maskadd1_2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t maskadd1_3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t maskadd1_4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_9={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t 
u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(6, srcPix0); + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =srv4; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0; + vec_u8_t srv2_add1 = srv0; + vec_u8_t srv3_add1 = srv1; + vec_u8_t srv4_add1 = srv3; + vec_u8_t srv5_add1 = srv3; + vec_u8_t srv6_add1 = srv4; + vec_u8_t srv7_add1 = srv6; + vec_u8_t srv8_add1 = srv6; + vec_u8_t srv9_add1 = srv7; + vec_u8_t srv10_add1 = srv9; + vec_u8_t srv11_add1 = srv9; + vec_u8_t srv12_add1= srv10; + vec_u8_t srv13_add1 = srv12; + vec_u8_t srv14_add1 = srv12; + vec_u8_t srv15_add1 = srv13; +vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 20>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 
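/* Editor's sketch (not part of the patch): the vfrac16_N / vfrac16_32_N splat
 * constants are the per-row deltaFract values of HEVC angular prediction and
 * their 32-complements; intraPredAngle is -21 for mode 20 and -17 for mode 21,
 * which reproduces the 11,22,1,12,... and 15,30,13,28,... sequences used in
 * this hunk. They can be derived as follows (illustrative helper only): */
static inline void derive_row_fractions(int intraPredAngle, int width,
                                        int* fract, int* fract32, int* offset)
{
    for (int y = 0; y < width; y++)
    {
        int deltaPos = (y + 1) * intraPredAngle;  /* negative for these modes     */
        fract[y]   = deltaPos & 31;               /* splatted into vfrac16_<y>    */
        fract32[y] = 32 - fract[y];               /* splatted into vfrac16_32_<y> */
        offset[y]  = deltaPos >> 5;               /* row's reference offset       */
    }
}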
0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask7={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +//vec_u8_t mask8={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask9={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask10={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +//vec_u8_t mask11={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask13={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +//vec_u8_t mask14={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask15={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; + +vec_u8_t mask16={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask17={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask18={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask19={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask20={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask21={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask22={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask23={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask24={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask26={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t refmask_32_0 = {0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8, }; + vec_u8_t 
refmask_32_1 = {0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(12, srcPix0); + vec_u8_t s3 = vec_xl(16+12, srcPix0); + + vec_u8_t srv0 = vec_perm(s1, s2, mask0); + vec_u8_t srv1 = vec_perm(s1, s2, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s1, s2, mask3); + vec_u8_t srv4 = vec_perm(s1, s2, mask4); + vec_u8_t srv5 = srv4; + vec_u8_t srv6 = s1; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s2, s3, mask0); + vec_u8_t srv16_1 = vec_perm(s2, s3, mask1); + vec_u8_t srv16_2 = srv16_1; + vec_u8_t srv16_3 = vec_perm(s2, s3, mask3); + vec_u8_t srv16_4 = vec_perm(s2, s3, mask4); + vec_u8_t srv16_5 = srv16_4; + vec_u8_t srv16_6 = s2; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = vec_perm(s1, s2, mask9); + vec_u8_t srv16_10 = vec_perm(s1, s2, mask10); + vec_u8_t srv16_11 = srv16_10; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = vec_perm(s1, s2, mask13); + vec_u8_t srv16_14 = srv16_13; + vec_u8_t srv16_15 = vec_perm(s1, s2, mask15); + + //0(1,2),1,1,3,4,4,6(1),7(0,1),7,9,10,10,12,13,13,15,16,16,18,19,19,21,22,22,24,25,25,27,28,28,30,30 + + vec_u8_t srv16 = vec_perm(s0, s1, mask16); + vec_u8_t srv17 = srv16; + vec_u8_t srv18 = vec_perm(s0, s1, mask18); + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = srv19; + vec_u8_t srv21 = vec_perm(s0, s1, mask21); + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = srv22; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = vec_perm(s0, s1, mask25); + vec_u8_t srv26 = srv25; + vec_u8_t srv27 = vec_perm(s0, s1, mask27); + vec_u8_t srv28 = vec_perm(s0, s1, mask28); + vec_u8_t srv29 = srv28; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16); + vec_u8_t srv16_17 = srv16_16; + vec_u8_t srv16_18 = vec_perm(s1, s2, mask18); + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = srv16_19; + vec_u8_t srv16_21 = vec_perm(s1, s2, mask21); + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = srv16_22; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask25); + vec_u8_t srv16_26 = srv16_25; + vec_u8_t srv16_27 = vec_perm(s1, s2, mask27); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask28); + vec_u8_t srv16_29 = srv16_28; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv1add1 = srv0; + vec_u8_t srv2add1 = srv0; + vec_u8_t srv3add1 = srv1; + vec_u8_t srv4add1 = srv3; + vec_u8_t srv5add1 = srv3; + vec_u8_t srv6add1 = srv4; + vec_u8_t srv7add1 = s1; + vec_u8_t srv8add1 = s1; + vec_u8_t srv9add1 = srv7; + vec_u8_t srv10add1 = srv9; + vec_u8_t srv11add1 = srv9; + vec_u8_t srv12add1= srv10; + vec_u8_t srv13add1 = srv12; + vec_u8_t srv14add1 = srv12; + vec_u8_t srv15add1 = srv13; + + vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0); + vec_u8_t 
srv16add1_1 = srv16_0; + vec_u8_t srv16add1_2 = srv16_0; + vec_u8_t srv16add1_3 = srv16_1; + vec_u8_t srv16add1_4 = srv16_3; + vec_u8_t srv16add1_5 = srv16_3; + vec_u8_t srv16add1_6 = srv16_4; + vec_u8_t srv16add1_7 = s2; + vec_u8_t srv16add1_8 = s2; + vec_u8_t srv16add1_9 = srv16_7; + vec_u8_t srv16add1_10 = srv16_9; + vec_u8_t srv16add1_11 = srv16_9; + vec_u8_t srv16add1_12= srv16_10; + vec_u8_t srv16add1_13 = srv16_12; + vec_u8_t srv16add1_14 = srv16_12; + vec_u8_t srv16add1_15 = srv16_13; + + //0,0,1,3,3,4,6(0),6,7,9,9,10,12,12,13,15,15,16,18,18,19,21,21,22,24,24,25,27,27,28,28 + + vec_u8_t srv16add1 = srv15; + vec_u8_t srv17add1 = srv15; + vec_u8_t srv18add1 = srv16; + vec_u8_t srv19add1 = srv18; + vec_u8_t srv20add1 = srv18; + vec_u8_t srv21add1 = srv19; + vec_u8_t srv22add1 = srv21; + vec_u8_t srv23add1 = srv21; + vec_u8_t srv24add1 = srv22; + vec_u8_t srv25add1 = srv24; + vec_u8_t srv26add1 = srv24; + vec_u8_t srv27add1 = srv25; + vec_u8_t srv28add1 = srv27; + vec_u8_t srv29add1 = srv27; + vec_u8_t srv30add1 = srv28; + vec_u8_t srv31add1 = srv28; + + vec_u8_t srv16add1_16 = srv16_15; + vec_u8_t srv16add1_17 = srv16_15; + vec_u8_t srv16add1_18 = srv16_16; + vec_u8_t srv16add1_19 = srv16_18; + vec_u8_t srv16add1_20 = srv16_18; + vec_u8_t srv16add1_21 = srv16_19; + vec_u8_t srv16add1_22 = srv16_21; + vec_u8_t srv16add1_23 = srv16_21; + vec_u8_t srv16add1_24 = srv16_22; + vec_u8_t srv16add1_25 = srv16_24; + vec_u8_t srv16add1_26 = srv16_24; + vec_u8_t srv16add1_27 = srv16_25; + vec_u8_t srv16add1_28 = srv16_27; + vec_u8_t srv16add1_29 = srv16_27; + vec_u8_t srv16add1_30 = srv16_28; + vec_u8_t srv16add1_31 = srv16_28; + +vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_20 = 
(vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_25 = 
(vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, 
dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, 
dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + + + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_4={0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); +vec_u8_t vfrac4 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28}; +vec_u8_t vfrac4_32 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) 
>> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t 
srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac8_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask4={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask7={0x4, 0x5, 0x6, 0x7, 0x8, 
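/* Editor's sketch (not part of the patch): one_line() is a helper macro defined
 * earlier in this source file and not visible in this hunk. Judging from the
 * open-coded 4x4 paths above, one_line(ref, refp1, f32, f, out) performs the
 * same even/odd widening multiplies, rounding add and narrowing pack; a
 * plausible equivalent (u16_16 and u16_5 come from the enclosing scope) is: */
#define one_line_sketch(ref, refp1, f32, f, out) \
{ \
    vec_u16_t e0 = vec_mule(ref, f32), o0 = vec_mulo(ref, f32);      /* (32-fract) * ref[off+x]   */ \
    vec_u16_t e1 = vec_mule(refp1, f), o1 = vec_mulo(refp1, f);      /* fract * ref[off+x+1]      */ \
    vec_u16_t se = vec_sra(vec_add(vec_add(e0, e1), u16_16), u16_5); /* even byte lanes, +16, >>5 */ \
    vec_u16_t so = vec_sra(vec_add(vec_add(o0, o1), u16_16), u16_5); /* odd byte lanes            */ \
    (out) = vec_pack(vec_mergeh(se, so), vec_mergel(se, so));        /* re-interleave into bytes  */ \
}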
0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +/*vec_u8_t maskadd1_1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_5={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(8, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = srv5; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= srv11; + vec_u8_t srv13 = vec_perm(s0, s1, 
mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0; + vec_u8_t srv2_add1 = srv0; + vec_u8_t srv3_add1 = srv1; + vec_u8_t srv4_add1 = srv1; + vec_u8_t srv5_add1 = srv3; + vec_u8_t srv6_add1 = srv3; + vec_u8_t srv7_add1 = srv5; + vec_u8_t srv8_add1 = srv5; + vec_u8_t srv9_add1 = srv7; + vec_u8_t srv10_add1 = srv7; + vec_u8_t srv11_add1 = srv9; + vec_u8_t srv12_add1= srv9; + vec_u8_t srv13_add1 = srv11; + vec_u8_t srv14_add1 = srv11; + vec_u8_t srv15_add1 = srv13; + +vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t 
vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 21>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +//vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask1={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +//vec_u8_t mask2={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask3={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +//vec_u8_t mask4={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask5={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +//vec_u8_t mask6={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask7={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +//vec_u8_t mask8={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask9={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +//vec_u8_t mask10={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 
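For readers cross-checking the splat constants in these mode-21 kernels: they follow the usual HEVC angular-prediction rule, fraction[y] = ((y + 1) * intraPredAngle) & 31 with intraPredAngle = -17 for mode 21, and each vfrac16_32_N vector holds the complementary weight 32 - fraction[y]. A short worked check against the vectors listed above:

    y = 0:  (1 * -17) & 31 = 15  ->  vfrac16_0 = 15,  vfrac16_32_0 = 17
    y = 1:  (2 * -17) & 31 = 30  ->  vfrac16_1 = 30,  vfrac16_32_1 = 2
    y = 2:  (3 * -17) & 31 = 13  ->  vfrac16_2 = 13,  vfrac16_32_2 = 19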
}; +vec_u8_t mask11={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +//vec_u8_t mask12={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask13={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask14={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask15={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; + +vec_u8_t mask16={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask17={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask18={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask19={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask20={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask21={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask22={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask23={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask24={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + //vec_u8_t srv_left0=vec_xl(64, srcPix0); + //vec_u8_t srv_left1=vec_xl(80, srcPix0); + //vec_u8_t srv_right=vec_xl(0, srcPix0); + //vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + //vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + //vec_u8_t s2 = vec_xl(12, srcPix0); + //vec_u8_t s3 = vec_xl(16+12, srcPix0); + + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t refmask_32 = {0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32); + vec_u8_t s1 = vec_xl(0, srcPix0);; + vec_u8_t s2 = vec_xl(16, srcPix0); + vec_u8_t s3 = vec_xl(32, srcPix0); + + + vec_u8_t srv0 = s1; + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = srv5; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = 
vec_perm(s0, s1, mask11); + vec_u8_t srv12= srv11; + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv16_0 = s2; + vec_u8_t srv16_1 = vec_perm(s1, s2, mask1); + vec_u8_t srv16_2 = srv16_1; + vec_u8_t srv16_3 = vec_perm(s1, s2, mask3); + vec_u8_t srv16_4 = srv16_3; + vec_u8_t srv16_5 = vec_perm(s1, s2, mask5); + vec_u8_t srv16_6 = srv16_5; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = vec_perm(s1, s2, mask9); + vec_u8_t srv16_10 = srv16_9; + vec_u8_t srv16_11 = vec_perm(s1, s2, mask11); + vec_u8_t srv16_12= srv16_11; + vec_u8_t srv16_13 = vec_perm(s1, s2, mask13); + vec_u8_t srv16_14 = srv16_13; + vec_u8_t srv16_15 = vec_perm(s1, s2, mask15); + + //s1, 1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,s0,s0 + + vec_u8_t srv16 = vec_perm(s0, s1, mask16); + vec_u8_t srv17 = srv16; + vec_u8_t srv18 = vec_perm(s0, s1, mask18); + vec_u8_t srv19 = srv18; + vec_u8_t srv20 = vec_perm(s0, s1, mask20); + vec_u8_t srv21 = srv20; + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = srv22; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = srv24; + vec_u8_t srv26 = vec_perm(s0, s1, mask26); + vec_u8_t srv27 = srv26; + vec_u8_t srv28 = vec_perm(s0, s1, mask28); + vec_u8_t srv29 = srv28; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16); + vec_u8_t srv16_17 = srv16_16; + vec_u8_t srv16_18 = vec_perm(s1, s2, mask18); + vec_u8_t srv16_19 = srv16_18; + vec_u8_t srv16_20 = vec_perm(s1, s2, mask20); + vec_u8_t srv16_21 = srv16_20; + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = srv16_22; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = srv16_24; + vec_u8_t srv16_26 = vec_perm(s1, s2, mask26); + vec_u8_t srv16_27 = srv16_26; + vec_u8_t srv16_28 = vec_perm(s1, s2, mask28); + vec_u8_t srv16_29 = srv16_28; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv1add1 = s1; + vec_u8_t srv2add1 = s1; + vec_u8_t srv3add1 = srv1; + vec_u8_t srv4add1 = srv1; + vec_u8_t srv5add1 = srv3; + vec_u8_t srv6add1 = srv3; + vec_u8_t srv7add1 = srv6; + vec_u8_t srv8add1 = srv6; + vec_u8_t srv9add1 = srv7; + vec_u8_t srv10add1 = srv7; + vec_u8_t srv11add1 = srv9; + vec_u8_t srv12add1= srv9; + vec_u8_t srv13add1 = srv11; + vec_u8_t srv14add1 = srv11; + vec_u8_t srv15add1 = srv14; + + vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0); + vec_u8_t srv16add1_1 = s2; + vec_u8_t srv16add1_2 = s2; + vec_u8_t srv16add1_3 = srv16_1; + vec_u8_t srv16add1_4 = srv16_1; + vec_u8_t srv16add1_5 = srv16_3; + vec_u8_t srv16add1_6 = srv16_3; + vec_u8_t srv16add1_7 = srv16_6; + vec_u8_t srv16add1_8 = srv16_6; + vec_u8_t srv16add1_9 = srv16_7; + vec_u8_t srv16add1_10 = srv16_7; + vec_u8_t srv16add1_11 = srv16_9; + vec_u8_t srv16add1_12= srv16_9; + vec_u8_t srv16add1_13 = srv16_11; + vec_u8_t srv16add1_14 = srv16_11; + vec_u8_t srv16add1_15 = srv16_14; + + //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28, + + vec_u8_t srv16add1 = srv15; + vec_u8_t srv17add1 = srv15; + vec_u8_t srv18add1 = srv16; + vec_u8_t srv19add1 = srv16; + vec_u8_t srv20add1 = srv18; + vec_u8_t srv21add1 = srv18; + vec_u8_t srv22add1 = srv20; + vec_u8_t srv23add1 = srv20; + vec_u8_t srv24add1 = srv22; + vec_u8_t srv25add1 = srv22; + vec_u8_t srv26add1 = srv24; + vec_u8_t srv27add1 = srv24; + vec_u8_t 
srv28add1 = srv26; + vec_u8_t srv29add1 = srv26; + vec_u8_t srv30add1 = srv28; + vec_u8_t srv31add1 = srv28; + + vec_u8_t srv16add1_16 = srv16_15; + vec_u8_t srv16add1_17 = srv16_15; + vec_u8_t srv16add1_18 = srv16_16; + vec_u8_t srv16add1_19 = srv16_16; + vec_u8_t srv16add1_20 = srv16_18; + vec_u8_t srv16add1_21 = srv16_18; + vec_u8_t srv16add1_22 = srv16_20; + vec_u8_t srv16add1_23 = srv16_20; + vec_u8_t srv16add1_24 = srv16_22; + vec_u8_t srv16add1_25 = srv16_22; + vec_u8_t srv16add1_26 = srv16_24; + vec_u8_t srv16add1_27 = srv16_24; + vec_u8_t srv16add1_28 = srv16_26; + vec_u8_t srv16add1_29 = srv16_26; + vec_u8_t srv16add1_30 = srv16_28; + vec_u8_t srv16add1_31 = srv16_28; + +vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_30 = (vec_u8_t){17, 17, 17, 17, 
17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, 
vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, 
dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef 
DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, }; + + + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_4={0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); +vec_u8_t vfrac4 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12}; +vec_u8_t vfrac4_32 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 
48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac8_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac8_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 
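The one_line() helper used throughout this hunk is not defined in the lines shown here; judging from the fully expanded sequence in intra_pred<4, 22> above (vec_mule/vec_mulo with both weight vectors, add 16, shift right by 5, then merge and pack), it presumably wraps those same steps. A sketch of that presumed shape, for orientation only:

    /* Presumed shape of one_line(), inferred from the expanded code in
     * intra_pred<4, 22>; the real macro is presumably defined earlier in
     * this file and may differ in detail. */
    #define one_line(ref0, ref1, w32, w, out) { \
        vmle0 = vec_mule(ref0, w32);  /* (32 - frac) * ref[x], even byte lanes */ \
        vmlo0 = vec_mulo(ref0, w32);  /* (32 - frac) * ref[x], odd byte lanes  */ \
        vmle1 = vec_mule(ref1, w);    /* frac * ref[x + 1], even lanes         */ \
        vmlo1 = vec_mulo(ref1, w);    /* frac * ref[x + 1], odd lanes          */ \
        vsume = vec_add(vec_add(vmle0, vmle1), u16_16); \
        ve = vec_sra(vsume, u16_5);   /* (sum + 16) >> 5, even lanes           */ \
        vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); \
        vo = vec_sra(vsumo, u16_5);   /* (sum + 16) >> 5, odd lanes            */ \
        out = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); /* back to 16 bytes */ \
    }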
14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 
0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +/*vec_u8_t maskadd1_1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_2={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(10, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = srv2; + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = srv4; + vec_u8_t srv6 = srv4; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = srv9; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0_add1; + vec_u8_t srv2_add1 = srv0; + vec_u8_t srv3_add1 = srv0; + vec_u8_t srv4_add1 = srv2; + vec_u8_t srv5_add1 = srv2; + vec_u8_t srv6_add1 = srv2; + vec_u8_t srv7_add1 = srv4; + vec_u8_t srv8_add1 = srv4; + vec_u8_t srv9_add1 = srv7; + vec_u8_t srv10_add1 = srv7; + vec_u8_t srv11_add1 = srv7; + vec_u8_t srv12_add1= srv9; + vec_u8_t srv13_add1 = srv9; + vec_u8_t srv14_add1 = srv12; + vec_u8_t srv15_add1 = srv12; +vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, 
srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 22>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +//vec_u8_t mask1={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask2={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +//vec_u8_t mask3={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask4={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +//vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +//vec_u8_t mask6={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask7={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask8={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask9={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask10={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask11={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask12={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask13={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask14={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask15={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; + +//vec_u8_t mask16={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask17={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask18={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask19={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 
0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask20={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask21={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask22={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask23={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask24={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask25={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + //vec_u8_t srv_left0=vec_xl(64, srcPix0); + //vec_u8_t srv_left1=vec_xl(80, srcPix0); + //vec_u8_t srv_right=vec_xl(0, srcPix0); + //vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + //vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + //vec_u8_t s2 = vec_xl(12, srcPix0); + //vec_u8_t s3 = vec_xl(16+12, srcPix0); + + vec_u8_t srv_left0 = vec_xl(64, srcPix0); + vec_u8_t srv_left1 = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32_0 ={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(4, srcPix0);; + vec_u8_t s2 = vec_xl(20, srcPix0); + //vec_u8_t s3 = vec_xl(36, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = srv2; + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = srv4; + vec_u8_t srv6 = srv4; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = srv9; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv16_1 = srv16_0; + vec_u8_t srv16_2 = vec_perm(s1, s2, mask2); + vec_u8_t srv16_3 = srv16_2; + vec_u8_t srv16_4 = vec_perm(s1, s2, mask4); + vec_u8_t srv16_5 = srv16_4; + vec_u8_t srv16_6 = srv16_4; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = vec_perm(s1, s2, mask9); + vec_u8_t srv16_10 = srv16_9; + vec_u8_t srv16_11 = srv16_9; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = srv16_12; + vec_u8_t srv16_14 = vec_perm(s1, s2, mask14); + vec_u8_t 
srv16_15 = srv16_14; + + //0(0,1),0,2,2,4,4,4,7,7,9,9,9,12,12,14,14,14,17,17,19,19,19,22,22,24,24,24,27,27,s0,s0,s0 + + vec_u8_t srv16 = srv14; + vec_u8_t srv17 = vec_perm(s0, s1, mask17); + vec_u8_t srv18 = srv17; + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = srv19; + vec_u8_t srv21 = srv19; + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = srv22; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = srv24; + vec_u8_t srv26 = srv24; + vec_u8_t srv27 = vec_perm(s0, s1, mask27); + vec_u8_t srv28 = srv27; + vec_u8_t srv29 = s0; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = srv16_14; + vec_u8_t srv16_17 = vec_perm(s1, s2, mask17); + vec_u8_t srv16_18 = srv16_17; + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = srv16_19; + vec_u8_t srv16_21 = srv16_19; + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = srv16_22; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = srv16_24; + vec_u8_t srv16_26 = srv16_24; + vec_u8_t srv16_27 = vec_perm(s1, s2, mask27); + vec_u8_t srv16_28 = srv16_27; + vec_u8_t srv16_29 = s1; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1add1 = srv0add1; + vec_u8_t srv2add1 = srv0; + vec_u8_t srv3add1 = srv0; + vec_u8_t srv4add1 = srv2; + vec_u8_t srv5add1 = srv2; + vec_u8_t srv6add1 = srv2; + vec_u8_t srv7add1 = srv4; + vec_u8_t srv8add1 = srv4; + vec_u8_t srv9add1 = srv7; + vec_u8_t srv10add1 = srv7; + vec_u8_t srv11add1 = srv7; + vec_u8_t srv12add1= srv9; + vec_u8_t srv13add1 = srv9; + vec_u8_t srv14add1 = srv12; + vec_u8_t srv15add1 = srv12; + + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv16add1_1 = srv16add1_0; + vec_u8_t srv16add1_2 = srv16_0; + vec_u8_t srv16add1_3 = srv16_0; + vec_u8_t srv16add1_4 = srv16_2; + vec_u8_t srv16add1_5 = srv16_2; + vec_u8_t srv16add1_6 = srv16_2; + vec_u8_t srv16add1_7 = srv16_4; + vec_u8_t srv16add1_8 = srv16_4; + vec_u8_t srv16add1_9 = srv16_7; + vec_u8_t srv16add1_10 = srv16_7; + vec_u8_t srv16add1_11 = srv16_7; + vec_u8_t srv16add1_12= srv16_9; + vec_u8_t srv16add1_13 = srv16_9; + vec_u8_t srv16add1_14 = srv16_12; + vec_u8_t srv16add1_15 = srv16_12; + + //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28, + //0,0,2,2,2,4,4,7,7,7,9,9,12,12,12,14,14,17,17,17,19,19,22,22,22,24,24,27,27,27, + + vec_u8_t srv16add1 = srv12; + vec_u8_t srv17add1 = srv14; + vec_u8_t srv18add1 = srv14; + vec_u8_t srv19add1 = srv17; + vec_u8_t srv20add1 = srv17; + vec_u8_t srv21add1 = srv17; + vec_u8_t srv22add1 = srv19; + vec_u8_t srv23add1 = srv19; + vec_u8_t srv24add1 = srv22; + vec_u8_t srv25add1 = srv22; + vec_u8_t srv26add1 = srv22; + vec_u8_t srv27add1 = srv24; + vec_u8_t srv28add1 = srv24; + vec_u8_t srv29add1 = srv27; + vec_u8_t srv30add1 = srv27; + vec_u8_t srv31add1 = srv27; + + vec_u8_t srv16add1_16 = srv16_12; + vec_u8_t srv16add1_17 = srv16_14; + vec_u8_t srv16add1_18 = srv16_14; + vec_u8_t srv16add1_19 = srv16_17; + vec_u8_t srv16add1_20 = srv16_17; + vec_u8_t srv16add1_21 = srv16_17; + vec_u8_t srv16add1_22 = srv16_19; + vec_u8_t srv16add1_23 = srv16_19; + vec_u8_t srv16add1_24 = srv16_22; + vec_u8_t srv16add1_25 = srv16_22; + vec_u8_t srv16add1_26 = srv16_22; + vec_u8_t srv16add1_27 = srv16_24; + vec_u8_t srv16add1_28 = srv16_24; + vec_u8_t srv16add1_29 = srv16_27; + vec_u8_t srv16add1_30 = srv16_27; + vec_u8_t srv16add1_31 = srv16_27; + +vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 
19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_6 = 
(vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, 
vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, 
srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, 
-22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_left=vec_perm(srv_left, srv_left, mask_left); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ +vec_u8_t refmask_4={0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); +vec_u8_t vfrac4 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28}; +vec_u8_t vfrac4_32 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), 
v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ +vec_u8_t refmask_8={0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, }; + + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac8_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = 
vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +/*vec_u8_t maskadd1_1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_7={0x3, 0x4, 0x5, 0x6, 0x7, 
0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ +vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(12, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = srv3; + vec_u8_t srv6 = srv3; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = srv7; + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= srv10; + vec_u8_t srv13 = srv10; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0_add1; + vec_u8_t srv2_add1 = srv0_add1; + vec_u8_t srv3_add1 = srv0; + vec_u8_t srv4_add1 = srv0; + vec_u8_t srv5_add1 = srv0; + vec_u8_t srv6_add1 = srv0; + vec_u8_t srv7_add1 = srv3; + vec_u8_t srv8_add1 = srv3; + vec_u8_t srv9_add1 = srv3; + vec_u8_t srv10_add1 = srv7; + vec_u8_t srv11_add1 = srv7; + vec_u8_t srv12_add1= srv7; + vec_u8_t srv13_add1 = srv7; + vec_u8_t srv14_add1 = srv10; + vec_u8_t srv15_add1 = srv10; +vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
20}; +vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + 
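/* ------------------------------------------------------------------------
 * Editor's note - illustrative sketch, not part of the original patch.
 * The per-row constants above follow the standard HEVC angular-prediction
 * weighting: assuming intraPredAngle == -9 for mode 23, row y uses
 * f = ((y + 1) * -9) & 31 (23, 14, 5, 28, ...) and 32 - f (9, 18, 27, 4, ...),
 * which is exactly what vfrac16_y / vfrac16_32_y broadcast across 16 lanes.
 * one_line() is presumably the vec_mule/vec_mulo, add-16, shift-right-5,
 * pack sequence that the 4x4 path in this file writes out explicitly; it is
 * defined earlier in the patch and not reproduced here. The plain-C helper
 * below (hypothetical name, byte types only) is a scalar sketch of one 16x16
 * block; "ref" stands for the extended reference row that the refmask_16
 * permute builds, so the small negative offsets are valid indices:
 *
 *   static void angular16_mode23_scalar(unsigned char* dst, int stride,
 *                                       const unsigned char* ref)
 *   {
 *       const int angle = -9;                 // assumed intraPredAngle, mode 23
 *       for (int y = 0; y < 16; y++)
 *       {
 *           int d   = (y + 1) * angle;        // deltaPos for this row
 *           int f   = d & 31;                 // fraction -> vfrac16_y
 *           int off = d >> 5;                 // arithmetic shift: row offset
 *           for (int x = 0; x < 16; x++)
 *               dst[y * stride + x] = (unsigned char)
 *                   (((32 - f) * ref[off + x] + f * ref[off + x + 1] + 16) >> 5);
 *       }
 *   }
 *
 * The "//mode 19:" comment blocks kept in the 4x4 variants of modes 23/24
 * appear to be copied from the mode-19 implementation and list mode-19
 * offsets/fractions, not the values actually used by these functions.
 * --------------------------------------------------------------------- */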
vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 23>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask14={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask17={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask21={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +/*vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask11={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask12={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask13={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask15={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; + +vec_u8_t mask16={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask18={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask19={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask20={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask22={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask23={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask25={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask26={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t 
mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/ + +vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left0 = vec_xl(64, srcPix0); + vec_u8_t srv_left1 = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32_0 ={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(8, srcPix0);; + vec_u8_t s2 = vec_xl(24, srcPix0); + //vec_u8_t s3 = vec_xl(40, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = srv3; + vec_u8_t srv6 = srv3; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = srv7; + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= srv10; + vec_u8_t srv13 = srv10; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0 + + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv16_1 = srv16_0; + vec_u8_t srv16_2 = srv16_0; + vec_u8_t srv16_3 = vec_perm(s1, s2, mask3); + vec_u8_t srv16_4 = srv16_3; + vec_u8_t srv16_5 = srv16_3; + vec_u8_t srv16_6 = srv16_3; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = srv16_7; + vec_u8_t srv16_10 = vec_perm(s1, s2, mask10); + vec_u8_t srv16_11 = srv16_10; + vec_u8_t srv16_12= srv16_10; + vec_u8_t srv16_13 = srv16_10; + vec_u8_t srv16_14 = vec_perm(s1, s2, mask14); + vec_u8_t srv16_15 = srv16_14; + + vec_u8_t srv16 = srv14; + vec_u8_t srv17 = vec_perm(s0, s1, mask17); + vec_u8_t srv18 = srv17; + vec_u8_t srv19 = srv17; + vec_u8_t srv20 = srv17; + vec_u8_t srv21 = vec_perm(s0, s1, mask21); + vec_u8_t srv22 = srv21; + vec_u8_t srv23 = srv21; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = srv24; + vec_u8_t srv26 = srv24; + vec_u8_t srv27 = srv24; + vec_u8_t srv28 = s0; + vec_u8_t srv29 = s0; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = srv16_14; + vec_u8_t srv16_17 = vec_perm(s1, s2, mask17); + vec_u8_t srv16_18 = srv16_17; + vec_u8_t srv16_19 = srv16_17; + vec_u8_t srv16_20 = srv16_17; + vec_u8_t srv16_21 = vec_perm(s1, s2, mask21); + vec_u8_t srv16_22 = srv16_21; + vec_u8_t srv16_23 = srv16_21; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = srv16_24; + vec_u8_t srv16_26 = srv16_24; + vec_u8_t srv16_27 = srv16_24; + vec_u8_t srv16_28 = s1; + vec_u8_t srv16_29 = s1; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1add1 = srv0add1; + vec_u8_t srv2add1 = srv0add1; + vec_u8_t srv3add1 = srv0; + vec_u8_t srv4add1 = srv0; + vec_u8_t srv5add1 
= srv0; + vec_u8_t srv6add1 = srv0; + vec_u8_t srv7add1 = srv3; + vec_u8_t srv8add1 = srv3; + vec_u8_t srv9add1 = srv3; + vec_u8_t srv10add1 = srv7; + vec_u8_t srv11add1 = srv7; + vec_u8_t srv12add1= srv7; + vec_u8_t srv13add1 = srv7; + vec_u8_t srv14add1 = srv10; + vec_u8_t srv15add1 = srv10; + //0,0,0,0,3,3,3,7,7,7,7,10,10,10,14,14,14,14,17,17,17,21,21,21,21,24,24,24,24, + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv16add1_1 = srv16add1_0; + vec_u8_t srv16add1_2 = srv16add1_0; + vec_u8_t srv16add1_3 = srv16_0; + vec_u8_t srv16add1_4 = srv16_0; + vec_u8_t srv16add1_5 = srv16_0; + vec_u8_t srv16add1_6 = srv16_0; + vec_u8_t srv16add1_7 = srv16_3; + vec_u8_t srv16add1_8 = srv16_3; + vec_u8_t srv16add1_9 = srv16_3; + vec_u8_t srv16add1_10 = srv16_7; + vec_u8_t srv16add1_11 = srv16_7; + vec_u8_t srv16add1_12= srv16_7; + vec_u8_t srv16add1_13 = srv16_7; + vec_u8_t srv16add1_14 = srv16_10; + vec_u8_t srv16add1_15 = srv16_10; + + vec_u8_t srv16add1 = srv10; + vec_u8_t srv17add1 = srv14; + vec_u8_t srv18add1 = srv14; + vec_u8_t srv19add1 = srv14; + vec_u8_t srv20add1 = srv14; + vec_u8_t srv21add1 = srv17; + vec_u8_t srv22add1 = srv17; + vec_u8_t srv23add1 = srv17; + vec_u8_t srv24add1 = srv21; + vec_u8_t srv25add1 = srv21; + vec_u8_t srv26add1 = srv21; + vec_u8_t srv27add1 = srv21; + vec_u8_t srv28add1 = srv24; + vec_u8_t srv29add1 = srv24; + vec_u8_t srv30add1 = srv24; + vec_u8_t srv31add1 = srv24; + + vec_u8_t srv16add1_16 = srv16_10; + vec_u8_t srv16add1_17 = srv16_14; + vec_u8_t srv16add1_18 = srv16_14; + vec_u8_t srv16add1_19 = srv16_14; + vec_u8_t srv16add1_20 = srv16_14; + vec_u8_t srv16add1_21 = srv16_17; + vec_u8_t srv16add1_22 = srv16_17; + vec_u8_t srv16add1_23 = srv16_17; + vec_u8_t srv16add1_24 = srv16_21; + vec_u8_t srv16add1_25 = srv16_21; + vec_u8_t srv16add1_26 = srv16_21; + vec_u8_t srv16add1_27 = srv16_21; + vec_u8_t srv16add1_28 = srv16_24; + vec_u8_t srv16add1_29 = srv16_24; + vec_u8_t srv16add1_30 = srv16_24; + vec_u8_t srv16add1_31 = srv16_24; + +vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = 
(vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){6, 6, 6, 6, 6, 
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, 
vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, 
vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, }; + + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + //vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + // vec_u8_t refmask_4={0x10, 0x11, 0x12, 0x13, 0x14, 0x00, }; + //vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv = vec_xl(0, srcPix0); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); +vec_u8_t vfrac4 = 
(vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12}; +vec_u8_t vfrac4_32 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask2={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* 
ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + +vec_u8_t vfrac8_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask12={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t maskadd1_0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +/*vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; 
+vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask8={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask13={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_6={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xd, 0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(14, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = srv0; + vec_u8_t srv4 = srv0; + vec_u8_t srv5 = srv0; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + 
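/* ------------------------------------------------------------------------
 * Editor's note - illustrative sketch, not part of the original patch.
 * Only three permute masks (mask0, mask6, mask12) are needed here because,
 * assuming intraPredAngle == -5 for mode 24, the integer reference offset
 * ((y + 1) * -5) >> 5 stays at -1 for rows 0-5, -2 for rows 6-11 and -3 for
 * rows 12-15; that is why srv1..srv5 above simply alias srv0, and srv7..srv11
 * below alias srv6. Each more negative offset shifts the permute window one
 * byte earlier in the spliced reference (mask0 starts at 0x2, mask6 at 0x1,
 * mask12 at 0x0). The fractions ((y + 1) * -5) & 31 are 27, 22, 17, 12, 7, 2,
 * 29, ... and match vfrac16_0..vfrac16_15 further down. A plain-C sketch
 * (hypothetical helper name) of the per-row parameters:
 *
 *   static void mode24_row_params(int offset[16], int frac[16])
 *   {
 *       for (int y = 0; y < 16; y++)
 *       {
 *           int d     = (y + 1) * -5;   // assumed deltaPos for mode 24
 *           offset[y] = d >> 5;         // -1 x6, -2 x6, -3 x4
 *           frac[y]   = d & 31;         // 27,22,17,12,7,2,29,...,16
 *       }
 *   }
 *
 * The refmask_16 permute above splices two projected left-column samples in
 * front of the top reference row, which is what those negative offsets
 * index into.
 * --------------------------------------------------------------------- */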
vec_u8_t srv7 = srv6; + vec_u8_t srv8 = srv6; + vec_u8_t srv9 = srv6; + vec_u8_t srv10 = srv6; + vec_u8_t srv11 = srv6; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = srv12; + vec_u8_t srv15 = srv12; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0_add1; + vec_u8_t srv2_add1 = srv0_add1; + vec_u8_t srv3_add1 = srv0_add1; + vec_u8_t srv4_add1 = srv0_add1; + vec_u8_t srv5_add1 = srv0_add1; + vec_u8_t srv6_add1 = srv0; + vec_u8_t srv7_add1 = srv0; + vec_u8_t srv8_add1 = srv0; + vec_u8_t srv9_add1 = srv0; + vec_u8_t srv10_add1 = srv0; + vec_u8_t srv11_add1 = srv0; + vec_u8_t srv12_add1= srv6; + vec_u8_t srv13_add1 = srv6; + vec_u8_t srv14_add1 = srv6; + vec_u8_t srv15_add1 = srv6; +vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_13 = 
(vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 24>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +/*vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };*/ +vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t 
mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +/*vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask15={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; + +vec_u8_t mask16={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask17={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask18={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };*/ +vec_u8_t mask19={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +/*vec_u8_t mask20={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask21={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask22={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask23={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask25={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask26={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask27={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/ + +vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left0 = vec_xl(64, srcPix0); + vec_u8_t srv_left1 = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32_0 ={0x1a, 0x13, 0xd, 0x6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(12, srcPix0);; + vec_u8_t s2 = vec_xl(28, srcPix0); + //vec_u8_t s3 = vec_xl(44, srcPix0); + + //(0,6)(6,6)(12,7)(19,6)(25, s0) + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = srv0; + vec_u8_t srv4 = srv0; + vec_u8_t srv5 = srv0; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = srv6; + vec_u8_t srv8 = srv6; + vec_u8_t srv9 = srv6; + vec_u8_t srv10 = srv6; + vec_u8_t srv11 = srv6; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = srv12; + vec_u8_t srv15 = srv12; + + 
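/* In this 32x32 case each output row is produced as two independent 16-byte
+     * halves: the srvN vectors above cover columns 0-15 (taken from s0/s1), while
+     * the srv16_N vectors set up below cover columns 16-31 (taken from s1/s2).
+     * Both halves use the same per-row fractions and are stored as vout pairs at
+     * dst and dst + 16 further down. +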
//0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0 + + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv16_1 = srv16_0; + vec_u8_t srv16_2 = srv16_0; + vec_u8_t srv16_3 = srv16_0; + vec_u8_t srv16_4 = srv16_0; + vec_u8_t srv16_5 = srv16_0; + vec_u8_t srv16_6 = vec_perm(s1, s2, mask6); + vec_u8_t srv16_7 = srv16_6; + vec_u8_t srv16_8 = srv16_6; + vec_u8_t srv16_9 = srv16_6; + vec_u8_t srv16_10 = srv16_6; + vec_u8_t srv16_11 = srv16_6; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = srv16_12; + vec_u8_t srv16_14 = srv16_12; + vec_u8_t srv16_15 = srv16_12; + + vec_u8_t srv16 = srv12; + vec_u8_t srv17 = srv12; + vec_u8_t srv18 = srv12; + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = srv19; + vec_u8_t srv21 = srv19; + vec_u8_t srv22 = srv19; + vec_u8_t srv23 = srv19; + vec_u8_t srv24 = srv19; + vec_u8_t srv25 = s0; + vec_u8_t srv26 = s0; + vec_u8_t srv27 = s0; + vec_u8_t srv28 = s0; + vec_u8_t srv29 = s0; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = srv16_12; + vec_u8_t srv16_17 = srv16_12; + vec_u8_t srv16_18 = srv16_12; + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = srv16_19; + vec_u8_t srv16_21 = srv16_19; + vec_u8_t srv16_22 = srv16_19; + vec_u8_t srv16_23 = srv16_19; + vec_u8_t srv16_24 = srv16_19; + vec_u8_t srv16_25 = s1; + vec_u8_t srv16_26 = s1; + vec_u8_t srv16_27 = s1; + vec_u8_t srv16_28 = s1; + vec_u8_t srv16_29 = s1; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1add1 = srv0add1; + vec_u8_t srv2add1 = srv0add1; + vec_u8_t srv3add1 = srv0add1; + vec_u8_t srv4add1 = srv0add1; + vec_u8_t srv5add1 = srv0add1; + vec_u8_t srv6add1 = srv0; + vec_u8_t srv7add1 = srv0; + vec_u8_t srv8add1 = srv0; + vec_u8_t srv9add1 = srv0; + vec_u8_t srv10add1 = srv0; + vec_u8_t srv11add1 = srv0; + vec_u8_t srv12add1= srv6; + vec_u8_t srv13add1 = srv6; + vec_u8_t srv14add1 = srv6; + vec_u8_t srv15add1 = srv6; + + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv16add1_1 = srv16add1_0; + vec_u8_t srv16add1_2 = srv16add1_0; + vec_u8_t srv16add1_3 = srv16add1_0; + vec_u8_t srv16add1_4 = srv16add1_0; + vec_u8_t srv16add1_5 = srv16add1_0; + vec_u8_t srv16add1_6 = srv16_0; + vec_u8_t srv16add1_7 = srv16_0; + vec_u8_t srv16add1_8 = srv16_0; + vec_u8_t srv16add1_9 = srv16_0; + vec_u8_t srv16add1_10 = srv16_0; + vec_u8_t srv16add1_11 = srv16_0; + vec_u8_t srv16add1_12= srv16_6; + vec_u8_t srv16add1_13 = srv16_6; + vec_u8_t srv16add1_14 = srv16_6; + vec_u8_t srv16add1_15 = srv16_6; + + vec_u8_t srv16add1 = srv6; + vec_u8_t srv17add1 = srv6; + vec_u8_t srv18add1 = srv6; + vec_u8_t srv19add1 = srv12; + vec_u8_t srv20add1 = srv12; + vec_u8_t srv21add1 = srv12; + vec_u8_t srv22add1 = srv12; + vec_u8_t srv23add1 = srv12; + vec_u8_t srv24add1 = srv12; + vec_u8_t srv25add1 = srv19; + vec_u8_t srv26add1 = srv19; + vec_u8_t srv27add1 = srv19; + vec_u8_t srv28add1 = srv19; + vec_u8_t srv29add1 = srv19; + vec_u8_t srv30add1 = srv19; + vec_u8_t srv31add1 = srv19; + + vec_u8_t srv16add1_16 = srv16_6; + vec_u8_t srv16add1_17 = srv16_6; + vec_u8_t srv16add1_18 = srv16_6; + vec_u8_t srv16add1_19 = srv16_12; + vec_u8_t srv16add1_20 = srv16_12; + vec_u8_t srv16add1_21 = srv16_12; + vec_u8_t srv16add1_22 = srv16_12; + vec_u8_t srv16add1_23 = srv16_12; + vec_u8_t srv16add1_24 = srv16_12; + vec_u8_t srv16add1_25 = srv16_19; + vec_u8_t srv16add1_26 = srv16_19; + vec_u8_t srv16add1_27 = srv16_19; + vec_u8_t 
srv16add1_28 = srv16_19; + vec_u8_t srv16add1_29 = srv16_19; + vec_u8_t srv16add1_30 = srv16_19; + vec_u8_t srv16add1_31 = srv16_19; + +vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = 
(vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, 
vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, 
vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, }; +vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 
0x4, 0x1, 0x2, 0x3, 0x4, }; + + + //mode 19: + //int offset[32] = {-1, -2, -3, -4, -5, -5, -6, -7, -8, -9, -9, -10, -11, -12, -13, -13, -14, -15, -16, -17, -18, -18, -19, -20, -21, -22, -22, -23, -24, -25, -26, -26}; + //int fraction[32] = {6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0, 6, 12, 18, 24, 30, 4, 10, 16, 22, 28, 2, 8, 14, 20, 26, 0}; + //mode=19 width=32 nbProjected=25(invAngleSum >> 8)=1 ,(invAngleSum >> 8)=2 ,(invAngleSum >> 8)=4 ,(invAngleSum >> 8)=5 ,(invAngleSum >> 8)=6 ,(invAngleSum >> 8)=7 ,(invAngleSum >> 8)=9 ,(invAngleSum >> 8)=10 ,(invAngleSum >> 8)=11 ,(invAngleSum >> 8)=12 ,(invAngleSum >> 8)=14 ,(invAngleSum >> 8)=15 ,(invAngleSum >> 8)=16 ,(invAngleSum >> 8)=17 ,(invAngleSum >> 8)=18 ,(invAngleSum >> 8)=20 ,(invAngleSum >> 8)=21 ,(invAngleSum >> 8)=22 ,(invAngleSum >> 8)=23 ,(invAngleSum >> 8)=25 ,(invAngleSum >> 8)=26 ,(invAngleSum >> 8)=27 ,(invAngleSum >> 8)=28 ,(invAngleSum >> 8)=30 ,(invAngleSum >> 8)=31 + + //mode19 invAS[32]= {1, 2, 4, }; + //vec_u8_t mask_left={0x1, 0x02, 0x04, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,0x0, 0x0}; + //vec_u8_t srv_left=vec_xl(8, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t refmask_4={0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, }; + //vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv=vec_xl(0, srcPix0); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); +vec_u8_t vfrac4 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24}; +vec_u8_t vfrac4_32 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, 
dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask2={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask3={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask4={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + //vec_u8_t srv_left=vec_xl(16, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t refmask_8={0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, }; + //vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + vec_u8_t srv = vec_xl(0, srcPix0); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + +vec_u8_t vfrac8_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = 
{0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +/*vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask2={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask3={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask4={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask5={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask7={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask8={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask9={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask10={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask11={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask12={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask13={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_1={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_2={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_3={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t 
maskadd1_7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_8={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + //vec_u8_t srv_left=vec_xl(32, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t srv_right=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + //vec_u8_t refmask_16={0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, }; + //vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + //vec_u8_t s1 = vec_xl(12, srcPix0); + + vec_u8_t srv0 = vec_xl(0, srcPix0); + vec_u8_t srv1 = vec_xl(1, srcPix0); + +vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 
12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv0, srv1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 25>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left = vec_xl(80, srcPix0); + vec_u8_t 
srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32 ={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_32); + vec_u8_t s1 = vec_xl(15, srcPix0);; + vec_u8_t s2 = vec_xl(31, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + +vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + + /* 
dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv0add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_0, srv16add1_0, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv0add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_0, srv16add1_0, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv0, srv0add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_0, srv16add1_0, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv0, srv0add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_0, srv16add1_0, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv0, srv0add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_0, srv16add1_0, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv0, srv0add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_0, srv16add1_0, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv0, srv0add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_0, srv16add1_0, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv0, srv0add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_0, srv16add1_0, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv0, srv0add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_0, srv16add1_0, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv0, srv0add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_0, srv16add1_0, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv0, srv0add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_0, srv16add1_0, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv0, srv0add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_0, srv16add1_0, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv0, srv0add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_0, srv16add1_0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv0, srv0add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_0, srv16add1_0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv0, srv0add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_0, srv16add1_0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, 
dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(s0, srv0, vfrac16_32_0, vfrac16_0, vout_0); + one_line(s1, srv16_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(s0, srv0, vfrac16_32_1, vfrac16_1, vout_2); + one_line(s1, srv16_0, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(s0, srv0, vfrac16_32_2, vfrac16_2, vout_4); + one_line(s1, srv16_0, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(s0, srv0, vfrac16_32_3, vfrac16_3, vout_6); + one_line(s1, srv16_0, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(s0, srv0, vfrac16_32_4, vfrac16_4, vout_8); + one_line(s1, srv16_0, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(s0, srv0, vfrac16_32_5, vfrac16_5, vout_10); + one_line(s1, srv16_0, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(s0, srv0, vfrac16_32_6, vfrac16_6, vout_12); + one_line(s1, srv16_0, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(s0, srv0, vfrac16_32_7, vfrac16_7, vout_14); + one_line(s1, srv16_0, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(s0, srv0, vfrac16_32_8, vfrac16_8, vout_16); + one_line(s1, srv16_0, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(s0, srv0, vfrac16_32_9, vfrac16_9, vout_18); + one_line(s1, srv16_0, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(s0, srv0, vfrac16_32_10, vfrac16_10, vout_20); + one_line(s1, srv16_0, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(s0, srv0, vfrac16_32_11, vfrac16_11, vout_22); + one_line(s1, srv16_0, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(s0, srv0, vfrac16_32_12, vfrac16_12, vout_24); + one_line(s1, srv16_0, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(s0, srv0, vfrac16_32_13, vfrac16_13, vout_26); + one_line(s1, srv16_0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(s0, srv0, vfrac16_32_14, vfrac16_14, vout_28); + one_line(s1, srv16_0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(s0, srv0, vfrac16_32_15, vfrac16_15, vout_30); + one_line(s1, srv16_0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + 
printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(0, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_sld(srv, srv, 15); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_w4x4_mask9)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t v_mask = {0x10, 0x02, 0x03, 0x04, 0x11, 0x02, 0x03, 0x04, 0x12, 0x02, 0x03, 0x04, 0x13, 0x02, 0x03, 0x04}; + vec_u8_t vout = vec_perm(srv, v_filter_u8, v_mask); + if(dstStride == 4) { + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_u8_t v1 = vec_sld(vout, vout, 12); + vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride)); + vec_u8_t v2 = vec_sld(vout, vout, 8); + vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2)); + vec_u8_t v3 = vec_sld(vout, vout, 4); + vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3)); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + } + else{ + + if(dstStride == 4) { + vec_u8_t v_mask0 = {0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u8_t v0 = vec_perm(srv, srv, v_mask0); + vec_xst(v0, 0, dst); + } + else if(dstStride%16 == 0){ + vec_u8_t v_mask0 = {0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u8_t v0 = vec_perm(srv, srv, v_mask0); + vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst); + vec_u8_t v1 = vec_sld(v0, v0, 12); + vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride)); + vec_u8_t v2 = vec_sld(v0, v0, 8); + vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2)); + vec_u8_t v3 = vec_sld(v0, v0, 4); + vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3)); + } + else{ + vec_u8_t v_mask0 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 
0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srv, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srv, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srv, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srv, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<8, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(0, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(17, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t v_mask0 = {0x00, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x01, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v_mask1 = {0x02, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x03, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v_mask2 = {0x04, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x05, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v_mask3 = {0x06, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x07, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v0 = vec_perm(v_filter_u8, srv, v_mask0); + vec_u8_t v1 = vec_perm(v_filter_u8, srv, v_mask1); + vec_u8_t v2 = vec_perm(v_filter_u8, srv, v_mask2); + vec_u8_t v3 = vec_perm(v_filter_u8, srv, v_mask3); + if(dstStride == 8) { + vec_xst(v0, 0, dst); + vec_xst(v1, 16, dst); + vec_xst(v2, 32, dst); + vec_xst(v3, 48, dst); + } + else{ + vec_u8_t v_maskh = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_maskl = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_xst(vec_perm(v0, vec_xl(0, dst), v_maskh), 0, dst); + vec_xst(vec_perm(v0, vec_xl(dstStride, dst), v_maskl), dstStride, dst); + vec_xst(vec_perm(v1, vec_xl(dstStride*2, dst), v_maskh), dstStride*2, dst); + vec_xst(vec_perm(v1, vec_xl(dstStride*3, dst), v_maskl), dstStride*3, dst); + vec_xst(vec_perm(v2, vec_xl(dstStride*4, dst), v_maskh), dstStride*4, dst); + vec_xst(vec_perm(v2, vec_xl(dstStride*5, dst), v_maskl), dstStride*5, dst); + vec_xst(vec_perm(v3, vec_xl(dstStride*6, dst), v_maskh), dstStride*6, dst); + vec_xst(vec_perm(v3, vec_xl(dstStride*7, dst), v_maskl), dstStride*7, dst); + } + } + else{ + if(dstStride == 8) { + vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t v0 = vec_perm(srv, srv, v_mask); + vec_xst(v0, 0, dst); + vec_xst(v0, 16, dst); + vec_xst(v0, 32, dst); + vec_xst(v0, 48, dst); + } + else{ + vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_xst(vec_perm(srv, vec_xl(0, dst), v_mask), 0, dst); + 
vec_xst(vec_perm(srv, vec_xl(dstStride, dst), v_mask), dstStride, dst); + vec_xst(vec_perm(srv, vec_xl(dstStride*2, dst), v_mask), dstStride*2, dst); + vec_xst(vec_perm(srv, vec_xl(dstStride*3, dst), v_mask), dstStride*3, dst); + vec_xst(vec_perm(srv, vec_xl(dstStride*4, dst), v_mask), dstStride*4, dst); + vec_xst(vec_perm(srv, vec_xl(dstStride*5, dst), v_mask), dstStride*5, dst); + vec_xst(vec_perm(srv, vec_xl(dstStride*6, dst), v_mask), dstStride*6, dst); + vec_xst(vec_perm(srv, vec_xl(dstStride*7, dst), v_mask), dstStride*7, dst); + } + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(0, srcPix0); + vec_u8_t srv1 =vec_xl(1, srcPix0); + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(33, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask)); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 
0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + + + if(dstStride == 16) { + vec_xst(vec_perm(v_filter_u8, srv1, mask0), 0, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask1), 16, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask2), 32, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask3), 48, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask4), 64, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask5), 80, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask6), 96, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask7), 112, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask8), 128, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask9), 144, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask10), 160, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask11), 176, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask12), 192, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask13), 208, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask14), 224, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask15), 240, dst); + } + else{ + vec_xst(vec_perm(v_filter_u8, srv1, mask0), 0, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask1), dstStride, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask2), dstStride*2, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask3), dstStride*3, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask4), dstStride*4, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask5), dstStride*5, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask6), dstStride*6, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask7), dstStride*7, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask8), dstStride*8, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask9), dstStride*9, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask10), dstStride*10, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask11), dstStride*11, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask12), dstStride*12, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask13), dstStride*13, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask14), dstStride*14, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask15), dstStride*15, dst); + } + } + else{ + if(dstStride == 16) { + vec_xst(srv1, 0, dst); + vec_xst(srv1, 16, dst); + vec_xst(srv1, 32, dst); + vec_xst(srv1, 48, dst); + vec_xst(srv1, 64, dst); + vec_xst(srv1, 80, dst); + vec_xst(srv1, 96, dst); + vec_xst(srv1, 112, dst); + vec_xst(srv1, 128, dst); + vec_xst(srv1, 144, dst); + vec_xst(srv1, 160, dst); + vec_xst(srv1, 176, dst); + vec_xst(srv1, 192, dst); + vec_xst(srv1, 208, dst); + vec_xst(srv1, 224, dst); + vec_xst(srv1, 240, dst); + } + else{ + vec_xst(srv1, 0, dst); + vec_xst(srv1, dstStride, dst); + vec_xst(srv1, dstStride*2, dst); + vec_xst(srv1, dstStride*3, dst); + vec_xst(srv1, dstStride*4, dst); + vec_xst(srv1, dstStride*5, dst); + vec_xst(srv1, dstStride*6, dst); + vec_xst(srv1, dstStride*7, dst); + vec_xst(srv1, dstStride*8, dst); + vec_xst(srv1, dstStride*9, dst); + vec_xst(srv1, dstStride*10, dst); + vec_xst(srv1, dstStride*11, dst); + vec_xst(srv1, dstStride*12, dst); + vec_xst(srv1, dstStride*13, dst); + vec_xst(srv1, dstStride*14, dst); + vec_xst(srv1, dstStride*15, dst); + } + } + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 26>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(1, srcPix0); /* offset = 
width2+1 = width<<1 + 1 */ + vec_u8_t srv1 =vec_xl(17, srcPix0); + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_u8_t srcv1 = vec_xl(65, srcPix0); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + + vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst); + vec_xst(srv1, 16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask1), dstStride, dst); + vec_xst(srv1, dstStride+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask2), dstStride*2, dst); + vec_xst(srv1, dstStride*2+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask3), dstStride*3, dst); + vec_xst(srv1, dstStride*3+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask4), dstStride*4, dst); + vec_xst(srv1, dstStride*4+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask5), dstStride*5, dst); + vec_xst(srv1, dstStride*5+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask6), dstStride*6, dst); + 
vec_xst(srv1, dstStride*6+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask7), dstStride*7, dst); + vec_xst(srv1, dstStride*7+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask8), dstStride*8, dst); + vec_xst(srv1, dstStride*8+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask9), dstStride*9, dst); + vec_xst(srv1, dstStride*9+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask10), dstStride*10, dst); + vec_xst(srv1, dstStride*10+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask11), dstStride*11, dst); + vec_xst(srv1, dstStride*11+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask12), dstStride*12, dst); + vec_xst(srv1, dstStride*12+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask13), dstStride*13, dst); + vec_xst(srv1, dstStride*13+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask14), dstStride*14, dst); + vec_xst(srv1, dstStride*14+16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask15), dstStride*15, dst); + vec_xst(srv1, dstStride*15+16, dst); + + vec_u8_t srcv2 = vec_xl(81, srcPix0); + vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh)); + vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl)); + vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v ); + vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v ); + vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16); + vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16); + vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum)); + vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum)); + vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16); + + vec_xst(vec_perm(v2_filter_u8, srv, mask0), dstStride*16, dst); + vec_xst(srv1, dstStride*16+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask1), dstStride*17, dst); + vec_xst(srv1, dstStride*17+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask2), dstStride*18, dst); + vec_xst(srv1, dstStride*18+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask3), dstStride*19, dst); + vec_xst(srv1, dstStride*19+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask4), dstStride*20, dst); + vec_xst(srv1, dstStride*20+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask5), dstStride*21, dst); + vec_xst(srv1, dstStride*21+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask6), dstStride*22, dst); + vec_xst(srv1, dstStride*22+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask7), dstStride*23, dst); + vec_xst(srv1, dstStride*23+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask8), dstStride*24, dst); + vec_xst(srv1, dstStride*24+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask9), dstStride*25, dst); + vec_xst(srv1, dstStride*25+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask10), dstStride*26, dst); + vec_xst(srv1, dstStride*26+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask11), dstStride*27, dst); + vec_xst(srv1, dstStride*27+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask12), dstStride*28, dst); + vec_xst(srv1, dstStride*28+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask13), dstStride*29, dst); + vec_xst(srv1, dstStride*29+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask14), dstStride*30, dst); + vec_xst(srv1, dstStride*30+16, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask15), dstStride*31, dst); + vec_xst(srv1, dstStride*31+16, dst); + + } + else{ + int offset = 0; + + for(int i=0; i<32; i++){ + vec_xst(srv, offset, dst); + vec_xst(srv1, 16+offset, dst); + offset += dstStride; + } + } 
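The <N, 26> specializations above implement the purely vertical predictor: every output row is a copy of the above-row reference samples srcPix0[1..N], and when bFilter is set the first column is additionally corrected from the left reference column and clamped. A minimal scalar sketch of that path for comparison, assuming an 8-bit pixel build; topRef, leftRef and corner are illustrative names for the samples the vector code loads from srcPix0:

    #include <cstdint>

    // Scalar paraphrase of the mode-26 (vertical) path above: copy the top
    // reference row into every line, then optionally filter column 0 from the
    // left reference column, clamping to the 8-bit pixel range.
    static void vertical_pred_sketch(uint8_t* dst, intptr_t dstStride, int size,
                                     const uint8_t* topRef, const uint8_t* leftRef,
                                     uint8_t corner, int bFilter)
    {
        for (int y = 0; y < size; y++)
            for (int x = 0; x < size; x++)
                dst[y * dstStride + x] = topRef[x];

        if (bFilter)
            for (int y = 0; y < size; y++)
            {
                int v = topRef[0] + ((leftRef[y] - corner) >> 1);
                dst[y * dstStride] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
            }
    }

In the intrinsics this corresponds to the vec_xst() copy loop in the else branch and, in the bFilter branch, to the clamped v_filter_u8 vector permuted into byte 0 of each row.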
+#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24}; /* 32 - fraction[0-3] */ + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 
0x07, 0x08}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac8_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac8_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac8_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv0, vfrac8_32_1); + vmlo0 = vec_mulo(srv0, vfrac8_32_1); + vmle1 = vec_mule(srv1, vfrac8_1); + vmlo1 = vec_mulo(srv1, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y4, y5 */ + vmle0 = vec_mule(srv0, vfrac8_32_2); + vmlo0 = vec_mulo(srv0, vfrac8_32_2); + vmle1 = vec_mule(srv1, vfrac8_2); + vmlo1 = vec_mulo(srv1, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv0, vfrac8_32_3); + vmlo0 = vec_mulo(srv0, vfrac8_32_3); + vmle1 = vec_mule(srv1, vfrac8_3); + vmlo1 = vec_mulo(srv1, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, 
vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 
28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + +#if 0 + #define one_line(s0, s1, vf32, vf, vout) {\ + vmle0 = vec_mule(s0, vf32);\ + vmlo0 = vec_mulo(s0, vf32);\ + vmle1 = vec_mule(s1, vf);\ + vmlo1 = vec_mulo(s1, vf);\ + vsume = vec_add(vec_add(vmle0, vmle1), u16_16);\ + ve = vec_sra(vsume, u16_5);\ + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);\ + vo = vec_sra(vsumo, u16_5);\ + vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));\ + } +#endif + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + 
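The one_line() macro used throughout these blocks evaluates the two-tap blend stated in the comment above for 16 pixels at once: vec_mule()/vec_mulo() widen the even and odd byte lanes to 16-bit products, the two rounded sums are shifted right by 5, and vec_mergeh()/vec_mergel() plus vec_pack() re-interleave the lanes back into bytes. A scalar sketch of the same arithmetic, with illustrative names; for mode 27 (angle step 2) the per-row frac constants above correspond to ((y + 1) * 2) & 31:

    #include <cstdint>

    // Scalar form of the blend computed by one_line(): two neighbouring
    // reference samples weighted by (32 - frac) and frac, rounded, and
    // shifted back down to an 8-bit pixel.
    static inline uint8_t angular_blend(const uint8_t* ref, int off, int x, int frac)
    {
        return (uint8_t)(((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5);
    }

For example, vout_0 above holds angular_blend(ref, 0, x, 2) for x = 0..15, which is why vfrac16_0 is the all-2 vector and vfrac16_32_0 the all-30 vector.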
vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 27>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y= 15, use srv1, srv2 */ + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */ + + vec_u8_t srv4 = sv1; + vec_u8_t srv5 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv6 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv7 = vec_perm(sv2, sv2, mask3); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 
26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv4, srv5, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv4, srv5, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv4, srv5, 
vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv5, srv6, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + + one_line(srv1, srv2, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv5, srv6, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv5, srv6, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv5, srv6, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv5, srv6, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv5, srv6, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv5, srv6, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv1, srv2, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv1, srv2, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv1, srv2, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv5, srv6, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + 
vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 28 + //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5}; + //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 
= {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 28 + //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5}; + //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac8_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac8_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac8_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, 
u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv0, vfrac8_32_1); + vmlo0 = vec_mulo(srv0, vfrac8_32_1); + vmle1 = vec_mule(srv1, vfrac8_1); + vmlo1 = vec_mulo(srv1, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y4, y5 */ + vmle0 = vec_mule(srv0, vfrac8_32_2); + vmlo0 = vec_mulo(srv0, vfrac8_32_2); + vmle1 = vec_mule(srv1, vfrac8_2); + vmlo1 = vec_mulo(srv1, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv1, vfrac8_32_3); + vmlo0 = vec_mulo(srv1, vfrac8_32_3); + vmle1 = vec_mule(srv2, vfrac8_3); + vmlo1 = vec_mulo(srv2, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to 
update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + + //mode 28 + //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5}; + //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0}; + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13,13, 13, 13, 13,13, 13, 13, 13}; + vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_15 = 
(vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 28>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 
= vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y= 15, use srv1, srv2 */ + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */ + vec_u8_t srv8 = vec_perm(sv0, sv1, mask4); /* y=31, use srv2, srv3 */ + vec_u8_t srv9 = vec_perm(sv0, sv1, mask5); /* y=31, use srv2, srv3 */ + vec_u8_t srv12 = vec_perm(sv0, sv1, mask6); /* y=31, use srv2, srv3 */ + + vec_u8_t srv4 = sv1; + vec_u8_t srv5 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv6 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv7 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv10 = vec_perm(sv1, sv2, mask4); /* y=31, use srv2, srv3 */ + vec_u8_t srv11 = vec_perm(sv1, sv2, mask5); /* y=31, use srv2, srv3 */ + vec_u8_t srv13 = vec_perm(sv1, sv2, mask6); /* y=31, use srv2, srv3 */ + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13,13, 13, 13, 13,13, 13, 13, 13}; + vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + 
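+ /* vfrac16_16 .. vfrac16_31 hold the mode-28 fractions for rows 16-31 of the
+  * 32x32 block; every row is written with two 16-byte vec_xst stores, at
+  * dst + y*dstStride and dst + y*dstStride + 16. */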
vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_32_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ 
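+ /* A minimal scalar sketch of the equation above (with its closing parenthesis
+  * restored), assuming ref = srcPix0 + 1 as in the vec_xl() loads and the
+  * per-mode offset[]/fraction[] tables quoted in the mode 29/30 comments below:
+  *
+  *     for (int y = 0; y < 32; y++)
+  *         for (int x = 0; x < 32; x++)
+  *             dst[y * dstStride + x] =
+  *                 (pixel)(((32 - fraction[y]) * ref[offset[y] + x]
+  *                          + fraction[y] * ref[offset[y] + x + 1] + 16) >> 5);
+  *
+  * The one_line() macro computes this for 16 pixels at a time: vec_mule/vec_mulo
+  * widen the even/odd bytes of the two shifted reference vectors to 16 bits and
+  * multiply them by the splatted (32 - fraction[y]) and fraction[y] constants,
+  * the rounding constant u16_16 is added, the sums are shifted right by u16_5,
+  * and vec_mergeh/vec_mergel plus vec_pack narrow the even/odd halves back into
+  * one 16-byte row (see the #if 0 copy of the macro in intra_pred<16, 29> below). */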
+ vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv6, srv7, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv2, srv3, 
vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv2, srv3, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv6, srv7, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv2, srv3, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv6, srv7, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv3, srv8, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv7, srv10, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv3, srv8, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv7, srv10, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv3, srv8, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv7, srv10, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv3, srv8, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv7, srv10, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv3, srv8, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv7, srv10, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv3, srv8, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv7, srv10, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv8, srv9, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv10, srv11, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv8, srv9, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv10, srv11, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv8, srv9, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv10, srv11, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv10, srv11, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv10, srv11, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv10, srv11, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv9, srv12, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv11, srv13, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 
0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 
5, 14, 23, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask2={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask3={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask5={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 0, 1 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 1, 2 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 2 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 2, 3 */ + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac8_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac8_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac8_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32_1); + vmlo0 = vec_mulo(srv2, vfrac8_32_1); + vmle1 = vec_mule(srv3, vfrac8_1); + vmlo1 = vec_mulo(srv3, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv1, vfrac8_32_2); + vmlo0 = vec_mulo(srv1, vfrac8_32_2); + vmle1 = vec_mule(srv4, vfrac8_2); + vmlo1 = vec_mulo(srv4, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv3, 
vfrac8_32_3); + vmlo0 = vec_mulo(srv3, vfrac8_32_3); + vmle1 = vec_mule(srv5, vfrac8_3); + vmlo1 = vec_mulo(srv5, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + 
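+ /* Each vfrac16_y splats fraction[y] from the mode-29 table quoted above across
+  * all 16 lanes; the matching vfrac16_32_y constants splat 32 - fraction[y]. */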
vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + +#if 0 + #define one_line(s0, s1, vf32, vf, vout) {\ + vmle0 = vec_mule(s0, vf32);\ + vmlo0 = vec_mulo(s0, vf32);\ + vmle1 = vec_mule(s1, vf);\ + vmlo1 = vec_mulo(s1, vf);\ + vsume = vec_add(vec_add(vmle0, 
vmle1), u16_16);\ + ve = vec_sra(vsume, u16_5);\ + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16);\ + vo = vec_sra(vsumo, u16_5);\ + vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo));\ + } +#endif + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 29>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 
0x19}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + + vec_u8_t srv00 = sv1; + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 }; + vec_u8_t vfrac16_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 
6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_32_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 
22, 22}; + vec_u8_t vfrac16_32_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv00, srv10, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv10, srv20, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv10, srv20, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv10, srv20, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv20, srv30, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv20, srv30, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv20, srv30, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv30, srv40, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv30, srv40, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv30, srv40, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv30, srv40, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv40, srv50, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv40, srv50, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, 
dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv4, srv5, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv40, srv50, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv5, srv6, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv50, srv60, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv5, srv6, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv50, srv60, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv5, srv6, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv50, srv60, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv5, srv6, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv50, srv60, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv6, srv7, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv60, srv70, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv6, srv7, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv60, srv70, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv6, srv7, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv60, srv70, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv7, srv8, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv70, srv80, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv7, srv8, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv70, srv80, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv7, srv8, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv70, srv80, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv7, srv8, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv70, srv80, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv80, srv90, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv80, srv90, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv9, srva, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv90, srva0, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, 
dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 30: + //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13}; + //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, 
dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 30: + //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13}; + //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask5={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 2 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 3 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 3, 4 */ + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac8_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac8_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14 }; + vec_u8_t vfrac8_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv1, vfrac8_32_1); + vmlo0 = vec_mulo(srv1, 
vfrac8_32_1); + vmle1 = vec_mule(srv2, vfrac8_1); + vmlo1 = vec_mulo(srv2, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv2, vfrac8_32_2); + vmlo0 = vec_mulo(srv2, vfrac8_32_2); + vmle1 = vec_mule(srv3, vfrac8_2); + vmlo1 = vec_mulo(srv3, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv4, vfrac8_32_3); + vmlo0 = vec_mulo(srv4, vfrac8_32_3); + vmle1 = vec_mule(srv5, vfrac8_3); + vmlo1 = vec_mulo(srv5, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 30: + //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13}; + //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0}; + + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 
0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + //vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + //vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + //vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + //vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + //vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + //vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + //vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + //vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + //vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + //vec_u8_t srva = vec_perm(sv0, sv1, mask10); + //vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + //vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + //vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + //vec_u8_t srve = vec_perm(sv0, sv1, mask14); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + 
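+ /* Row y pairs srv[offset[y]] with srv[offset[y]+1]: each srvN is the reference
+  * advanced by N bytes via vec_perm with maskN, so the one_line() calls below
+  * step through (srv0, srv1) .. (srv6, srv7) as the mode-30 offset[] grows from 0 to 6. */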
vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, 
dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 30>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + //mode 30: + //int offset[32] = {0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13}; + //int fraction[32] = {13, 26, 7, 20, 1, 14, 27, 8, 21, 2, 15, 28, 9, 22, 3, 16, 29, 10, 23, 4, 17, 30, 11, 24, 5, 18, 31, 12, 25, 6, 19, 0}; + + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, 
offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; 
+ vec_u8_t vfrac16_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_32_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, 
vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv20, srv30, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv20, srv30, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv30, srv40, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv30, srv40, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv40, srv50, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv40, srv50, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv40, srv50, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv50, srv60, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv50, srv60, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv60, srv70, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv60, srv70, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv60, srv70, 
vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv7, srv8, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv70, srv80, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv7, srv8, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv70, srv80, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv8, srv9, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv80, srv90, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv8, srv9, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv80, srv90, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv8, srv9, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv80, srv90, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv9, srva, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv90, srva0, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv9, srva, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv90, srva0, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srva, srvb, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srva0, srvb0, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srva, srvb, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srva0, srvb0, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srva, srvb, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srva0, srvb0, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srvb, srvc, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srvb0, srvc0, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srvb, srvc, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srvb0, srvc0, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srvc, srvd, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srvc0, srvd0, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srvc, srvd, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srvc0, srvd0, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srvd, srve, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srvd0, srve0, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[y * 
dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + } + */ + //mode 31: + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4}; + vec_u8_t vfrac4_32 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + 
vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 7] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 7] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 7] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off3 + 7] + f[0] * ref[off3 + 7] + 16) >> 5); + + ... 
+ + y=7; off7 = offset[7]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off7 + 7] + f[0] * ref[off7 + 7] + 16) >> 5); + } + */ + //mode 31: + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 2 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 3 */ + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac8_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac8_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac8_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv1, vfrac8_32_1); + vmlo0 = vec_mulo(srv1, vfrac8_32_1); + vmle1 = vec_mule(srv2, vfrac8_1); + vmlo1 = vec_mulo(srv2, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve 
= vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv2, vfrac8_32_2); + vmlo0 = vec_mulo(srv2, vfrac8_32_2); + vmle1 = vec_mule(srv3, vfrac8_2); + vmlo1 = vec_mulo(srv3, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv3, vfrac8_32_3); + vmlo0 = vec_mulo(srv3, vfrac8_32_3); + vmle1 = vec_mule(srv4, vfrac8_3); + vmlo1 = vec_mulo(srv4, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5); + + ... + + y=15; off7 = offset[7]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + } + */ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, 
mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + 
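    /* For reference, a minimal scalar sketch of the row formula that each one_line()
       step below vectorizes (one_line is assumed to be a helper macro defined earlier
       in this file, outside this hunk), using the mode-31 offset[]/fraction[] tables
       quoted in the comments of the neighbouring functions:

           for (int x = 0; x < 16; x++)
               dst[y * dstStride + x] = (pixel)(((32 - fraction[y]) * ref[offset[y] + x]
                                                 + fraction[y] * ref[offset[y] + x + 1] + 16) >> 5);

       The first two vector arguments carry ref[offset[y] + x] and ref[offset[y] + x + 1]
       for x = 0..15, and the two vfrac16 arguments are (32 - fraction[y]) and fraction[y]
       splatted across all lanes. */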
+ one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 31>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + ... 
+ + y=15; off15 = offset[15]; x=0-31; off15-off30 = 1; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + + ... + + y=31; off31= offset[31]; x=0-31; off31 = 2; + dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off15 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off15 + 1] + f[31] * ref[off31 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off15 + 2] + f[31] * ref[off31 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off15 + 3] + f[31] * ref[off31 + 4] + 16) >> 5); + ... + dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off15 + 31] + f[31] * ref[off31 + 32] + 16) >> 5); + } + */ + //mode 31: + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0}; + + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, 
off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15); + + vec_u8_t srv000 = sv2; + vec_u8_t srv100 = vec_perm(sv2, sv3, mask1); + vec_u8_t srv200 = vec_perm(sv2, sv3, mask2); + + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1}; +vec_u8_t vfrac16_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ +vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){10, 10, 10, 10, 
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv30, srv40, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv40, srv50, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv40, srv50, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv50, srv60, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv50, srv60, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv60, srv70, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv60, srv70, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv70, srv80, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv70, srv80, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv80, srv90, vfrac16_32_15, vfrac16_15, vout_31); 
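    /* Each 32-pixel row of this block is produced as a pair of vectors: vout_{2k}
       covers columns 0..15 and vout_{2k+1} covers columns 16..31 of row k (the srv*
       operands are built from sv0/sv1, the srv*0 operands from sv1/sv2, i.e. the
       upper half of the reference row). The stores below therefore write two 16-byte
       vectors per row, at y*dstStride and y*dstStride + 16, for rows 0..15; a scalar
       sketch of the addressing (store16/lowHalf/highHalf are placeholder names for
       this sketch only):

           for (int y = 0; y < 16; y++)
           {
               store16(dst + y * dstStride,      lowHalf[y]);   // vout_0, vout_2, ...
               store16(dst + y * dstStride + 16, highHalf[y]);  // vout_1, vout_3, ...
           }
    */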
+ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srv9, srva, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv90, srva0, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv9, srva, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv90, srva0, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srva, srvb, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srva0, srvb0, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srva, srvb, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srva0, srvb0, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srvb, srvc, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srvb0, srvc0, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srvb, srvc, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srvb0, srvc0, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srvc, srvd, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srvc0, srvd0, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srvc, srvd, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srvc0, srvd0, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srvd, srve, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srvd0, srve0, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srvd, srve, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srvd0, srve0, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srve, srvf, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srve0, srvf0, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srve, srvf, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srve0, srvf0, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srvf, srv00, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srvf0, srv000, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srvf, srv00, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srvf0, srv000, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv00, srv10, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv000, srv100, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv10, srv20, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv100, srv200, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + 
vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void intra_pred<4, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + } + */ + //mode 32: + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 
5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + +vec_u8_t vfrac4 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20}; +vec_u8_t vfrac4_32 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 7] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 7] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 7] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off3 + 7] + f[0] * ref[off3 + 7] + 16) >> 5); + + ... + + y=7; off7 = offset[7]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off7 + 7] + f[0] * ref[off7 + 7] + 16) >> 5); + } + */ + //mode 32: + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask5={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + +vec_u8_t vfrac8_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8}; + +vec_u8_t vfrac8_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv1, vfrac8_32_1); + vmlo0 = vec_mulo(srv1, vfrac8_32_1); + vmle1 = vec_mule(srv2, vfrac8_1); + vmlo1 = vec_mulo(srv2, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t 
vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv3, vfrac8_32_2); + vmlo0 = vec_mulo(srv3, vfrac8_32_2); + vmle1 = vec_mule(srv4, vfrac8_2); + vmlo1 = vec_mulo(srv4, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv5, vfrac8_32_3); + vmlo0 = vec_mulo(srv5, vfrac8_32_3); + vmle1 = vec_mule(srv6, vfrac8_3); + vmlo1 = vec_mulo(srv6, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5); + + ... + + y=15; off7 = offset[7]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + } + */ + //mode 32: + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 
1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + +vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 
26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 32>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + ... + + y=15; off15 = offset[15]; x=0-31; off15-off30 = 1; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + + ... + + y=31; off31= offset[31]; x=0-31; off31 = 2; + dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off15 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off15 + 1] + f[31] * ref[off31 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off15 + 2] + f[31] * ref[off31 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off15 + 3] + f[31] * ref[off31 + 4] + 16) >> 5); + ... + dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off15 + 31] + f[31] * ref[off31 + 32] + 16) >> 5); + } + */ + //mode 32: + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + //int fraction[32] = {21, 10, 31, 20, 9, 30, 19, 8, 29, 18, 7, 28, 17, 6, 27, 16, 5, 26, 15, 4, 25, 14, 3, 24, 13, 2, 23, 12, 1, 22, 11, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 
0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15); + + vec_u8_t srv000 = sv2; + vec_u8_t srv100 = vec_perm(sv2, sv3, mask1); + vec_u8_t srv200 = vec_perm(sv2, sv3, mask2); + vec_u8_t srv300 = vec_perm(sv2, sv3, mask3); + vec_u8_t srv400 = vec_perm(sv2, sv3, mask4); + vec_u8_t srv500 = vec_perm(sv2, sv3, mask5); + vec_u8_t srv600 = vec_perm(sv2, sv3, mask6); + +vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 
29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; 
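/* For reference: the offset[] and fraction[] tables quoted in the comments are
 * the standard HEVC angular projection for intraPredAngle A (A = 21 for mode 32,
 * A = 26 for mode 33):
 *     deltaPos    = (y + 1) * A;
 *     offset[y]   = deltaPos >> 5;      // whole-sample step into ref[]
 *     fraction[y] = deltaPos & 31;      // 1/32-sample weight, f32[y] = 32 - fraction[y]
 * For example, y = 16 with A = 21 gives deltaPos = 357, offset 11, fraction 5,
 * which matches vfrac16_16 = {5,...}, vfrac16_32_16 = {27,...} and the use of
 * srvb/srvc (ref[11..26] / ref[12..27]) for that row further down. */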
+vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv30, srv40, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv40, srv50, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv50, srv60, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv50, srv60, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv60, srv70, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv70, srv80, vfrac16_32_10, vfrac16_10, vout_21); + 
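/* Each 32-pixel row here is assembled from two 16-byte halves: the srvN vectors
 * (built from sv0/sv1) supply ref[offset .. offset+16] for columns 0..15, while
 * the srvN0 vectors (built from sv1/sv2, i.e. ref[16..47]) supply the same
 * offsets shifted by 16 for columns 16..31, so every row is a pair of one_line
 * calls stored at y*dstStride and y*dstStride+16. Rows 0..15 fill vout_0..vout_31,
 * which are written out and then reused for rows 16..31. */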
+ one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv70, srv80, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv80, srv90, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv90, srva0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv90, srva0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srva0, srvb0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srvb, srvc, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srvb0, srvc0, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srvb, srvc, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srvb0, srvc0, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srvc, srvd, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srvc0, srvd0, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srvd, srve, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srvd0, srve0, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srvd, srve, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srvd0, srve0, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srve, srvf, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srve0, srvf0, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srvf, srv00, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srvf0, srv000, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srvf, srv00, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srvf0, srv000, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv00, srv10, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv000, srv100, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv10, srv20, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv100, srv200, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv10, srv20, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv100, srv200, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv20, srv30, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv200, srv300, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv30, srv40, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv300, srv400, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv30, srv40, vfrac16_32_29, vfrac16_29, vout_26); + 
one_line(srv300, srv400, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv40, srv50, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv400, srv500, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv50, srv60, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv500, srv600, vfrac16_32_31, vfrac16_31, vout_31); + //int offset[32] = { 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 
3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06, 0x04, 0x05, 0x06, 0x07}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8}; + vec_u8_t vfrac4_32 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==4){ + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_ste((vec_u32_t)vout, 0, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 12), 16, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 8), 32, (unsigned int*)dst); + vec_ste((vec_u32_t)vec_sld(vout, vout, 4), 48, (unsigned int*)dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask2 = {0x08, 0x09, 0x0a, 0x0b, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask3 = {0x0c, 0x0d, 0x0e, 0x0f, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(vout, vec_xl(dstStride*2, dst), v_mask2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout, vec_xl(dstStride*3, dst), v_mask3); + vec_xst(v3, dstStride*3, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[0]* 
ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 7] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 7] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 7] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off3 + 7] + f[0] * ref[off3 + 7] + 16) >> 5); + + ... + + y=7; off7 = offset[7]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off7 + 7] + f[0] * ref[off7 + 7] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d}; + vec_u8_t mask7={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac8_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32_1); + vmlo0 = vec_mulo(srv2, vfrac8_32_1); + vmle1 = vec_mule(srv3, vfrac8_1); + vmlo1 = vec_mulo(srv3, vfrac8_1); + 
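+    /* Note: the same pattern as the y0/y1 block above repeats for every row
+     * pair: vec_mule/vec_mulo give the 16-bit products of the even and odd
+     * byte lanes, the two products plus the rounding constant 16 are shifted
+     * right by 5 to realize the scalar
+     * (f32[y] * ref[off + x] + f[y] * ref[off + x + 1] + 16) >> 5, and
+     * vec_mergeh/vec_mergel followed by vec_pack re-interleave the even/odd
+     * results back into pixel order, two 8-pixel rows per 16-byte vector. */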
vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32_2); + vmlo0 = vec_mulo(srv4, vfrac8_32_2); + vmle1 = vec_mule(srv5, vfrac8_2); + vmlo1 = vec_mulo(srv5, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32_3); + vmlo0 = vec_mulo(srv6, vfrac8_32_3); + vmle1 = vec_mule(srv7, vfrac8_3); + vmlo1 = vec_mulo(srv7, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + if(dstStride==8){ + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v_mask1 = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(vout_0, vec_xl(0, dst), v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(vout_0, vec_xl(dstStride, dst), v_mask1); + vec_xst(v1, dstStride, dst); + + vec_u8_t v2 = vec_perm(vout_1, vec_xl(dstStride*2, dst), v_mask0); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(vout_1, vec_xl(dstStride*3, dst), v_mask1); + vec_xst(v3, dstStride*3, dst); + + vec_u8_t v4 = vec_perm(vout_2, vec_xl(dstStride*4, dst), v_mask0); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(vout_2, vec_xl(dstStride*5, dst), v_mask1); + vec_xst(v5, dstStride*5, dst); + + vec_u8_t v6 = vec_perm(vout_3, vec_xl(dstStride*6, dst), v_mask0); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(vout_3, vec_xl(dstStride*7, dst), v_mask1); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5); + + ... + + y=15; off7 = offset[7]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + +vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
8, 8, 8, 8, 8}; +vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv8, srv9, 
vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, dstStride, dst); + vec_xst(vout_2, dstStride*2, dst); + vec_xst(vout_3, dstStride*3, dst); + vec_xst(vout_4, dstStride*4, dst); + vec_xst(vout_5, dstStride*5, dst); + vec_xst(vout_6, dstStride*6, dst); + vec_xst(vout_7, dstStride*7, dst); + vec_xst(vout_8, dstStride*8, dst); + vec_xst(vout_9, dstStride*9, dst); + vec_xst(vout_10, dstStride*10, dst); + vec_xst(vout_11, dstStride*11, dst); + vec_xst(vout_12, dstStride*12, dst); + vec_xst(vout_13, dstStride*13, dst); + vec_xst(vout_14, dstStride*14, dst); + vec_xst(vout_15, dstStride*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 33>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + ... + + y=15; off15 = offset[15]; x=0-31; off15-off30 = 1; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + + ... 
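+
+       (in the vector code below every 32-pixel row is produced as two
+        16-pixel halves, stored at dst + y*dstStride and dst + y*dstStride + 16)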
+ + y=31; off31= offset[31]; x=0-31; off31 = 2; + dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off15 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off15 + 1] + f[31] * ref[off31 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off15 + 2] + f[31] * ref[off31 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off15 + 3] + f[31] * ref[off31 + 4] + 16) >> 5); + ... + dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off15 + 31] + f[31] * ref[off31 + 32] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(1, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); 
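+    /* srv0..srvf are the reference samples shifted by 0..15 bytes relative to
+     * srcPix0 + 1 and serve the left 16 columns; srv00..srvf0 (built from
+     * sv1/sv2) and srv000.. (built from sv2/sv3) hold the same shifts starting
+     * 16 and 32 bytes further on, covering the right half of each row and the
+     * larger row offsets of the lower half of the block. */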
+ vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15); + + vec_u8_t srv000 = sv2; + vec_u8_t srv100 = vec_perm(sv2, sv3, mask1); + vec_u8_t srv200 = vec_perm(sv2, sv3, mask2); + vec_u8_t srv300 = vec_perm(sv2, sv3, mask3); + vec_u8_t srv400 = vec_perm(sv2, sv3, mask4); + vec_u8_t srv500 = vec_perm(sv2, sv3, mask5); + vec_u8_t srv600 = vec_perm(sv2, sv3, mask6); + vec_u8_t srv700 = vec_perm(sv2, sv3, mask7); + vec_u8_t srv800 = vec_perm(sv2, sv3, mask8); + vec_u8_t srv900 = vec_perm(sv2, sv3, mask9); + vec_u8_t srva00 = vec_perm(sv2, sv3, mask10); + vec_u8_t srvb00 = vec_perm(sv2, sv3, mask11); + +vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_16 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_17 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_18 = (vec_u8_t){14, 14, 14, 14, 
14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_19 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_20 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_21 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_22 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_23 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_24 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_25 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_26 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_27 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_28 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_29 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_30 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv20, srv30, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv30, srv40, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv40, srv50, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv40, srv50, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv50, srv60, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv60, srv70, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv70, srv80, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv8, srv9, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv80, srv90, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv80, srv90, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv90, srva0, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srva0, srvb0, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srvb0, srvc0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srvc0, srvd0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srvd0, srve0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, dstStride, dst); + vec_xst(vout_3, dstStride+16, dst); + vec_xst(vout_4, dstStride*2, dst); + vec_xst(vout_5, dstStride*2+16, dst); + vec_xst(vout_6, dstStride*3, dst); + vec_xst(vout_7, dstStride*3+16, dst); + vec_xst(vout_8, 
dstStride*4, dst); + vec_xst(vout_9, dstStride*4+16, dst); + vec_xst(vout_10, dstStride*5, dst); + vec_xst(vout_11, dstStride*5+16, dst); + vec_xst(vout_12, dstStride*6, dst); + vec_xst(vout_13, dstStride*6+16, dst); + vec_xst(vout_14, dstStride*7, dst); + vec_xst(vout_15, dstStride*7+16, dst); + vec_xst(vout_16, dstStride*8, dst); + vec_xst(vout_17, dstStride*8+16, dst); + vec_xst(vout_18, dstStride*9, dst); + vec_xst(vout_19, dstStride*9+16, dst); + vec_xst(vout_20, dstStride*10, dst); + vec_xst(vout_21, dstStride*10+16, dst); + vec_xst(vout_22, dstStride*11, dst); + vec_xst(vout_23, dstStride*11+16, dst); + vec_xst(vout_24, dstStride*12, dst); + vec_xst(vout_25, dstStride*12+16, dst); + vec_xst(vout_26, dstStride*13, dst); + vec_xst(vout_27, dstStride*13+16, dst); + vec_xst(vout_28, dstStride*14, dst); + vec_xst(vout_29, dstStride*14+16, dst); + vec_xst(vout_30, dstStride*15, dst); + vec_xst(vout_31, dstStride*15+16, dst); + + one_line(srvd, srve, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srvd0, srve0, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srve, srvf, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srve0, srvf0, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srvf, srv00, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srvf0, srv000, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv00, srv10, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv000, srv100, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv10, srv20, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv100, srv200, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv10, srv20, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv100, srv200, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv20, srv30, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv200, srv300, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv30, srv40, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv300, srv400, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv40, srv50, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv400, srv500, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv50, srv60, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv500, srv600, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv50, srv60, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv500, srv600, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv60, srv70, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv600, srv700, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv70, srv80, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv700, srv800, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv800, srv900, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv90, srva0, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv900, srva00, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srva0, srvb0, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srva00, srvb00, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, dstStride*16, dst); + vec_xst(vout_1, dstStride*16+16, dst); + vec_xst(vout_2, dstStride*17, dst); + vec_xst(vout_3, dstStride*17+16, dst); + vec_xst(vout_4, dstStride*18, dst); + vec_xst(vout_5, dstStride*18+16, dst); + vec_xst(vout_6, dstStride*19, dst); + vec_xst(vout_7, dstStride*19+16, dst); + vec_xst(vout_8, dstStride*20, dst); + vec_xst(vout_9, dstStride*20+16, dst); + vec_xst(vout_10, dstStride*21, dst); + vec_xst(vout_11, dstStride*21+16, dst); + vec_xst(vout_12, dstStride*22, dst); + vec_xst(vout_13, dstStride*22+16, dst); + vec_xst(vout_14, 
dstStride*23, dst); + vec_xst(vout_15, dstStride*23+16, dst); + vec_xst(vout_16, dstStride*24, dst); + vec_xst(vout_17, dstStride*24+16, dst); + vec_xst(vout_18, dstStride*25, dst); + vec_xst(vout_19, dstStride*25+16, dst); + vec_xst(vout_20, dstStride*26, dst); + vec_xst(vout_21, dstStride*26+16, dst); + vec_xst(vout_22, dstStride*27, dst); + vec_xst(vout_23, dstStride*27+16, dst); + vec_xst(vout_24, dstStride*28, dst); + vec_xst(vout_25, dstStride*28+16, dst); + vec_xst(vout_26, dstStride*29, dst); + vec_xst(vout_27, dstStride*29+16, dst); + vec_xst(vout_28, dstStride*30, dst); + vec_xst(vout_29, dstStride*30+16, dst); + vec_xst(vout_30, dstStride*31, dst); + vec_xst(vout_31, dstStride*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<4, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + if(dstStride == 4) { + const vec_u8_t srcV = vec_xl(2, srcPix0); + const vec_u8_t mask = {0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03,0x04, 0x02, 0x03,0x04,0x05, 0x03,0x04,0x05, 0x06}; + vec_u8_t vout = vec_perm(srcV, srcV, mask); + vec_xst(vout, 0, dst); + } + else if(dstStride%16 == 0){ + vec_u8_t v0 = vec_xl(2, srcPix0); + vec_ste((vec_u32_t)v0, 0, (unsigned int*)dst); + vec_u8_t v1 = vec_xl(3, srcPix0); + vec_ste((vec_u32_t)v1, 0, (unsigned int*)(dst+dstStride)); + vec_u8_t v2 = vec_xl(4, srcPix0); + vec_ste((vec_u32_t)v2, 0, (unsigned int*)(dst+dstStride*2)); + vec_u8_t v3 = vec_xl(5, srcPix0); + vec_ste((vec_u32_t)v3, 0, (unsigned int*)(dst+dstStride*3)); + } + else{ + const vec_u8_t srcV = vec_xl(2, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srcV, vec_xl(0, dst), mask_0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srcV, vec_xl(dstStride, dst), mask_1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srcV, vec_xl(dstStride*2, dst), mask_2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srcV, vec_xl(dstStride*3, dst), mask_3); + vec_xst(v3, dstStride*3, dst); + } +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<8, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + if(dstStride == 8) { + const vec_u8_t srcV1 = vec_xl(2, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03,0x04, 0x05, 0x06, 0x07, 0x08}; + const vec_u8_t mask_1 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + const vec_u8_t mask_2 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + const vec_u8_t mask_3 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e}; + vec_u8_t v0 = vec_perm(srcV1, 
srcV1, mask_0); + vec_u8_t v1 = vec_perm(srcV1, srcV1, mask_1); + vec_u8_t v2 = vec_perm(srcV1, srcV1, mask_2); + vec_u8_t v3 = vec_perm(srcV1, srcV1, mask_3); + vec_xst(v0, 0, dst); + vec_xst(v1, 16, dst); + vec_xst(v2, 32, dst); + vec_xst(v3, 48, dst); + } + else{ + const vec_u8_t srcV1 = vec_xl(2, srcPix0); /* offset = width2+2 = width<<1 + 2*/ + const vec_u8_t mask_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_1 = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_2 = {0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_3 = {0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_4 = {0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_5 = {0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_6 = {0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + const vec_u8_t mask_7 = {0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t v0 = vec_perm(srcV1, vec_xl(0, dst), mask_0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(srcV1, vec_xl(dstStride, dst), mask_1); + vec_xst(v1, dstStride, dst); + vec_u8_t v2 = vec_perm(srcV1, vec_xl(dstStride*2, dst), mask_2); + vec_xst(v2, dstStride*2, dst); + vec_u8_t v3 = vec_perm(srcV1, vec_xl(dstStride*3, dst), mask_3); + vec_xst(v3, dstStride*3, dst); + vec_u8_t v4 = vec_perm(srcV1, vec_xl(dstStride*4, dst), mask_4); + vec_xst(v4, dstStride*4, dst); + vec_u8_t v5 = vec_perm(srcV1, vec_xl(dstStride*5, dst), mask_5); + vec_xst(v5, dstStride*5, dst); + vec_u8_t v6 = vec_perm(srcV1, vec_xl(dstStride*6, dst), mask_6); + vec_xst(v6, dstStride*6, dst); + vec_u8_t v7 = vec_perm(srcV1, vec_xl(dstStride*7, dst), mask_7); + vec_xst(v7, dstStride*7, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<16, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + int i; + //int off = dstStride; + //const pixel *srcPix = srcPix0; + for(i=0; i<16; i++){ + vec_xst( vec_xl(2+i, srcPix0), i*dstStride, dst); /* first offset = width2+2 = width<<1 + 2*/ + } + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x <16; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void intra_pred<32, 34>(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int bFilter) +{ + int i; + int off = dstStride; + //const pixel *srcPix = srcPix0; + for(i=0; i<32; i++){ + off = i*dstStride; + vec_xst(vec_xl(2+i, srcPix0), off, dst); /* first offset = width2+2 = width<<1 + 2*/ + vec_xst(vec_xl(18+i, srcPix0), off+16, dst); /* first offset = width2+2 = width<<1 + 2*/ + } +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x <32; x++) + { + printf("%d ",dst[y * dstStride + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<int width> +void intra_pred_ang_altivec(pixel* dst, intptr_t dstStride, const pixel *srcPix0, int dirMode, int bFilter) +{ + const int size = width; + switch(dirMode){ + case 2: + 
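+        /* every angular mode 2-34 dispatches to the matching
+         * intra_pred<size, dirMode> specialization */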
intra_pred<size, 2>(dst, dstStride, srcPix0, bFilter); + return; + case 3: + intra_pred<size, 3>(dst, dstStride, srcPix0, bFilter); + return; + case 4: + intra_pred<size, 4>(dst, dstStride, srcPix0, bFilter); + return; + case 5: + intra_pred<size, 5>(dst, dstStride, srcPix0, bFilter); + return; + case 6: + intra_pred<size, 6>(dst, dstStride, srcPix0, bFilter); + return; + case 7: + intra_pred<size, 7>(dst, dstStride, srcPix0, bFilter); + return; + case 8: + intra_pred<size, 8>(dst, dstStride, srcPix0, bFilter); + return; + case 9: + intra_pred<size, 9>(dst, dstStride, srcPix0, bFilter); + return; + case 10: + intra_pred<size, 10>(dst, dstStride, srcPix0, bFilter); + return; + case 11: + intra_pred<size, 11>(dst, dstStride, srcPix0, bFilter); + return; + case 12: + intra_pred<size, 12>(dst, dstStride, srcPix0, bFilter); + return; + case 13: + intra_pred<size, 13>(dst, dstStride, srcPix0, bFilter); + return; + case 14: + intra_pred<size, 14>(dst, dstStride, srcPix0, bFilter); + return; + case 15: + intra_pred<size, 15>(dst, dstStride, srcPix0, bFilter); + return; + case 16: + intra_pred<size, 16>(dst, dstStride, srcPix0, bFilter); + return; + case 17: + intra_pred<size, 17>(dst, dstStride, srcPix0, bFilter); + return; + case 18: + intra_pred<size, 18>(dst, dstStride, srcPix0, bFilter); + return; + case 19: + intra_pred<size, 19>(dst, dstStride, srcPix0, bFilter); + return; + case 20: + intra_pred<size, 20>(dst, dstStride, srcPix0, bFilter); + return; + case 21: + intra_pred<size, 21>(dst, dstStride, srcPix0, bFilter); + return; + case 22: + intra_pred<size, 22>(dst, dstStride, srcPix0, bFilter); + return; + case 23: + intra_pred<size, 23>(dst, dstStride, srcPix0, bFilter); + return; + case 24: + intra_pred<size, 24>(dst, dstStride, srcPix0, bFilter); + return; + case 25: + intra_pred<size, 25>(dst, dstStride, srcPix0, bFilter); + return; + case 26: + intra_pred<size, 26>(dst, dstStride, srcPix0, bFilter); + return; + case 27: + intra_pred<size, 27>(dst, dstStride, srcPix0, bFilter); + return; + case 28: + intra_pred<size, 28>(dst, dstStride, srcPix0, bFilter); + return; + case 29: + intra_pred<size, 29>(dst, dstStride, srcPix0, bFilter); + return; + case 30: + intra_pred<size, 30>(dst, dstStride, srcPix0, bFilter); + return; + case 31: + intra_pred<size, 31>(dst, dstStride, srcPix0, bFilter); + return; + case 32: + intra_pred<size, 32>(dst, dstStride, srcPix0, bFilter); + return; + case 33: + intra_pred<size, 33>(dst, dstStride, srcPix0, bFilter); + return; + case 34: + intra_pred<size, 34>(dst, dstStride, srcPix0, bFilter); + return; + default: + printf("No supported intra prediction mode\n"); + exit(1); + } +} + +template<int dstStride, int dirMode> +void one_ang_pred_altivec(pixel* dst, const pixel *srcPix0, int bFilter){}; + +template<> +void one_ang_pred_altivec<4, 2>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 2>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 2>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 2>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 2>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 2>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 2>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 2>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 18>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 18>(dst, 4, srcPix0, bFilter); + 
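+    /* most one_ang_pred_altivec<width, mode> specializations simply forward to
+     * intra_pred<width, mode> with dstStride fixed to the block width; a few
+     * modes (e.g. mode 6) get dedicated implementations further down. */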
return; +} + +template<> +void one_ang_pred_altivec<8, 18>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 18>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 18>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 18>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 18>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 18>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 19>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 19>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 19>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 19>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 19>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 19>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 19>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 19>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 20>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 20>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 20>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 20>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 20>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 20>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 20>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 20>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 21>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 21>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 21>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 21>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 21>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 21>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 21>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 21>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 22>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 22>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 22>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 22>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 22>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 22>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 22>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 22>(dst, 32, srcPix0, bFilter); + return; +} + + +template<> +void one_ang_pred_altivec<4, 23>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 23>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 23>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 23>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 23>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 23>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void 
one_ang_pred_altivec<32, 23>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 23>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 24>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 24>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 24>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 24>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 24>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 24>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 24>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 24>(dst, 32, srcPix0, bFilter); + return; +} + + +template<> +void one_ang_pred_altivec<4, 25>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 25>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 25>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 25>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 25>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 25>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 25>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 25>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 27>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 27>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 27>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 27>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 27>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 27>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 27>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 27>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 28>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 28>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 28>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 28>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 28>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 28>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 28>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 28>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 29>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 29>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 29>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 29>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 29>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 29>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 29>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 29>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 30>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 30>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 30>(pixel* dst, 
const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 30>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 30>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 30>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 30>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 30>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 31>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 31>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 31>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 31>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 31>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 31>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 31>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 31>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 32>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 32>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 32>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 32>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 32>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 32>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 32>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 32>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 33>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 33>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 33>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 33>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 33>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 33>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 33>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 33>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 34>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<4, 34>(dst, 4, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<8, 34>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<8, 34>(dst, 8, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<16, 34>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<16, 34>(dst, 16, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<32, 34>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + intra_pred<32, 34>(dst, 32, srcPix0, bFilter); + return; +} + +template<> +void one_ang_pred_altivec<4, 6>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update 
for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 6>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask5={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 2 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 3 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 3, 4 */ + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac8_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac8_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14 }; + vec_u8_t vfrac8_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, 
vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv1, vfrac8_32_1); + vmlo0 = vec_mulo(srv1, vfrac8_32_1); + vmle1 = vec_mule(srv2, vfrac8_1); + vmlo1 = vec_mulo(srv2, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv2, vfrac8_32_2); + vmlo0 = vec_mulo(srv2, vfrac8_32_2); + vmle1 = vec_mule(srv3, vfrac8_2); + vmlo1 = vec_mulo(srv3, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv4, vfrac8_32_3); + vmlo0 = vec_mulo(srv4, vfrac8_32_3); + vmle1 = vec_mule(srv5, vfrac8_3); + vmlo1 = vec_mulo(srv5, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 6>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + /*vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 
0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d};*/ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + //vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + //vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + //vec_u8_t srva = vec_perm(sv0, sv1, mask10); + //vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + //vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + //vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + //vec_u8_t srve = vec_perm(sv0, sv1, mask14); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 6>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 
0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_5 = 
(vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 
30, 30, 30}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_32_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv2, srv3, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv20, srv30, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv2, srv3, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv20, srv30, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv3, srv4, vfrac16_32_7, vfrac16_7, vout_14); + 
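+    /* Each row of this 32x32 block is assembled from two 16-pixel halves: the
+       srv* vectors (built from sv0/sv1) produce the left half and the srv*0
+       vectors (built from sv1/sv2) the right half, stored at 32*row and 32*row+16. */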
one_line(srv30, srv40, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv3, srv4, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv30, srv40, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv40, srv50, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv40, srv50, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv40, srv50, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv50, srv60, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv50, srv60, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv60, srv70, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv60, srv70, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv60, srv70, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv7, srv8, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv70, srv80, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv7, srv8, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv70, srv80, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv8, srv9, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv80, srv90, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv8, srv9, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv80, srv90, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv8, srv9, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv80, srv90, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv9, srva, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv90, srva0, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv9, srva, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv90, srva0, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srva, srvb, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srva0, srvb0, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srva, srvb, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srva0, srvb0, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srva, srvb, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srva0, srvb0, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srvb, srvc, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srvb0, srvc0, vfrac16_32_27, vfrac16_27, vout_23); + 
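+    /* Rows 16-31 reuse the same vout_0..vout_31 temporaries; they became free
+       once the first sixteen rows were flushed to dst by the vec_xst block above. */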
+ one_line(srvb, srvc, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srvb0, srvc0, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srvc, srvd, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srvc0, srvd0, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srvc, srvd, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srvc0, srvd0, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srvd, srve, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srvd0, srve0, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 7>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), 
u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 7>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask2={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask3={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask4={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask5={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 0, 1 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 1, 2 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 2 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 2, 3 */ + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac8_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac8_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac8_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32_1); + vmlo0 = vec_mulo(srv2, vfrac8_32_1); + vmle1 = vec_mule(srv3, vfrac8_1); + vmlo1 = vec_mulo(srv3, vfrac8_1); + 
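+    /* vec_mule/vec_mulo return 16-bit products of the even/odd byte lanes; each
+       half is rounded (+16) and shifted right by 5, then the lanes are
+       re-interleaved with vec_mergeh/vec_mergel and narrowed back to bytes by vec_pack. */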
vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv1, vfrac8_32_2); + vmlo0 = vec_mulo(srv1, vfrac8_32_2); + vmle1 = vec_mule(srv4, vfrac8_2); + vmlo1 = vec_mulo(srv4, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv3, vfrac8_32_3); + vmlo0 = vec_mulo(srv3, vfrac8_32_3); + vmle1 = vec_mule(srv5, vfrac8_3); + vmlo1 = vec_mulo(srv5, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 7>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_10); + 
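+    /* one_line is presumably a helper macro defined earlier in this file that
+       expands to the same mule/mulo + add-16 + shift-5 + merge/pack sequence
+       spelled out in the 4x4 and 8x8 specializations above. */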
one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 7>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 29: + //int offset[32] = {0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9}; + //int fraction[32] = {9, 18, 27, 4, 13, 22, 31, 8, 17, 26, 3, 12, 21, 30, 7, 16, 25, 2, 11, 20, 29, 6, 15, 24, 1, 10, 19, 28, 5, 14, 23, 0}; + + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = 
vec_perm(sv0, sv1, mask10); + + vec_u8_t srv00 = sv1; + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 }; + vec_u8_t vfrac16_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0}; + + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_32_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_29 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, 
vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv00, srv10, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv00, srv10, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv10, srv20, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv10, srv20, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv10, srv20, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv10, srv20, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv2, srv3, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv20, srv30, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv2, srv3, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv20, srv30, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv2, srv3, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv20, srv30, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv3, srv4, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv30, srv40, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv3, srv4, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv30, srv40, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv3, srv4, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv30, srv40, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv3, srv4, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv30, srv40, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv40, srv50, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv4, srv5, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv40, srv50, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv4, srv5, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv40, srv50, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv5, srv6, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv50, srv60, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv5, srv6, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv50, srv60, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv5, srv6, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv50, srv60, 
vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv5, srv6, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv50, srv60, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv6, srv7, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv60, srv70, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv6, srv7, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv60, srv70, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv6, srv7, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv60, srv70, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv7, srv8, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv70, srv80, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv7, srv8, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv70, srv80, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv7, srv8, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv70, srv80, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv7, srv8, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv70, srv80, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv80, srv90, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv80, srv90, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv9, srva, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv90, srva0, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 8>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 28 + //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5}; + //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 
*/ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 8>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 28 + //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5}; + //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac8_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac8_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac8_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t 
vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv0, vfrac8_32_1); + vmlo0 = vec_mulo(srv0, vfrac8_32_1); + vmle1 = vec_mule(srv1, vfrac8_1); + vmlo1 = vec_mulo(srv1, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv0, vfrac8_32_2); + vmlo0 = vec_mulo(srv0, vfrac8_32_2); + vmle1 = vec_mule(srv1, vfrac8_2); + vmlo1 = vec_mulo(srv1, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv1, vfrac8_32_3); + vmlo0 = vec_mulo(srv1, vfrac8_32_3); + vmle1 = vec_mule(srv2, vfrac8_3); + vmlo1 = vec_mulo(srv2, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 8>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + + //mode 28 + //int offset[32] = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5}; + //int fraction[32] = {5, 10, 15, 20, 25, 30, 3, 8, 13, 18, 23, 28, 1, 6, 11, 16, 21, 26, 31, 4, 9, 14, 19, 24, 29, 2, 7, 12, 17, 22, 27, 0}; + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 
15, 15, 15}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13,13, 13, 13, 13,13, 13, 13, 13}; + vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, 
vout_6); + one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 8>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y= 15, use srv1, srv2 */ + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */ + vec_u8_t srv8 = vec_perm(sv0, sv1, mask4); /* y=31, use srv2, srv3 */ + vec_u8_t srv9 = vec_perm(sv0, sv1, mask5); /* y=31, use srv2, srv3 */ + vec_u8_t srv12 = vec_perm(sv0, sv1, mask6); /* y=31, use srv2, srv3 */ + + vec_u8_t srv4 = sv1; + vec_u8_t srv5 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv6 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv7 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv10 = vec_perm(sv1, sv2, mask4); /* y=31, use srv2, srv3 */ + vec_u8_t srv11 = vec_perm(sv1, sv2, mask5); /* y=31, use srv2, srv3 */ + vec_u8_t srv13 = vec_perm(sv1, sv2, mask6); /* y=31, use srv2, srv3 */ + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_1 = 
(vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13,13, 13, 13, 13,13, 13, 13, 13}; + vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_16 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t 
vfrac16_32_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + vec_u8_t vfrac16_32_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; + vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; + vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; + vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; + vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; + vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; + vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; + vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, 
vout_8); + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv2, srv3, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv2, srv3, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv6, srv7, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv2, srv3, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv6, srv7, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv2, srv3, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv6, srv7, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv2, srv3, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv6, srv7, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv2, srv3, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv6, srv7, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv3, srv8, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv7, srv10, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv3, srv8, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv7, srv10, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv3, srv8, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv7, srv10, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv3, srv8, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv7, srv10, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv3, srv8, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv7, srv10, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv3, srv8, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv7, srv10, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv8, srv9, vfrac16_32_25, 
vfrac16_25, vout_18); + one_line(srv10, srv11, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv8, srv9, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv10, srv11, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv8, srv9, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv10, srv11, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv8, srv9, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv10, srv11, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv8, srv9, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv10, srv11, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv8, srv9, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv10, srv11, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv9, srv12, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv11, srv13, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 9>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03, 0x00, 0x01, 0x02, 0x03}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24}; /* 32 - fraction[0-3] */ + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + 
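/* vec_mule/vec_mulo multiply the even and odd byte lanes separately, so the 16-bit intermediates come out de-interleaved; the odd half gets the same +16, >>5 rounding below, and vec_mergeh/vec_mergel restore the original lane order before vec_pack narrows the 4x4 result back to bytes */ +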
vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 9>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ /*width2*/ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + /* fraction[0-7] */ + vec_u8_t vfrac8_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac8_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac8_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac8_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-7] */ + vec_u8_t vfrac8_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac8_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac8_32_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac8_32_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv0, vfrac8_32_1); + vmlo0 = vec_mulo(srv0, vfrac8_32_1); + vmle1 = vec_mule(srv1, vfrac8_1); + vmlo1 = vec_mulo(srv1, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv0, vfrac8_32_2); + vmlo0 = vec_mulo(srv0, vfrac8_32_2); + vmle1 = vec_mule(srv1, vfrac8_2); + vmlo1 = vec_mulo(srv1, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv0, vfrac8_32_3); + vmlo0 = vec_mulo(srv0, vfrac8_32_3); + vmle1 = vec_mule(srv1, vfrac8_3); + vmlo1 = vec_mulo(srv1, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), 
u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 9>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = vec_perm(sv0, sv1, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_5 = 
(vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + vec_xst(vout_4, 64, dst); + vec_xst(vout_5, 80, dst); + vec_xst(vout_6, 96, dst); + vec_xst(vout_7, 112, dst); + vec_xst(vout_8, 128, dst); + vec_xst(vout_9, 144, dst); + vec_xst(vout_10, 160, dst); + vec_xst(vout_11, 176, dst); + vec_xst(vout_12, 192, dst); + vec_xst(vout_13, 208, dst); + vec_xst(vout_14, 224, dst); + vec_xst(vout_15, 240, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 9>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t 
u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); /* from y= 15, use srv1, srv2 */ + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); /* y=31, use srv2, srv3 */ + + vec_u8_t srv4 = sv1; + vec_u8_t srv5 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv6 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv7 = vec_perm(sv2, sv2, mask3); + + /* fraction[0-15] */ + vec_u8_t vfrac16_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32 - fraction[0-15] */ + vec_u8_t vfrac16_32_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30,30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv4, srv5, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv4, srv5, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv4, srv5, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv4, srv5, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv4, srv5, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv4, srv5, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv4, srv5, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv4, srv5, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv4, srv5, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv4, srv5, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv1, srv2, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv5, srv6, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + 
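/* each 32-pixel row y is written as two 16-byte stores at byte offsets 32*y and 32*y + 16; this batch covers rows 0-15, and the second batch below reuses vout_0..vout_31 for rows 16-31 */ +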
vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + + one_line(srv1, srv2, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv5, srv6, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv5, srv6, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv5, srv6, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv1, srv2, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv5, srv6, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv1, srv2, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv5, srv6, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv1, srv2, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv5, srv6, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv1, srv2, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv1, srv2, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv1, srv2, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv1, srv2, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv1, srv2, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv1, srv2, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv5, srv6, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv1, srv2, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv5, srv6, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv1, srv2, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv5, srv6, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv1, srv2, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv5, srv6, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv2, srv3, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv6, srv7, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 10>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srcV = vec_xl(9, srcPix0); /* offset 
= width2+1 = width<<1 + 1 */ + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srcV, u8_to_s16_b0_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w4x4_mask1)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t mask = {0x00, 0x11, 0x12, 0x13, 0x01, 0x11, 0x12, 0x13, 0x02, 0x11, 0x12, 0x13, 0x03, 0x11, 0x12, 0x13}; + vec_u8_t v0 = vec_perm(v_filter_u8, srcV, mask); + vec_xst(v0, 0, dst); + } + else{ + vec_u8_t v0 = (vec_u8_t)vec_splat((vec_u32_t)srcV, 0); + vec_xst(v0, 0, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 10>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srcV = vec_xl(17, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srcV, u8_to_s16_b0_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_mask1)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t v_mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t v_mask1 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t v_mask2 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t v_mask3 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t v0 = vec_perm(v_filter_u8, srcV, v_mask0); + vec_xst(v0, 0, dst); + vec_u8_t v1 = vec_perm(v_filter_u8, srcV, v_mask1); + vec_xst(v1, 16, dst); + vec_u8_t v2 = vec_perm(v_filter_u8, srcV, v_mask2); + vec_xst(v2, 32, dst); + vec_u8_t v3 = vec_perm(v_filter_u8, srcV, v_mask3); + vec_xst(v3, 48, dst); + } + else{ + vec_u8_t v_mask0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07}; + vec_u8_t v0 = vec_perm(srcV, srcV, v_mask0); + vec_xst(v0, 0, dst); + vec_xst(v0, 16, dst); + vec_xst(v0, 32, dst); + vec_xst(v0, 48, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif + +} + +template<> +void one_ang_pred_altivec<16, 10>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(33, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_u8_t srcv1 = vec_xl(1, srcPix0); + vec_s16_t v0h_s16 = 
(vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + + vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + + vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask1), 16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask2), 32, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask3), 48, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask4), 64, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask5), 80, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask6), 96, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask7), 112, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask8), 128, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask9), 144, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask10), 160, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask11), 176, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask12), 192, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask13), 208, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask14), 224, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask15), 240, dst); + } + else{ + vec_xst(srv, 0, dst); + vec_xst(srv, 16, dst); + vec_xst(srv, 32, dst); + vec_xst(srv, 48, dst); + 
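/* mode 10 without the edge filter: every one of the 16 rows receives the same reference vector srv (16 bytes loaded from srcPix0 + width*2 + 1), so the block is filled by 16 identical stores */ +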
vec_xst(srv, 64, dst); + vec_xst(srv, 80, dst); + vec_xst(srv, 96, dst); + vec_xst(srv, 112, dst); + vec_xst(srv, 128, dst); + vec_xst(srv, 144, dst); + vec_xst(srv, 160, dst); + vec_xst(srv, 176, dst); + vec_xst(srv, 192, dst); + vec_xst(srv, 208, dst); + vec_xst(srv, 224, dst); + vec_xst(srv, 240, dst); + } +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 10>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(65, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_u8_t srv1 =vec_xl(81, srcPix0); + //vec_u8_t vout; + int offset = 0; + + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_u8_t srcv1 = vec_xl(1, srcPix0); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + + vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 
0x1e, 0x1f}; + vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst); + vec_xst(srv1, 16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask1), 32, dst); + vec_xst(srv1, 48, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask2), 64, dst); + vec_xst(srv1, 80, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask3), 96, dst); + vec_xst(srv1, 112, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask4), 128, dst); + vec_xst(srv1, 144, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask5), 160, dst); + vec_xst(srv1, 176, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask6), 192, dst); + vec_xst(srv1, 208, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask7), 224, dst); + vec_xst(srv1, 240, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask8), 256, dst); + vec_xst(srv1, 272, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask9), 288, dst); + vec_xst(srv1, 304, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask10), 320, dst); + vec_xst(srv1, 336, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask11), 352, dst); + vec_xst(srv1, 368, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask12), 384, dst); + vec_xst(srv1, 400, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask13), 416, dst); + vec_xst(srv1, 432, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask14), 448, dst); + vec_xst(srv1, 464, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask15), 480, dst); + vec_xst(srv1, 496, dst); + + vec_u8_t srcv2 = vec_xl(17, srcPix0); + vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh)); + vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl)); + vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v ); + vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v ); + vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16); + vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16); + vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum)); + vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum)); + vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16); + vec_xst(vec_perm(v2_filter_u8, srv, mask0), 512, dst); + vec_xst(srv1, 528, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask1), 544, dst); + vec_xst(srv1, 560, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask2), 576, dst); + vec_xst(srv1, 592, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask3), 608, dst); + vec_xst(srv1, 624, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask4), 640, dst); + vec_xst(srv1, 656, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask5), 672, dst); + vec_xst(srv1, 688, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask6), 704, dst); + vec_xst(srv1, 720, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask7), 736, dst); + vec_xst(srv1, 752, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask8), 768, dst); + vec_xst(srv1, 784, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask9), 800, dst); + vec_xst(srv1, 816, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask10), 832, dst); + vec_xst(srv1, 848, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask11), 864, dst); + vec_xst(srv1, 880, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask12), 896, dst); + vec_xst(srv1, 912, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask13), 928, dst); + vec_xst(srv1, 944, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask14), 960, dst); + vec_xst(srv1, 976, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask15), 992, dst); + vec_xst(srv1, 1008, dst); + + } + else{ + for(int i = 0; i<32; i++){ + vec_xst(srv, offset, dst); + vec_xst(srv1, offset+16, dst); + offset += 32; + } 
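/* For reference, the replication loop above has this scalar equivalent: every
 * row of the 32x32 block receives the same 32 reference bytes that srv and
 * srv1 were loaded with, i.e. srcPix0[65..96].  Illustrative sketch only. */
for (int y = 0; y < 32; y++)
    for (int x = 0; x < 32; x++)
        dst[y * 32 + x] = srcPix0[65 + x];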
+ } +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 26>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(0, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_u8_t v0; + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_sld(srv, srv, 15); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_w4x4_mask9)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t v_mask = {0x10, 0x02, 0x03, 0x04, 0x11, 0x02, 0x03, 0x04, 0x12, 0x02, 0x03, 0x04, 0x13, 0x02, 0x03, 0x04}; + v0 = vec_perm(srv, v_filter_u8, v_mask); + } + else{ + vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04}; + v0 = vec_perm(srv, srv, v_mask); + } + vec_xst(v0, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 26>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(0, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(17, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask)); + vec_s16_t v0_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh)); + vec_s16_t v1_s16 = (vec_s16_t)vec_sra( vec_sub(v0_s16, c0_s16v), one_u16v ); + vec_s16_t v_sum = vec_add(c1_s16v, v1_s16); + vec_u16_t v_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v_sum)); + vec_u8_t v_filter_u8 = vec_pack(v_filter_u16, zero_u16v); + vec_u8_t v_mask0 = {0x00, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x01, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v_mask1 = {0x02, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x03, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v_mask2 = {0x04, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x05, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v_mask3 = {0x06, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x07, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t v0 = vec_perm(v_filter_u8, srv, v_mask0); + vec_u8_t v1 = vec_perm(v_filter_u8, srv, v_mask1); + vec_u8_t v2 = vec_perm(v_filter_u8, srv, v_mask2); + vec_u8_t v3 = vec_perm(v_filter_u8, srv, v_mask3); + vec_xst(v0, 0, dst); + vec_xst(v1, 16, dst); + vec_xst(v2, 32, dst); + vec_xst(v3, 48, dst); + } + else{ + vec_u8_t v_mask = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t v0 = vec_perm(srv, srv, v_mask); + vec_xst(v0, 0, dst); + vec_xst(v0, 16, dst); + vec_xst(v0, 32, dst); + vec_xst(v0, 48, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 26>(pixel* dst, const pixel 
*srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(0, srcPix0); + vec_u8_t srv1 =vec_xl(1, srcPix0); + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(33, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b1_mask)); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + + + vec_xst(vec_perm(v_filter_u8, srv1, mask0), 0, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask1), 16, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask2), 32, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask3), 48, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask4), 64, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask5), 80, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask6), 96, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask7), 112, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask8), 128, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask9), 144, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask10), 160, dst); + 
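/* Scalar sketch of what this 16x16 mode-26 (vertical) specialization computes
 * when bFilter is set, matching the loads above (vec_xl(0/1/33, srcPix0)) and
 * assuming u8_to_s16_b0_mask / u8_to_s16_b1_mask extract bytes 0 and 1 and
 * min_s16v is the 8-bit maximum 255.  Illustrative only: the prediction is a
 * copy of the above-row samples srcPix0[1..16], with the first column filtered
 * against the left column srcPix0[33..48] and the corner sample srcPix0[0]. */
for (int y = 0; y < 16; y++)
{
    for (int x = 0; x < 16; x++)
        dst[y * 16 + x] = srcPix0[1 + x];                        /* vertical copy */
    int v = srcPix0[1] + ((srcPix0[33 + y] - srcPix0[0]) >> 1);  /* edge filter   */
    dst[y * 16] = (pixel)(v < 0 ? 0 : (v > 255 ? 255 : v));      /* clamp to u8   */
}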
vec_xst(vec_perm(v_filter_u8, srv1, mask11), 176, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask12), 192, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask13), 208, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask14), 224, dst); + vec_xst(vec_perm(v_filter_u8, srv1, mask15), 240, dst); + } + else{ + vec_xst(srv1, 0, dst); + vec_xst(srv1, 16, dst); + vec_xst(srv1, 32, dst); + vec_xst(srv1, 48, dst); + vec_xst(srv1, 64, dst); + vec_xst(srv1, 80, dst); + vec_xst(srv1, 96, dst); + vec_xst(srv1, 112, dst); + vec_xst(srv1, 128, dst); + vec_xst(srv1, 144, dst); + vec_xst(srv1, 160, dst); + vec_xst(srv1, 176, dst); + vec_xst(srv1, 192, dst); + vec_xst(srv1, 208, dst); + vec_xst(srv1, 224, dst); + vec_xst(srv1, 240, dst); + } + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 26>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t srv =vec_xl(1, srcPix0); /* offset = width2+1 = width<<1 + 1 */ + vec_u8_t srv1 =vec_xl(17, srcPix0); + + if (bFilter){ + LOAD_ZERO; + vec_u8_t tmp_v = vec_xl(0, srcPix0); + vec_s16_t c0_s16v = (vec_s16_t)(vec_perm(zero_u8v, tmp_v, u8_to_s16_b0_mask)); + vec_s16_t c1_s16v = (vec_s16_t)(vec_perm(zero_u8v, srv, u8_to_s16_b0_mask)); + vec_u8_t srcv1 = vec_xl(65, srcPix0); + vec_s16_t v0h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskh)); + vec_s16_t v0l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv1, u8_to_s16_w8x8_maskl)); + vec_s16_t v1h_s16 = (vec_s16_t)vec_sra( vec_sub(v0h_s16, c0_s16v), one_u16v ); + vec_s16_t v1l_s16 = (vec_s16_t)vec_sra( vec_sub(v0l_s16, c0_s16v), one_u16v ); + + vec_s16_t vh_sum = vec_add(c1_s16v, v1h_s16); + vec_s16_t vl_sum = vec_add(c1_s16v, v1l_s16); + vec_u16_t vh_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vh_sum)); + vec_u16_t vl_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, vl_sum)); + vec_u8_t v_filter_u8 = vec_pack(vh_filter_u16, vl_filter_u16); + + vec_u8_t mask0 = {0x00, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask1 = {0x01, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask2 = {0x02, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask3 = {0x03, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask4 = {0x04, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask5 = {0x05, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask6 = {0x06, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask7 = {0x07, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask8 = {0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask9 = {0x09, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask10 = {0xa, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask11 = {0xb, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask12 = {0xc, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 
0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask13 = {0xd, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask14 = {0xe, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_u8_t mask15 = {0xf, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; + vec_xst(vec_perm(v_filter_u8, srv, mask0), 0, dst); + vec_xst(srv1, 16, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask1), 32, dst); + vec_xst(srv1, 48, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask2), 64, dst); + vec_xst(srv1, 80, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask3), 96, dst); + vec_xst(srv1, 112, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask4), 128, dst); + vec_xst(srv1, 144, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask5), 160, dst); + vec_xst(srv1, 176, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask6), 192, dst); + vec_xst(srv1, 208, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask7), 224, dst); + vec_xst(srv1, 240, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask8), 256, dst); + vec_xst(srv1, 272, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask9), 288, dst); + vec_xst(srv1, 304, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask10), 320, dst); + vec_xst(srv1, 336, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask11), 352, dst); + vec_xst(srv1, 368, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask12), 384, dst); + vec_xst(srv1, 400, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask13), 416, dst); + vec_xst(srv1, 432, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask14), 448, dst); + vec_xst(srv1, 464, dst); + vec_xst(vec_perm(v_filter_u8, srv, mask15), 480, dst); + vec_xst(srv1, 496, dst); + + vec_u8_t srcv2 = vec_xl(81, srcPix0); + vec_s16_t v2h_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskh)); + vec_s16_t v2l_s16 = (vec_s16_t)(vec_perm(zero_u8v, srcv2, u8_to_s16_w8x8_maskl)); + vec_s16_t v3h_s16 = (vec_s16_t)vec_sra( vec_sub(v2h_s16, c0_s16v), one_u16v ); + vec_s16_t v3l_s16 = (vec_s16_t)vec_sra( vec_sub(v2l_s16, c0_s16v), one_u16v ); + vec_s16_t v2h_sum = vec_add(c1_s16v, v3h_s16); + vec_s16_t v2l_sum = vec_add(c1_s16v, v3l_s16); + vec_u16_t v2h_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2h_sum)); + vec_u16_t v2l_filter_u16 = (vector unsigned short)vec_min( min_s16v, vec_max(zero_s16v, v2l_sum)); + vec_u8_t v2_filter_u8 = vec_pack(v2h_filter_u16, v2l_filter_u16); + + vec_xst(vec_perm(v2_filter_u8, srv, mask0), 512, dst); + vec_xst(srv1, 528, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask1), 544, dst); + vec_xst(srv1, 560, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask2), 576, dst); + vec_xst(srv1, 592, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask3), 608, dst); + vec_xst(srv1, 624, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask4), 640, dst); + vec_xst(srv1, 656, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask5), 672, dst); + vec_xst(srv1, 688, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask6), 704, dst); + vec_xst(srv1, 720, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask7), 736, dst); + vec_xst(srv1, 752, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask8), 768, dst); + vec_xst(srv1, 784, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask9), 800, dst); + vec_xst(srv1, 816, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask10), 832, dst); + vec_xst(srv1, 848, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask11), 864, dst); + vec_xst(srv1, 880, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask12), 896, dst); + vec_xst(srv1, 912, 
dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask13), 928, dst); + vec_xst(srv1, 944, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask14), 960, dst); + vec_xst(srv1, 976, dst); + vec_xst(vec_perm(v2_filter_u8, srv, mask15), 992, dst); + vec_xst(srv1, 1008, dst); + + } + else{ + int offset = 0; + for(int i=0; i<32; i++){ + vec_xst(srv, offset, dst); + vec_xst(srv1, 16+offset, dst); + offset += 32; + } + } +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 3>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06, 0x04, 0x05, 0x06, 0x07}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8}; + vec_u8_t vfrac4_32 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 3>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t 
mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d}; + vec_u8_t mask7={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 4 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + vec_u8_t srv7 = vec_perm(srv, srv, mask7); /* 6, 7 */ + +vec_u8_t vfrac8_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac8_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y2, y3 */ + vmle0 = vec_mule(srv2, vfrac8_32_1); + vmlo0 = vec_mulo(srv2, vfrac8_32_1); + vmle1 = vec_mule(srv3, vfrac8_1); + vmlo1 = vec_mulo(srv3, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv4, vfrac8_32_2); + vmlo0 = vec_mulo(srv4, vfrac8_32_2); + vmle1 = vec_mule(srv5, vfrac8_2); + vmlo1 = vec_mulo(srv5, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv6, vfrac8_32_3); + vmlo0 = vec_mulo(srv6, vfrac8_32_3); + vmle1 = 
vec_mule(srv7, vfrac8_3); + vmlo1 = vec_mulo(srv7, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 3>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5); + + ... + + y=15; off7 = offset[7]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + +vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
8, 8, 8, 8, 8}; +vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv8, srv9, 
vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 3>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + ... + + y=15; off15 = offset[15]; x=0-31; off15-off30 = 1; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + + ... 
+ + y=31; off31= offset[31]; x=0-31; off31 = 2; + dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off15 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off15 + 1] + f[31] * ref[off31 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off15 + 2] + f[31] * ref[off31 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off15 + 3] + f[31] * ref[off31 + 4] + 16) >> 5); + ... + dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off15 + 31] + f[31] * ref[off31 + 32] + 16) >> 5); + } + */ + //mode 33: + //int offset[32] = {0, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 17, 18, 19, 20, 21, 21, 22, 23, 24, 25, 26}; + //int fraction[32] = {26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0, 26, 20, 14, 8, 2, 28, 22, 16, 10, 4, 30, 24, 18, 12, 6, 0}; + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, 
mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15); + + vec_u8_t srv000 = sv2; + vec_u8_t srv100 = vec_perm(sv2, sv3, mask1); + vec_u8_t srv200 = vec_perm(sv2, sv3, mask2); + vec_u8_t srv300 = vec_perm(sv2, sv3, mask3); + vec_u8_t srv400 = vec_perm(sv2, sv3, mask4); + vec_u8_t srv500 = vec_perm(sv2, sv3, mask5); + vec_u8_t srv600 = vec_perm(sv2, sv3, mask6); + vec_u8_t srv700 = vec_perm(sv2, sv3, mask7); + vec_u8_t srv800 = vec_perm(sv2, sv3, mask8); + vec_u8_t srv900 = vec_perm(sv2, sv3, mask9); + vec_u8_t srva00 = vec_perm(sv2, sv3, mask10); + vec_u8_t srvb00 = vec_perm(sv2, sv3, mask11); + +vec_u8_t vfrac16_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_16 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_17 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_18 = (vec_u8_t){14, 14, 
14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_19 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_20 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_21 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_22 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_23 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_24 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_25 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_26 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_27 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_28 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_29 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_30 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv3, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv20, srv30, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv4, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv30, srv40, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv5, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv40, srv50, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv4, srv5, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv40, srv50, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv5, srv6, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv50, srv60, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv6, srv7, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv60, srv70, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv7, srv8, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv70, srv80, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv8, srv9, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv80, srv90, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv8, srv9, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv80, srv90, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv9, srva, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv90, srva0, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srva, srvb, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srva0, srvb0, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srvb, srvc, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srvb0, srvc0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srvc, srvd, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srvc0, srvd0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srvd, srve, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srvd0, srve0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, 
dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srvd, srve, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srvd0, srve0, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srve, srvf, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srve0, srvf0, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srvf, srv00, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srvf0, srv000, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv00, srv10, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv000, srv100, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv10, srv20, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv100, srv200, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv10, srv20, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv100, srv200, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv20, srv30, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv200, srv300, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv30, srv40, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv300, srv400, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv40, srv50, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv400, srv500, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv50, srv60, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv500, srv600, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv50, srv60, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv500, srv600, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv60, srv70, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv600, srv700, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv70, srv80, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv700, srv800, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv80, srv90, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv800, srv900, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv90, srva0, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv900, srva00, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srva0, srvb0, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srva00, srvb00, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, 
dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 4>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + +vec_u8_t vfrac4 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20}; +vec_u8_t vfrac4_32 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 4>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask5={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + vec_u8_t mask6={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 1 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 2 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 3 */ + vec_u8_t srv3 = 
vec_perm(srv, srv, mask3); /* 3, 3 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 4, 4 */ + vec_u8_t srv5 = vec_perm(srv, srv, mask5); /* 4, 5 */ + vec_u8_t srv6 = vec_perm(srv, srv, mask6); /* 5, 6 */ + +vec_u8_t vfrac8_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8}; + +vec_u8_t vfrac8_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv1, vfrac8_32_1); + vmlo0 = vec_mulo(srv1, vfrac8_32_1); + vmle1 = vec_mule(srv2, vfrac8_1); + vmlo1 = vec_mulo(srv2, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv3, vfrac8_32_2); + vmlo0 = vec_mulo(srv3, vfrac8_32_2); + vmle1 = vec_mule(srv4, vfrac8_2); + vmlo1 = vec_mulo(srv4, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + //int offset[32] = {0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 9, 10, 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + /* y6, y7 */ + vmle0 = vec_mule(srv5, vfrac8_32_3); + vmlo0 = vec_mulo(srv5, vfrac8_32_3); + vmle1 = vec_mule(srv6, vfrac8_3); + vmlo1 = vec_mulo(srv6, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 4>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 
0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + +vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 4>(pixel* 
dst, const pixel *srcPix0, int bFilter) +{ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, mask15); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + 
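/*
 * The srvN / srvN0 / srvN00 vectors built above and below are byte-shifted
 * views of the reference row: maskK selects ref[K + x] for x = 0..15, so for
 * an output row whose table offset is off, the pair (view[off], view[off + 1])
 * supplies ref[off + x] and ref[off + x + 1] for 16 columns at once.  A
 * minimal scalar sketch of the per-row computation these views feed
 * (illustrative names only: refRow, off and frac are not identifiers used in
 * this patch):
 *
 *   // for one row y of an N x N block:
 *   //   const uint8_t *a = refRow + off;      // ref[off + x]
 *   //   const uint8_t *b = refRow + off + 1;  // ref[off + x + 1]
 *   //   for (int x = 0; x < N; x++)
 *   //       dst[y * N + x] = (uint8_t)(((32 - frac) * a[x] + frac * b[x] + 16) >> 5);
 */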
vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15); + + vec_u8_t srv000 = sv2; + vec_u8_t srv100 = vec_perm(sv2, sv3, mask1); + vec_u8_t srv200 = vec_perm(sv2, sv3, mask2); + vec_u8_t srv300 = vec_perm(sv2, sv3, mask3); + vec_u8_t srv400 = vec_perm(sv2, sv3, mask4); + vec_u8_t srv500 = vec_perm(sv2, sv3, mask5); + vec_u8_t srv600 = vec_perm(sv2, sv3, mask6); + +vec_u8_t vfrac16_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, 
vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv3, srv4, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv30, srv40, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv4, srv5, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv40, srv50, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv5, srv6, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv50, srv60, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv5, srv6, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv50, srv60, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv6, srv7, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv60, srv70, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv7, srv8, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv70, srv80, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv7, srv8, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv70, srv80, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv8, srv9, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv80, srv90, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv9, srva, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv90, srva0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv9, srva, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv90, srva0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srva, srvb, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srva0, srvb0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srvb, srvc, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srvb0, srvc0, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srvb, srvc, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srvb0, srvc0, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srvc, srvd, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srvc0, 
srvd0, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srvd, srve, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srvd0, srve0, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srvd, srve, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srvd0, srve0, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srve, srvf, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srve0, srvf0, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srvf, srv00, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srvf0, srv000, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srvf, srv00, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srvf0, srv000, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv00, srv10, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv000, srv100, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv10, srv20, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv100, srv200, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv10, srv20, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv100, srv200, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv20, srv30, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv200, srv300, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv30, srv40, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv300, srv400, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv30, srv40, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv300, srv400, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv40, srv50, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv400, srv500, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv50, srv60, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv500, srv600, vfrac16_32_31, vfrac16_31, vout_31); + //int offset[32] = { 11, 11, 12, 13, 13, 14, 15, 15, 16, 17, 17, 18, 19, 19, 20, 21}; + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 5>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) 
>> 5); + + y=1; off1 = offset[1]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-3; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + } + */ + //mode 31: + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x01, 0x02, 0x03, 0x04, 0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x02, 0x03, 0x04, 0x05, 0x02, 0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(9, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4}; /* fraction[0-3] */ + vec_u8_t vfrac4_32 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28}; /* 32 - fraction[0-3] */ + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 5>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 
16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off0 + 7] + f[0] * ref[off0 + 7] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[1]* ref[off1 + 7] + f[1] * ref[off1 + 7] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[2]* ref[off2 + 7] + f[2] * ref[off2 + 7] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off3 + 7] + f[0] * ref[off3 + 7] + 16) >> 5); + + ... + + y=7; off7 = offset[7]; x=0-7; + dst[y * dstStride + 0] = (pixel)((f32[7]* ref[off7 + 0] + f[7] * ref[off7 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[7]* ref[off7 + 1] + f[7] * ref[off7 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[7]* ref[off7 + 2] + f[7] * ref[off7 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[7]* ref[off7 + 3] + f[7] * ref[off7 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 7] = (pixel)((f32[0]* ref[off7 + 7] + f[0] * ref[off7 + 7] + 16) >> 5); + } + */ + //mode 31: + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0}; + + vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv =vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-7] = 0 */ + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* 0, 0 */ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); /* 1, 1 */ + vec_u8_t srv2 = vec_perm(srv, srv, mask2); /* 2, 2 */ + vec_u8_t srv3 = vec_perm(srv, srv, mask3); /* 3, 3 */ + vec_u8_t srv4 = vec_perm(srv, srv, mask4); /* 2, 3 */ + +vec_u8_t vfrac8_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac8_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + /* y0, y1 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac8_32_0); /* (32 - fraction) * ref[offset + x], x=0-7 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac8_32_0); + vec_u16_t vmle1 = vec_mule(srv1, vfrac8_0); /* fraction * ref[offset + x + 1], x=0-7 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac8_0); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_0 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + + /* y2, y3 */ + vmle0 = vec_mule(srv1, vfrac8_32_1); + vmlo0 = vec_mulo(srv1, vfrac8_32_1); + vmle1 = vec_mule(srv2, vfrac8_1); + vmlo1 = vec_mulo(srv2, vfrac8_1); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_1 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y4, y5 */ + vmle0 = vec_mule(srv2, vfrac8_32_2); + vmlo0 = vec_mulo(srv2, vfrac8_32_2); + vmle1 = vec_mule(srv3, vfrac8_2); + vmlo1 = vec_mulo(srv3, vfrac8_2); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + 
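/*
 * Even/odd-lane pattern used throughout these kernels: vec_mule/vec_mulo widen
 * the u8 x u8 products of the even and odd byte lanes into two u16 vectors,
 * the rounding constant 16 is added, the sums are shifted right by 5, and
 * vec_mergeh/vec_mergel plus vec_pack below re-interleave the two halves back
 * into the original byte order.  Scalar equivalent of one 16-byte step
 * (sketch only; w32 = 32 - fraction, w = fraction, a/b are the two shifted
 * reference views, and none of these names appear in the patch):
 *
 *   // for (int i = 0; i < 16; i++)
 *   //     out[i] = (uint8_t)((w32 * a[i] + w * b[i] + 16) >> 5);
 */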
vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_2 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + /* y6, y7 */ + vmle0 = vec_mule(srv3, vfrac8_32_3); + vmlo0 = vec_mulo(srv3, vfrac8_32_3); + vmle1 = vec_mule(srv4, vfrac8_3); + vmlo1 = vec_mulo(srv4, vfrac8_3); + vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + ve = vec_sra(vsume, u16_5); + vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vo = vec_sra(vsumo, u16_5); + vec_u8_t vout_3 = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 5>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + y=3; off3 = offset[3]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[3]* ref[off3 + 0] + f[3] * ref[off3 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[3]* ref[off3 + 1] + f[3] * ref[off3 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[3]* ref[off3 + 2] + f[3] * ref[off3 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[3]* ref[off3 + 3] + f[3] * ref[off3 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[3]* ref[off3 + 15] + f[3] * ref[off3 + 16] + 16) >> 5); + + ... + + y=15; off7 = offset[7]; x=0-15; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + } + */ + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(49, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + +vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 5>(pixel* 
dst, const pixel *srcPix0, int bFilter) +{ + /* + for (int y = 0; y < width; y++) + { + y=0; off0 = offset[0]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[0]* ref[off0 + 0] + f[0] * ref[off0 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[0]* ref[off0 + 1] + f[0] * ref[off0 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[0]* ref[off0 + 2] + f[0] * ref[off0 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[0]* ref[off0 + 3] + f[0] * ref[off0 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[0]* ref[off0 + 15] + f[0] * ref[off0 + 16] + 16) >> 5); + + y=1; off1 = offset[1]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[1]* ref[off1 + 0] + f[1] * ref[off1 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[1]* ref[off1 + 1] + f[1] * ref[off1 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[1]* ref[off1 + 2] + f[1] * ref[off1 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[1]* ref[off1 + 3] + f[1] * ref[off1 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[1]* ref[off1 + 15] + f[1] * ref[off1 + 16] + 16) >> 5); + + y=2; off2 = offset[2]; x=0-31; + dst[y * dstStride + 0] = (pixel)((f32[2]* ref[off2 + 0] + f[2] * ref[off2 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[2]* ref[off2 + 1] + f[2] * ref[off2 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[2]* ref[off2 + 2] + f[2] * ref[off2 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[2]* ref[off2 + 3] + f[2] * ref[off2 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[2]* ref[off2 + 15] + f[2] * ref[off2 + 16] + 16) >> 5); + + ... + + y=15; off15 = offset[15]; x=0-31; off15-off30 = 1; + dst[y * dstStride + 0] = (pixel)((f32[15]* ref[off15 + 0] + f[15] * ref[off15 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[15]* ref[off15 + 1] + f[15] * ref[off15 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[15]* ref[off15 + 2] + f[15] * ref[off15 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[15]* ref[off15 + 3] + f[15] * ref[off15 + 4] + 16) >> 5); + ... + dst[y * dstStride + 15] = (pixel)((f32[15]* ref[off15 + 15] + f[15] * ref[off15 + 16] + 16) >> 5); + + ... + + y=31; off31= offset[31]; x=0-31; off31 = 2; + dst[y * dstStride + 0] = (pixel)((f32[31]* ref[off15 + 0] + f[31] * ref[off31 + 1] + 16) >> 5); + dst[y * dstStride + 1] = (pixel)((f32[31]* ref[off15 + 1] + f[31] * ref[off31 + 2] + 16) >> 5); + dst[y * dstStride + 2] = (pixel)((f32[31]* ref[off15 + 2] + f[31] * ref[off31 + 3] + 16) >> 5); + dst[y * dstStride + 3] = (pixel)((f32[31]* ref[off15 + 3] + f[31] * ref[off31 + 4] + 16) >> 5); + ... 
+ dst[y * dstStride + 31] = (pixel)((f32[31]* ref[off15 + 31] + f[31] * ref[off31 + 32] + 16) >> 5); + } + */ + //mode 31: + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + //int fraction[32] = {17, 2, 19, 4, 21, 6, 23, 8, 25, 10, 27, 12, 29, 14, 31, 16, 1, 18, 3, 20, 5, 22, 7, 24, 9, 26, 11, 28, 13, 30, 15, 0}; + + //vec_u8_t mask0={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}; + vec_u8_t mask1={0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10}; + vec_u8_t mask2={0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11}; + vec_u8_t mask3={0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12}; + vec_u8_t mask4={0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t mask5={0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t mask6={0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t mask7={0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t mask8={0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t mask9={0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t mask10={0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t mask11={0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t mask12={0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t mask13={0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t mask14={0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t mask15={0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t sv0 =vec_xl(65, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv1 =vec_xl(81, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv2 =vec_xl(97, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + vec_u8_t sv3 =vec_xl(113, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-14] = 0, off[15] = 1 */ + + vec_u8_t srv0 = sv0; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(sv0, sv1, mask1); + vec_u8_t srv2 = vec_perm(sv0, sv1, mask2); + vec_u8_t srv3 = vec_perm(sv0, sv1, mask3); + vec_u8_t srv4 = vec_perm(sv0, sv1, mask4); + vec_u8_t srv5 = vec_perm(sv0, sv1, mask5); + vec_u8_t srv6 = vec_perm(sv0, sv1, mask6); + vec_u8_t srv7 = vec_perm(sv0, sv1, mask7); + vec_u8_t srv8 = vec_perm(sv0, sv1, mask8); + vec_u8_t srv9 = vec_perm(sv0, sv1, mask9); + vec_u8_t srva = vec_perm(sv0, sv1, mask10); + vec_u8_t srvb = vec_perm(sv0, sv1, mask11); + vec_u8_t srvc = vec_perm(sv0, sv1, mask12); + vec_u8_t srvd = vec_perm(sv0, sv1, mask13); + vec_u8_t srve = vec_perm(sv0, sv1, mask14); + vec_u8_t srvf = vec_perm(sv0, sv1, 
mask15); + + vec_u8_t srv00 = sv1; /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv10 = vec_perm(sv1, sv2, mask1); + vec_u8_t srv20 = vec_perm(sv1, sv2, mask2); + vec_u8_t srv30 = vec_perm(sv1, sv2, mask3); + vec_u8_t srv40 = vec_perm(sv1, sv2, mask4); + vec_u8_t srv50 = vec_perm(sv1, sv2, mask5); + vec_u8_t srv60 = vec_perm(sv1, sv2, mask6); + vec_u8_t srv70 = vec_perm(sv1, sv2, mask7); + vec_u8_t srv80 = vec_perm(sv1, sv2, mask8); + vec_u8_t srv90 = vec_perm(sv1, sv2, mask9); + vec_u8_t srva0 = vec_perm(sv1, sv2, mask10); + vec_u8_t srvb0 = vec_perm(sv1, sv2, mask11); + vec_u8_t srvc0 = vec_perm(sv1, sv2, mask12); + vec_u8_t srvd0 = vec_perm(sv1, sv2, mask13); + vec_u8_t srve0 = vec_perm(sv1, sv2, mask14); + vec_u8_t srvf0 = vec_perm(sv1, sv2, mask15); + + vec_u8_t srv000 = sv2; + vec_u8_t srv100 = vec_perm(sv2, sv3, mask1); + vec_u8_t srv200 = vec_perm(sv2, sv3, mask2); + + +vec_u8_t vfrac16_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 
28}; +vec_u8_t vfrac16_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + 
x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + //int offset[32] = {0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 17}; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv00, srv10, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv2, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv10, srv20, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv1, srv2, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv10, srv20, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv2, srv3, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv20, srv30, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv2, srv3, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv20, srv30, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv3, srv4, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv30, srv40, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv3, srv4, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv30, srv40, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv4, srv5, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv40, srv50, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv4, srv5, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv40, srv50, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv5, srv6, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv50, srv60, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv5, srv6, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv50, srv60, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv6, srv7, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv60, srv70, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv6, srv7, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv60, srv70, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv7, srv8, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv70, srv80, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv7, srv8, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv70, srv80, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv8, srv9, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv80, srv90, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv9, srva, 
vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv90, srva0, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv9, srva, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv90, srva0, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srva, srvb, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srva0, srvb0, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srva, srvb, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srva0, srvb0, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srvb, srvc, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srvb0, srvc0, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srvb, srvc, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srvb0, srvc0, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srvc, srvd, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srvc0, srvd0, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srvc, srvd, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srvc0, srvd0, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srvd, srve, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srvd0, srve0, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srvd, srve, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srvd0, srve0, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srve, srvf, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srve0, srvf0, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srve, srvf, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srve0, srvf0, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srvf, srv00, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srvf0, srv000, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srvf, srv00, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srvf0, srv000, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv00, srv10, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv000, srv100, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv10, srv20, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv100, srv200, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void one_ang_pred_altivec<4, 17>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3}; + vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x3, 0x4, 0x5, 
0x6, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4}; + + /*vec_u8_t srv_left=vec_xl(8, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1);*/ + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){6, 6, 6, 6, 12, 12, 12, 12, 18, 18, 18, 18, 24, 24, 24, 24}; + vec_u8_t vfrac4_32 = (vec_u8_t){26, 26, 26, 26, 20, 20, 20, 20, 14, 14, 14, 14, 8, 8, 8, 8}; + + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 17>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, }; + vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, }; + vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; + vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; + vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; + vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; + vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; + vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + +/* + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = 
vec_perm(srv, srv, mask7); + + + /* fraction[0-7] */ +vec_u8_t vfrac8_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_2 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_3 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* 32 - fraction[0-7] */ +vec_u8_t vfrac8_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 16, 16, 16, 16, 16, 16, 16, 16}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 17>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; + vec_u8_t mask1={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; + vec_u8_t mask2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; + vec_u8_t mask3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; + vec_u8_t mask4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; + //vec_u8_t mask5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; + vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; + vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; + vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; + vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; + //vec_u8_t mask10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; + vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; + vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; + vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; + vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + //vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left=vec_xl(32, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_16 ={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 
0x10, 0x11, 0x12, 0x13}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(4, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(36, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =srv4; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = vec_perm(s0, s1, mask8); + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0; + vec_u8_t srv2_add1 = srv1; + vec_u8_t srv3_add1 = srv2; + vec_u8_t srv4_add1 = srv3; + vec_u8_t srv5_add1 = srv3; + vec_u8_t srv6_add1 = srv4; + vec_u8_t srv7_add1 = srv6; + vec_u8_t srv8_add1 = srv7; + vec_u8_t srv9_add1 = srv8; + vec_u8_t srv10_add1 = srv8; + vec_u8_t srv11_add1 = srv9; + vec_u8_t srv12_add1= srv11; + vec_u8_t srv13_add1 = srv12; + vec_u8_t srv14_add1 = srv13; + vec_u8_t srv15_add1 = srv13; + + + /* fraction[0-15] */ +vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + /* 32- fraction[0-15] */ +vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 17>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask2={0x7, 
0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask11={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask12={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask13={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask14={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +//vec_u8_t mask15={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask16={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; + +vec_u8_t mask17={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask18={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask19={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask20={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask21={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask22={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask23={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask24={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask25={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask26={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask27={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask28={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t refmask_32_0 ={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc}; + vec_u8_t 
refmask_32_1 = {0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(7, srcPix0); + vec_u8_t s3 = vec_xl(16+7, srcPix0); +*/ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1f, 0x1e, 0x1c, 0x1b, 0x1a, 0x19, 0x17, 0x16, 0x15, 0x14, 0x12, 0x11, 0x10, 0xf, 0xe, 0xc }; + vec_u8_t refmask_32_1={0xb, 0xa, 0x9, 0x7, 0x6, 0x5, 0x4, 0x2, 0x1, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(71, srcPix0); + vec_u8_t s3 = vec_xl(87, srcPix0); + + vec_u8_t srv0 = vec_perm(s1, s2, mask0); + vec_u8_t srv1 = vec_perm(s1, s2, mask1); + vec_u8_t srv2 = vec_perm(s1, s2, mask2); + vec_u8_t srv3 = vec_perm(s1, s2, mask3); + vec_u8_t srv4 = vec_perm(s1, s2, mask4); + vec_u8_t srv5 =srv4; + vec_u8_t srv6 = vec_perm(s1, s2, mask6); + vec_u8_t srv7 = vec_perm(s1, s2, mask7); + vec_u8_t srv8 = vec_perm(s1, s2, mask8); + vec_u8_t srv9 = vec_perm(s1, s2, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = s1; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv16_0 = vec_perm(s2, s3, mask0); + vec_u8_t srv16_1 = vec_perm(s2, s3, mask1); + vec_u8_t srv16_2 = vec_perm(s2, s3, mask2); + vec_u8_t srv16_3 = vec_perm(s2, s3, mask3); + vec_u8_t srv16_4 = vec_perm(s2, s3, mask4); + vec_u8_t srv16_5 =srv16_4; + vec_u8_t srv16_6 = vec_perm(s2, s3, mask6); + vec_u8_t srv16_7 = vec_perm(s2, s3, mask7); + vec_u8_t srv16_8 = vec_perm(s2, s3, mask8); + vec_u8_t srv16_9 = vec_perm(s2, s3, mask9); + vec_u8_t srv16_10 = srv16_9; + vec_u8_t srv16_11 = s2; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = vec_perm(s1, s2, mask13); + vec_u8_t srv16_14 = vec_perm(s1, s2, mask14); + vec_u8_t srv16_15 = srv16_14; + //0,1,2,3,4,4,6,7,8,9,9(1,2),11(1),12(0,1),13,14,14,15,16,17,18,19,20,20,22,23,24,25,25,27,28,29,30(0),30, + + vec_u8_t srv16 = vec_perm(s0, s1, mask16); + vec_u8_t srv17 = vec_perm(s0, s1, mask17); + vec_u8_t srv18 = vec_perm(s0, s1, mask18); + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = vec_perm(s0, s1, mask20); + vec_u8_t srv21 = srv20; + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = vec_perm(s0, s1, mask23); + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = vec_perm(s0, s1, mask25); + vec_u8_t srv26 = srv25; + vec_u8_t srv27 = vec_perm(s0, s1, mask27); + vec_u8_t srv28 = vec_perm(s0, s1, mask28); + vec_u8_t srv29 = vec_perm(s0, s1, mask29); + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16); + vec_u8_t srv16_17 = vec_perm(s1, s2, mask17); + vec_u8_t srv16_18 = vec_perm(s1, s2, mask18); + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = vec_perm(s1, s2, mask20); + vec_u8_t srv16_21 = srv16_20; + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = vec_perm(s1, s2, mask23); + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask25); + vec_u8_t srv16_26 = srv16_25; + vec_u8_t srv16_27 = 
vec_perm(s1, s2, mask27); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask28); + vec_u8_t srv16_29 = vec_perm(s1, s2, mask29); + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv1add1 = srv0; + vec_u8_t srv2add1 = srv1; + vec_u8_t srv3add1 = srv2; + vec_u8_t srv4add1 = srv3; + vec_u8_t srv5add1 = srv3; + vec_u8_t srv6add1 = srv4; + vec_u8_t srv7add1 = srv6; + vec_u8_t srv8add1 = srv7; + vec_u8_t srv9add1 = srv8; + vec_u8_t srv10add1 = srv8; + vec_u8_t srv11add1 = srv9; + vec_u8_t srv12add1= srv11; + vec_u8_t srv13add1 = srv12; + vec_u8_t srv14add1 = srv13; + vec_u8_t srv15add1 = srv13; + + //0(1,2),1,2,3,3.4,6,7,8,8,9,11(1),12(0,1),13,13,14,16, 17, 18,19,19,20,22,26,24,24,25,27,28,29,29, + + vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0); + vec_u8_t srv16add1_1 = srv16_0; + vec_u8_t srv16add1_2 = srv16_1; + vec_u8_t srv16add1_3 = srv16_2; + vec_u8_t srv16add1_4 = srv16_3; + vec_u8_t srv16add1_5 = srv16_3; + vec_u8_t srv16add1_6 = srv16_4; + vec_u8_t srv16add1_7 = srv16_6; + vec_u8_t srv16add1_8 = srv16_7; + vec_u8_t srv16add1_9 = srv16_8; + vec_u8_t srv16add1_10 = srv16_8; + vec_u8_t srv16add1_11 = srv16_9; + vec_u8_t srv16add1_12= srv16_11; + vec_u8_t srv16add1_13 = srv16_12; + vec_u8_t srv16add1_14 = srv16_13; + vec_u8_t srv16add1_15 = srv16_13; + + vec_u8_t srv16add1 = srv14; + vec_u8_t srv17add1 = srv16; + vec_u8_t srv18add1 = srv17; + vec_u8_t srv19add1 = srv18; + vec_u8_t srv20add1 = srv19; + vec_u8_t srv21add1 = srv19; + vec_u8_t srv22add1 = srv20; + vec_u8_t srv23add1 = srv22; + vec_u8_t srv24add1 = srv23; + vec_u8_t srv25add1 = srv24; + vec_u8_t srv26add1 = srv24; + vec_u8_t srv27add1 = srv25; + vec_u8_t srv28add1 = srv27; + vec_u8_t srv29add1 = srv28; + vec_u8_t srv30add1 = srv29; + vec_u8_t srv31add1 = srv29; + + vec_u8_t srv16add1_16 = srv16_14; + vec_u8_t srv16add1_17 = srv16_16; + vec_u8_t srv16add1_18 = srv16_17; + vec_u8_t srv16add1_19 = srv16_18; + vec_u8_t srv16add1_20 = srv16_19; + vec_u8_t srv16add1_21 = srv16_19; + vec_u8_t srv16add1_22 = srv16_20; + vec_u8_t srv16add1_23 = srv16_22; + vec_u8_t srv16add1_24 = srv16_23; + vec_u8_t srv16add1_25 = srv16_24; + vec_u8_t srv16add1_26 = srv16_24; + vec_u8_t srv16add1_27 = srv16_25; + vec_u8_t srv16add1_28 = srv16_27; + vec_u8_t srv16add1_29 = srv16_28; + vec_u8_t srv16add1_30 = srv16_29; + vec_u8_t srv16add1_31 = srv16_29; + +vec_u8_t vfrac16_0 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_1 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_2 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_5 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_6 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_9 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_10 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){14, 14, 
14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_13 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_14 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); 
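+    /* Note on the helper used throughout these kernels: one_line(r0, r1, f32v, fv, out)
+       is presumably the macro form of the arithmetic written out explicitly in the
+       4x4 mode-17 kernel above (vec_mule/vec_mulo widening multiplies, vec_add with
+       u16_16, vec_sra by u16_5, vec_pack of the merged even/odd halves), i.e. per
+       byte lane: out = ((32 - frac) * ref[off + x] + frac * ref[off + x + 1] + 16) >> 5. */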
+ + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_11, vfrac16_11, 
vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 16>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; + vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + +/* + vec_u8_t srv_left=vec_xl(8, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_4={0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){11, 11, 11, 11, 22, 22, 22, 22, 1, 1, 1, 1, 12, 12, 12, 12}; + vec_u8_t vfrac4_32 = (vec_u8_t){21, 21, 21, 21, 10, 10, 10, 10, 31, 31, 31, 31, 20, 20, 20, 20}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + 
vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 16>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, }; +vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + +/* + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(17, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_8={0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_1 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_2 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_3 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24}; + +vec_u8_t vfrac8_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, 
vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 16>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask1={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask2={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask3={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t maskadd1_0={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +/*vec_u8_t maskadd1_1={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t maskadd1_2={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t maskadd1_3={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t maskadd1_4={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_5={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_9={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_11={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 
0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +*/ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left=vec_xl(32, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(6, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xe, 0xc, 0xb, 0x9, 0x8, 0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(38, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 =srv4; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0; + vec_u8_t srv2_add1 = srv0; + vec_u8_t srv3_add1 = srv1; + vec_u8_t srv4_add1 = srv3; + vec_u8_t srv5_add1 = srv3; + vec_u8_t srv6_add1 = srv4; + vec_u8_t srv7_add1 = srv6; + vec_u8_t srv8_add1 = srv6; + vec_u8_t srv9_add1 = srv7; + vec_u8_t srv10_add1 = srv9; + vec_u8_t srv11_add1 = srv9; + vec_u8_t srv12_add1= srv10; + vec_u8_t srv13_add1 = srv12; + vec_u8_t srv14_add1 = srv12; + vec_u8_t srv15_add1 = srv13; +vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 
26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, 
dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 16>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask7={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +//vec_u8_t mask8={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask9={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask10={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +//vec_u8_t mask11={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask12={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask13={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +//vec_u8_t mask14={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask15={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; + +vec_u8_t mask16={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask17={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask18={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask19={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask20={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask21={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask22={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask23={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask24={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask26={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 
0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t refmask_32_0 = {0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8, }; + vec_u8_t refmask_32_1 = {0x6, 0x5, 0x3, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(12, srcPix0); + vec_u8_t s3 = vec_xl(16+12, srcPix0); +*/ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1e, 0x1d, 0x1b, 0x1a, 0x18, 0x17, 0x15, 0x14, 0x12, 0x11, 0xf, 0xe, 0xc, 0xb, 0x9, 0x8}; + vec_u8_t refmask_32_1={0x6, 0x5, 0x3, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(76, srcPix0); + vec_u8_t s3 = vec_xl(92, srcPix0); + + vec_u8_t srv0 = vec_perm(s1, s2, mask0); + vec_u8_t srv1 = vec_perm(s1, s2, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s1, s2, mask3); + vec_u8_t srv4 = vec_perm(s1, s2, mask4); + vec_u8_t srv5 = srv4; + vec_u8_t srv6 = s1; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv16_0 = vec_perm(s2, s3, mask0); + vec_u8_t srv16_1 = vec_perm(s2, s3, mask1); + vec_u8_t srv16_2 = srv16_1; + vec_u8_t srv16_3 = vec_perm(s2, s3, mask3); + vec_u8_t srv16_4 = vec_perm(s2, s3, mask4); + vec_u8_t srv16_5 = srv16_4; + vec_u8_t srv16_6 = s2; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = vec_perm(s1, s2, mask9); + vec_u8_t srv16_10 = vec_perm(s1, s2, mask10); + vec_u8_t srv16_11 = srv16_10; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = vec_perm(s1, s2, mask13); + vec_u8_t srv16_14 = srv16_13; + vec_u8_t srv16_15 = vec_perm(s1, s2, mask15); + + //0(1,2),1,1,3,4,4,6(1),7(0,1),7,9,10,10,12,13,13,15,16,16,18,19,19,21,22,22,24,25,25,27,28,28,30,30 + + vec_u8_t srv16 = vec_perm(s0, s1, mask16); + vec_u8_t srv17 = srv16; + vec_u8_t srv18 = vec_perm(s0, s1, mask18); + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = srv19; + vec_u8_t srv21 = vec_perm(s0, s1, mask21); + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = srv22; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = vec_perm(s0, s1, mask25); + vec_u8_t srv26 = srv25; + vec_u8_t srv27 = vec_perm(s0, s1, mask27); + vec_u8_t srv28 = vec_perm(s0, s1, mask28); + vec_u8_t srv29 = srv28; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16); 
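+    /* Note on the data flow (a descriptive sketch of what the shuffles above set up):
+     * s0..s3 hold the projected reference samples gathered with vec_xl/vec_perm, and each
+     * srvN / srv16_N pair is the 16-byte window ref[offset[y] + x] for the left and right
+     * halves of row y.  The srvNadd1 vectors defined below are the same windows advanced by
+     * one sample, ref[offset[y] + x + 1], so that one_line() can blend the two with the
+     * weights f32[y] = 32 - f[y] and f[y], as the formula comment further below states and
+     * as written out explicitly in the 4xN specializations (vec_mule/vec_mulo, add 16,
+     * shift right by 5, pack). */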
+ vec_u8_t srv16_17 = srv16_16; + vec_u8_t srv16_18 = vec_perm(s1, s2, mask18); + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = srv16_19; + vec_u8_t srv16_21 = vec_perm(s1, s2, mask21); + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = srv16_22; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = vec_perm(s1, s2, mask25); + vec_u8_t srv16_26 = srv16_25; + vec_u8_t srv16_27 = vec_perm(s1, s2, mask27); + vec_u8_t srv16_28 = vec_perm(s1, s2, mask28); + vec_u8_t srv16_29 = srv16_28; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv1add1 = srv0; + vec_u8_t srv2add1 = srv0; + vec_u8_t srv3add1 = srv1; + vec_u8_t srv4add1 = srv3; + vec_u8_t srv5add1 = srv3; + vec_u8_t srv6add1 = srv4; + vec_u8_t srv7add1 = s1; + vec_u8_t srv8add1 = s1; + vec_u8_t srv9add1 = srv7; + vec_u8_t srv10add1 = srv9; + vec_u8_t srv11add1 = srv9; + vec_u8_t srv12add1= srv10; + vec_u8_t srv13add1 = srv12; + vec_u8_t srv14add1 = srv12; + vec_u8_t srv15add1 = srv13; + + vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0); + vec_u8_t srv16add1_1 = srv16_0; + vec_u8_t srv16add1_2 = srv16_0; + vec_u8_t srv16add1_3 = srv16_1; + vec_u8_t srv16add1_4 = srv16_3; + vec_u8_t srv16add1_5 = srv16_3; + vec_u8_t srv16add1_6 = srv16_4; + vec_u8_t srv16add1_7 = s2; + vec_u8_t srv16add1_8 = s2; + vec_u8_t srv16add1_9 = srv16_7; + vec_u8_t srv16add1_10 = srv16_9; + vec_u8_t srv16add1_11 = srv16_9; + vec_u8_t srv16add1_12= srv16_10; + vec_u8_t srv16add1_13 = srv16_12; + vec_u8_t srv16add1_14 = srv16_12; + vec_u8_t srv16add1_15 = srv16_13; + + //0,0,1,3,3,4,6(0),6,7,9,9,10,12,12,13,15,15,16,18,18,19,21,21,22,24,24,25,27,27,28,28 + + vec_u8_t srv16add1 = srv15; + vec_u8_t srv17add1 = srv15; + vec_u8_t srv18add1 = srv16; + vec_u8_t srv19add1 = srv18; + vec_u8_t srv20add1 = srv18; + vec_u8_t srv21add1 = srv19; + vec_u8_t srv22add1 = srv21; + vec_u8_t srv23add1 = srv21; + vec_u8_t srv24add1 = srv22; + vec_u8_t srv25add1 = srv24; + vec_u8_t srv26add1 = srv24; + vec_u8_t srv27add1 = srv25; + vec_u8_t srv28add1 = srv27; + vec_u8_t srv29add1 = srv27; + vec_u8_t srv30add1 = srv28; + vec_u8_t srv31add1 = srv28; + + vec_u8_t srv16add1_16 = srv16_15; + vec_u8_t srv16add1_17 = srv16_15; + vec_u8_t srv16add1_18 = srv16_16; + vec_u8_t srv16add1_19 = srv16_18; + vec_u8_t srv16add1_20 = srv16_18; + vec_u8_t srv16add1_21 = srv16_19; + vec_u8_t srv16add1_22 = srv16_21; + vec_u8_t srv16add1_23 = srv16_21; + vec_u8_t srv16add1_24 = srv16_22; + vec_u8_t srv16add1_25 = srv16_24; + vec_u8_t srv16add1_26 = srv16_24; + vec_u8_t srv16add1_27 = srv16_25; + vec_u8_t srv16add1_28 = srv16_27; + vec_u8_t srv16add1_29 = srv16_27; + vec_u8_t srv16add1_30 = srv16_28; + vec_u8_t srv16add1_31 = srv16_28; + +vec_u8_t vfrac16_0 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; 
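+    /* The vfrac16_N constants in this block replicate the per-row fraction f[y] across all
+     * 16 lanes, and the matching vfrac16_32_N constants hold the complementary weight
+     * 32 - f[y]; each pair sums to 32 and feeds the two multiplies of the interpolation. */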
+vec_u8_t vfrac16_8 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_18 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_20 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_22 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_26 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_28 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_30 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, 
srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, 
vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 15>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; + vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + +/* + vec_u8_t srv_left=vec_xl(8, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_4={0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t vfrac4 = (vec_u8_t){15, 15, 15, 15, 30, 30, 30, 30, 13, 13, 13, 13, 28, 28, 28, 28}; + vec_u8_t vfrac4_32 = (vec_u8_t){17, 17, 17, 17, 2, 2, 2, 2, 19, 19, 19, 19, 4, 4, 4, 4}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), 
u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 15>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + +/* + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_1 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_2 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac8_3 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; 
x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 15>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask2={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask4={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +/*vec_u8_t maskadd1_1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t maskadd1_3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_5={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_6={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_7={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_8={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_9={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_10={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_12={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t 
maskadd1_14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + +/* + vec_u8_t srv_left=vec_xl(32, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(8, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(40, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = srv5; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= srv11; + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0; + vec_u8_t srv2_add1 = srv0; + vec_u8_t srv3_add1 = srv1; + vec_u8_t srv4_add1 = srv1; + vec_u8_t srv5_add1 = srv3; + vec_u8_t srv6_add1 = srv3; + vec_u8_t srv7_add1 = srv5; + vec_u8_t srv8_add1 = srv5; + vec_u8_t srv9_add1 = srv7; + vec_u8_t srv10_add1 = srv7; + vec_u8_t srv11_add1 = srv9; + vec_u8_t srv12_add1= srv9; + vec_u8_t srv13_add1 = srv11; + vec_u8_t srv14_add1 = srv11; + vec_u8_t srv15_add1 = srv13; + +vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 
17, 17, 17, 17}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 15>(pixel* dst, const pixel 
*srcPix0, int bFilter) +{ +//vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask1={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +//vec_u8_t mask2={0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, }; +vec_u8_t mask3={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +//vec_u8_t mask4={0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, }; +vec_u8_t mask5={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +//vec_u8_t mask6={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; +vec_u8_t mask7={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +//vec_u8_t mask8={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask9={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +//vec_u8_t mask10={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask11={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +//vec_u8_t mask12={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask13={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask14={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask15={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; + +vec_u8_t mask16={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask17={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask18={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask19={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask20={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask21={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask22={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask23={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask24={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask25={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask27={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask29={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 
0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left0=vec_xl(64, srcPix0); + vec_u8_t srv_left1=vec_xl(80, srcPix0); + vec_u8_t refmask_32 = {0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32); + vec_u8_t s1 = vec_xl(0, srcPix0);; + vec_u8_t s2 = vec_xl(16, srcPix0); + vec_u8_t s3 = vec_xl(32, srcPix0); + */ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1e, 0x1c, 0x1a, 0x18, 0x17, 0x15, 0x13, 0x11, 0xf, 0xd, 0xb, 0x9, 0x8, 0x6, 0x4, 0x2}; + vec_u8_t refmask_32_1={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u8_t s0 = vec_perm(srv_left0, srv_left1, refmask_32_0); + vec_u8_t s1 = vec_perm(srv_left0, srv_right, refmask_32_1); + vec_u8_t s2 = vec_xl(80, srcPix0); + vec_u8_t s3 = vec_xl(96, srcPix0); + + vec_u8_t srv0 = s1; + vec_u8_t srv1 = vec_perm(s0, s1, mask1); + vec_u8_t srv2 = srv1; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = vec_perm(s0, s1, mask5); + vec_u8_t srv6 = srv5; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = vec_perm(s0, s1, mask11); + vec_u8_t srv12= srv11; + vec_u8_t srv13 = vec_perm(s0, s1, mask13); + vec_u8_t srv14 = srv13; + vec_u8_t srv15 = vec_perm(s0, s1, mask15); + + vec_u8_t srv16_0 = s2; + vec_u8_t srv16_1 = vec_perm(s1, s2, mask1); + vec_u8_t srv16_2 = srv16_1; + vec_u8_t srv16_3 = vec_perm(s1, s2, mask3); + vec_u8_t srv16_4 = srv16_3; + vec_u8_t srv16_5 = vec_perm(s1, s2, mask5); + vec_u8_t srv16_6 = srv16_5; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = vec_perm(s1, s2, mask9); + vec_u8_t srv16_10 = srv16_9; + vec_u8_t srv16_11 = vec_perm(s1, s2, mask11); + vec_u8_t srv16_12= srv16_11; + vec_u8_t srv16_13 = vec_perm(s1, s2, mask13); + vec_u8_t srv16_14 = srv16_13; + vec_u8_t srv16_15 = vec_perm(s1, s2, mask15); + + //s1, 1,1,3,3,5,5,7,7,9,9,11,11,13,13,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28,s0,s0 + + vec_u8_t srv16 = vec_perm(s0, s1, mask16); + vec_u8_t srv17 = srv16; + vec_u8_t srv18 = vec_perm(s0, s1, mask18); + vec_u8_t srv19 = srv18; + vec_u8_t srv20 = vec_perm(s0, s1, mask20); + vec_u8_t srv21 = srv20; + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = srv22; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = srv24; + vec_u8_t srv26 = vec_perm(s0, s1, mask26); + vec_u8_t srv27 = srv26; + vec_u8_t srv28 = vec_perm(s0, s1, mask28); + vec_u8_t srv29 = srv28; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = vec_perm(s1, s2, mask16); + vec_u8_t srv16_17 = srv16_16; + vec_u8_t srv16_18 = vec_perm(s1, s2, mask18); + vec_u8_t srv16_19 = srv16_18; + vec_u8_t srv16_20 = vec_perm(s1, s2, mask20); + vec_u8_t srv16_21 = srv16_20; + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = srv16_22; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = srv16_24; + vec_u8_t srv16_26 = vec_perm(s1, s2, mask26); + vec_u8_t srv16_27 = srv16_26; + vec_u8_t srv16_28 = vec_perm(s1, s2, mask28); + vec_u8_t srv16_29 = srv16_28; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = 
vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv1add1 = s1; + vec_u8_t srv2add1 = s1; + vec_u8_t srv3add1 = srv1; + vec_u8_t srv4add1 = srv1; + vec_u8_t srv5add1 = srv3; + vec_u8_t srv6add1 = srv3; + vec_u8_t srv7add1 = srv6; + vec_u8_t srv8add1 = srv6; + vec_u8_t srv9add1 = srv7; + vec_u8_t srv10add1 = srv7; + vec_u8_t srv11add1 = srv9; + vec_u8_t srv12add1= srv9; + vec_u8_t srv13add1 = srv11; + vec_u8_t srv14add1 = srv11; + vec_u8_t srv15add1 = srv14; + + vec_u8_t srv16add1_0 = vec_perm(s2, s3, maskadd1_0); + vec_u8_t srv16add1_1 = s2; + vec_u8_t srv16add1_2 = s2; + vec_u8_t srv16add1_3 = srv16_1; + vec_u8_t srv16add1_4 = srv16_1; + vec_u8_t srv16add1_5 = srv16_3; + vec_u8_t srv16add1_6 = srv16_3; + vec_u8_t srv16add1_7 = srv16_6; + vec_u8_t srv16add1_8 = srv16_6; + vec_u8_t srv16add1_9 = srv16_7; + vec_u8_t srv16add1_10 = srv16_7; + vec_u8_t srv16add1_11 = srv16_9; + vec_u8_t srv16add1_12= srv16_9; + vec_u8_t srv16add1_13 = srv16_11; + vec_u8_t srv16add1_14 = srv16_11; + vec_u8_t srv16add1_15 = srv16_14; + + //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28, + + vec_u8_t srv16add1 = srv15; + vec_u8_t srv17add1 = srv15; + vec_u8_t srv18add1 = srv16; + vec_u8_t srv19add1 = srv16; + vec_u8_t srv20add1 = srv18; + vec_u8_t srv21add1 = srv18; + vec_u8_t srv22add1 = srv20; + vec_u8_t srv23add1 = srv20; + vec_u8_t srv24add1 = srv22; + vec_u8_t srv25add1 = srv22; + vec_u8_t srv26add1 = srv24; + vec_u8_t srv27add1 = srv24; + vec_u8_t srv28add1 = srv26; + vec_u8_t srv29add1 = srv26; + vec_u8_t srv30add1 = srv28; + vec_u8_t srv31add1 = srv28; + + vec_u8_t srv16add1_16 = srv16_15; + vec_u8_t srv16add1_17 = srv16_15; + vec_u8_t srv16add1_18 = srv16_16; + vec_u8_t srv16add1_19 = srv16_16; + vec_u8_t srv16add1_20 = srv16_18; + vec_u8_t srv16add1_21 = srv16_18; + vec_u8_t srv16add1_22 = srv16_20; + vec_u8_t srv16add1_23 = srv16_20; + vec_u8_t srv16add1_24 = srv16_22; + vec_u8_t srv16add1_25 = srv16_22; + vec_u8_t srv16add1_26 = srv16_24; + vec_u8_t srv16add1_27 = srv16_24; + vec_u8_t srv16add1_28 = srv16_26; + vec_u8_t srv16add1_29 = srv16_26; + vec_u8_t srv16add1_30 = srv16_28; + vec_u8_t srv16add1_31 = srv16_28; + +vec_u8_t vfrac16_0 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_1 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_5 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_6 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_9 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_10 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_13 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_14 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_17 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_18 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_20 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_21 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_22 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_25 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_26 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_28 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_29 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_30 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 
20}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, 
srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); 
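+    /* This block of stores covers rows 16-31 of the 32x32 block: each row is 32 pixels
+     * wide, written as two 16-byte halves at byte offsets 32*row and 32*row + 16 of dst. */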
+ vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void one_ang_pred_altivec<4, 14>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, }; + vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, }; + +/* + vec_u8_t srv_left=vec_xl(8, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_4={0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); /* need to update for each mode y=0, offset[0]; y=1, offset[1]; y=2, offset[2]...*/ + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t vfrac4 = (vec_u8_t){19, 19, 19, 19, 6, 6, 6, 6, 25, 25, 25, 25, 12, 12, 12, 12}; + vec_u8_t vfrac4_32 = (vec_u8_t){13, 13, 13, 13, 26, 26, 26, 26, 7, 7, 7, 7, 20, 20, 20, 20}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 14>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x3, 0x4, 
0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, }; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + +/* + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac8_1 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_2 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac8_3 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 14>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask1={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t 
mask3={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +/*vec_u8_t maskadd1_1={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t maskadd1_2={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_3={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t maskadd1_4={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_5={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_6={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_7={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_8={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left=vec_xl(32, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(10, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); /* 
ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(42, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = srv2; + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = srv4; + vec_u8_t srv6 = srv4; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = srv9; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0_add1; + vec_u8_t srv2_add1 = srv0; + vec_u8_t srv3_add1 = srv0; + vec_u8_t srv4_add1 = srv2; + vec_u8_t srv5_add1 = srv2; + vec_u8_t srv6_add1 = srv2; + vec_u8_t srv7_add1 = srv4; + vec_u8_t srv8_add1 = srv4; + vec_u8_t srv9_add1 = srv7; + vec_u8_t srv10_add1 = srv7; + vec_u8_t srv11_add1 = srv7; + vec_u8_t srv12_add1= srv9; + vec_u8_t srv13_add1 = srv9; + vec_u8_t srv14_add1 = srv12; + vec_u8_t srv15_add1 = srv12; +vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + +vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 
27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 14>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +//vec_u8_t mask1={0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, }; +vec_u8_t mask2={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +//vec_u8_t mask3={0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, }; +vec_u8_t mask4={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; 
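+ /* Reference sketch of what the mask/vfrac tables in this function encode (illustrative only;
+  * the scalar form below is not part of the surrounding code).  Mode 14 uses intra angle -13,
+  * so for each row y of the 32x32 block:
+  *     int offset   = ((y + 1) * -13) >> 5;   // selects which shifted reference vector (maskN) is used
+  *     int fraction = ((y + 1) * -13) & 31;   // per-row weight: 19, 6, 25, 12, 31, 18, ...
+  *     dst[y * 32 + x] = (pixel)(((32 - fraction) * ref[offset + x]
+  *                              + fraction * ref[offset + x + 1] + 16) >> 5);
+  * vfrac16_y / vfrac16_32_y splat fraction and (32 - fraction) across a vector, and the maskN
+  * permutes pick ref[offset + x] and ref[offset + x + 1] out of the gathered reference bytes. */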
+//vec_u8_t mask5={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +//vec_u8_t mask6={0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, }; +vec_u8_t mask7={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +//vec_u8_t mask8={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; +vec_u8_t mask9={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask10={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask11={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask12={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask13={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask14={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask15={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; + +//vec_u8_t mask16={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask17={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask18={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask19={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask20={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask21={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask22={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask23={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask24={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask25={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask26={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask28={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left0 = vec_xl(64, srcPix0); + vec_u8_t srv_left1 = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32_0 ={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x10, 0x11, 0x12, 0x13}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, 
refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(4, srcPix0);; + vec_u8_t s2 = vec_xl(20, srcPix0); + */ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1e, 0x1b, 0x19, 0x16, 0x14, 0x11, 0xf, 0xc, 0xa, 0x7, 0x5, 0x2, 0x00, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0x10, 0x11, 0x12}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(68, srcPix0); + vec_u8_t s2 = vec_xl(84, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = vec_perm(s0, s1, mask2); + vec_u8_t srv3 = srv2; + vec_u8_t srv4 = vec_perm(s0, s1, mask4); + vec_u8_t srv5 = srv4; + vec_u8_t srv6 = srv4; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = vec_perm(s0, s1, mask9); + vec_u8_t srv10 = srv9; + vec_u8_t srv11 = srv9; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv16_1 = srv16_0; + vec_u8_t srv16_2 = vec_perm(s1, s2, mask2); + vec_u8_t srv16_3 = srv16_2; + vec_u8_t srv16_4 = vec_perm(s1, s2, mask4); + vec_u8_t srv16_5 = srv16_4; + vec_u8_t srv16_6 = srv16_4; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = vec_perm(s1, s2, mask9); + vec_u8_t srv16_10 = srv16_9; + vec_u8_t srv16_11 = srv16_9; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = srv16_12; + vec_u8_t srv16_14 = vec_perm(s1, s2, mask14); + vec_u8_t srv16_15 = srv16_14; + + //0(0,1),0,2,2,4,4,4,7,7,9,9,9,12,12,14,14,14,17,17,19,19,19,22,22,24,24,24,27,27,s0,s0,s0 + + vec_u8_t srv16 = srv14; + vec_u8_t srv17 = vec_perm(s0, s1, mask17); + vec_u8_t srv18 = srv17; + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = srv19; + vec_u8_t srv21 = srv19; + vec_u8_t srv22 = vec_perm(s0, s1, mask22); + vec_u8_t srv23 = srv22; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = srv24; + vec_u8_t srv26 = srv24; + vec_u8_t srv27 = vec_perm(s0, s1, mask27); + vec_u8_t srv28 = srv27; + vec_u8_t srv29 = s0; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = srv16_14; + vec_u8_t srv16_17 = vec_perm(s1, s2, mask17); + vec_u8_t srv16_18 = srv16_17; + vec_u8_t srv16_19 = vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = srv16_19; + vec_u8_t srv16_21 = srv16_19; + vec_u8_t srv16_22 = vec_perm(s1, s2, mask22); + vec_u8_t srv16_23 = srv16_22; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = srv16_24; + vec_u8_t srv16_26 = srv16_24; + vec_u8_t srv16_27 = vec_perm(s1, s2, mask27); + vec_u8_t srv16_28 = srv16_27; + vec_u8_t srv16_29 = s1; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1add1 = srv0add1; + vec_u8_t srv2add1 = srv0; + vec_u8_t srv3add1 = srv0; + vec_u8_t srv4add1 = srv2; + vec_u8_t srv5add1 = srv2; + vec_u8_t srv6add1 = srv2; + vec_u8_t srv7add1 = srv4; + vec_u8_t srv8add1 = srv4; + vec_u8_t srv9add1 = srv7; + vec_u8_t srv10add1 = srv7; + vec_u8_t srv11add1 = srv7; + vec_u8_t srv12add1= srv9; + vec_u8_t srv13add1 = srv9; + vec_u8_t srv14add1 = srv12; + vec_u8_t srv15add1 = srv12; + + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv16add1_1 = srv16add1_0; + vec_u8_t srv16add1_2 = 
srv16_0; + vec_u8_t srv16add1_3 = srv16_0; + vec_u8_t srv16add1_4 = srv16_2; + vec_u8_t srv16add1_5 = srv16_2; + vec_u8_t srv16add1_6 = srv16_2; + vec_u8_t srv16add1_7 = srv16_4; + vec_u8_t srv16add1_8 = srv16_4; + vec_u8_t srv16add1_9 = srv16_7; + vec_u8_t srv16add1_10 = srv16_7; + vec_u8_t srv16add1_11 = srv16_7; + vec_u8_t srv16add1_12= srv16_9; + vec_u8_t srv16add1_13 = srv16_9; + vec_u8_t srv16add1_14 = srv16_12; + vec_u8_t srv16add1_15 = srv16_12; + + //srv28, s1,s1, 1,1,3,3,6,6,7,7,9,9,11,11,14,15,15,16,16,18,18,20,20,22,22,24,24,26,26,28,28, + //0,0,2,2,2,4,4,7,7,7,9,9,12,12,12,14,14,17,17,17,19,19,22,22,22,24,24,27,27,27, + + vec_u8_t srv16add1 = srv12; + vec_u8_t srv17add1 = srv14; + vec_u8_t srv18add1 = srv14; + vec_u8_t srv19add1 = srv17; + vec_u8_t srv20add1 = srv17; + vec_u8_t srv21add1 = srv17; + vec_u8_t srv22add1 = srv19; + vec_u8_t srv23add1 = srv19; + vec_u8_t srv24add1 = srv22; + vec_u8_t srv25add1 = srv22; + vec_u8_t srv26add1 = srv22; + vec_u8_t srv27add1 = srv24; + vec_u8_t srv28add1 = srv24; + vec_u8_t srv29add1 = srv27; + vec_u8_t srv30add1 = srv27; + vec_u8_t srv31add1 = srv27; + + vec_u8_t srv16add1_16 = srv16_12; + vec_u8_t srv16add1_17 = srv16_14; + vec_u8_t srv16add1_18 = srv16_14; + vec_u8_t srv16add1_19 = srv16_17; + vec_u8_t srv16add1_20 = srv16_17; + vec_u8_t srv16add1_21 = srv16_17; + vec_u8_t srv16add1_22 = srv16_19; + vec_u8_t srv16add1_23 = srv16_19; + vec_u8_t srv16add1_24 = srv16_22; + vec_u8_t srv16add1_25 = srv16_22; + vec_u8_t srv16add1_26 = srv16_22; + vec_u8_t srv16add1_27 = srv16_24; + vec_u8_t srv16add1_28 = srv16_24; + vec_u8_t srv16add1_29 = srv16_27; + vec_u8_t srv16add1_30 = srv16_27; + vec_u8_t srv16add1_31 = srv16_27; + +vec_u8_t vfrac16_0 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_5 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_6 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_9 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_10 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_13 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_14 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_17 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_18 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 
28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_20 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_21 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_22 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_25 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_26 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_28 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_29 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_30 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + 
vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + 
vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<4, 13>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x0, 0x1, 0x2, 0x3, }; + vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x2, 0x3, 0x4, 0x5, 0x1, 0x2, 0x3, 0x4, }; + +/* + vec_u8_t srv_left=vec_xl(8, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_4={0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t vfrac4 = (vec_u8_t){23, 23, 23, 23, 14, 14, 14, 14, 5, 5, 5, 5, 28, 28, 28, 28}; + vec_u8_t vfrac4_32 = (vec_u8_t){9, 9, 9, 9, 18, 18, 18, 18, 27, 27, 27, 27, 4, 4, 4, 4}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); /* (32 - fraction) * ref[offset + x], x=0-3 */ + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); /* fraction * ref[offset + x + 1], x=0-3 */ + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 13>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, }; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 
0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + +/* + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + + +vec_u8_t vfrac8_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac8_1 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_2 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_3 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 13>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask5={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 
0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +//vec_u8_t mask13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +//vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; + +vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +/*vec_u8_t maskadd1_1={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_2={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t maskadd1_3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_6={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t maskadd1_7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left=vec_xl(32, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(12, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0xe, 0xb, 0x7, 0x4, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(44, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = srv3; + vec_u8_t srv6 = srv3; + vec_u8_t srv7 = 
vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = srv7; + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= srv10; + vec_u8_t srv13 = srv10; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0_add1; + vec_u8_t srv2_add1 = srv0_add1; + vec_u8_t srv3_add1 = srv0; + vec_u8_t srv4_add1 = srv0; + vec_u8_t srv5_add1 = srv0; + vec_u8_t srv6_add1 = srv0; + vec_u8_t srv7_add1 = srv3; + vec_u8_t srv8_add1 = srv3; + vec_u8_t srv9_add1 = srv3; + vec_u8_t srv10_add1 = srv7; + vec_u8_t srv11_add1 = srv7; + vec_u8_t srv12_add1= srv7; + vec_u8_t srv13_add1 = srv7; + vec_u8_t srv14_add1 = srv10; + vec_u8_t srv15_add1 = srv10; +vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t 
vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 13>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask1={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +//vec_u8_t mask2={0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, }; +vec_u8_t mask3={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask4={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask5={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +//vec_u8_t mask6={0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, }; +vec_u8_t mask7={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask8={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +//vec_u8_t mask9={0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, }; +vec_u8_t mask10={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 
0x12, 0x13, 0x14, }; +//vec_u8_t mask11={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask12={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +//vec_u8_t mask13={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; +vec_u8_t mask14={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +//vec_u8_t mask15={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; + +//vec_u8_t mask16={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask17={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask18={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask19={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +//vec_u8_t mask20={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask21={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask22={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +//vec_u8_t mask23={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +/*vec_u8_t mask25={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask26={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask27={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/ +vec_u8_t maskadd1_0={0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left0 = vec_xl(64, srcPix0); + vec_u8_t srv_left1 = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32_0 ={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(8, srcPix0);; + vec_u8_t s2 = vec_xl(24, srcPix0); +*/ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1c, 0x19, 0x15, 0x12, 0xe, 0xb, 0x7, 0x4, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(72, srcPix0); + vec_u8_t s2 = vec_xl(88, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, 
mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = vec_perm(s0, s1, mask3); + vec_u8_t srv4 = srv3; + vec_u8_t srv5 = srv3; + vec_u8_t srv6 = srv3; + vec_u8_t srv7 = vec_perm(s0, s1, mask7); + vec_u8_t srv8 = srv7; + vec_u8_t srv9 = srv7; + vec_u8_t srv10 = vec_perm(s0, s1, mask10); + vec_u8_t srv11 = srv10; + vec_u8_t srv12= srv10; + vec_u8_t srv13 = srv10; + vec_u8_t srv14 = vec_perm(s0, s1, mask14); + vec_u8_t srv15 = srv14; + + //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0 + + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv16_1 = srv16_0; + vec_u8_t srv16_2 = srv16_0; + vec_u8_t srv16_3 = vec_perm(s1, s2, mask3); + vec_u8_t srv16_4 = srv16_3; + vec_u8_t srv16_5 = srv16_3; + vec_u8_t srv16_6 = srv16_3; + vec_u8_t srv16_7 = vec_perm(s1, s2, mask7); + vec_u8_t srv16_8 = srv16_7; + vec_u8_t srv16_9 = srv16_7; + vec_u8_t srv16_10 = vec_perm(s1, s2, mask10); + vec_u8_t srv16_11 = srv16_10; + vec_u8_t srv16_12= srv16_10; + vec_u8_t srv16_13 = srv16_10; + vec_u8_t srv16_14 = vec_perm(s1, s2, mask14); + vec_u8_t srv16_15 = srv16_14; + + vec_u8_t srv16 = srv14; + vec_u8_t srv17 = vec_perm(s0, s1, mask17); + vec_u8_t srv18 = srv17; + vec_u8_t srv19 = srv17; + vec_u8_t srv20 = srv17; + vec_u8_t srv21 = vec_perm(s0, s1, mask21); + vec_u8_t srv22 = srv21; + vec_u8_t srv23 = srv21; + vec_u8_t srv24 = vec_perm(s0, s1, mask24); + vec_u8_t srv25 = srv24; + vec_u8_t srv26 = srv24; + vec_u8_t srv27 = srv24; + vec_u8_t srv28 = s0; + vec_u8_t srv29 = s0; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = srv16_14; + vec_u8_t srv16_17 = vec_perm(s1, s2, mask17); + vec_u8_t srv16_18 = srv16_17; + vec_u8_t srv16_19 = srv16_17; + vec_u8_t srv16_20 = srv16_17; + vec_u8_t srv16_21 = vec_perm(s1, s2, mask21); + vec_u8_t srv16_22 = srv16_21; + vec_u8_t srv16_23 = srv16_21; + vec_u8_t srv16_24 = vec_perm(s1, s2, mask24); + vec_u8_t srv16_25 = srv16_24; + vec_u8_t srv16_26 = srv16_24; + vec_u8_t srv16_27 = srv16_24; + vec_u8_t srv16_28 = s1; + vec_u8_t srv16_29 = s1; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1add1 = srv0add1; + vec_u8_t srv2add1 = srv0add1; + vec_u8_t srv3add1 = srv0; + vec_u8_t srv4add1 = srv0; + vec_u8_t srv5add1 = srv0; + vec_u8_t srv6add1 = srv0; + vec_u8_t srv7add1 = srv3; + vec_u8_t srv8add1 = srv3; + vec_u8_t srv9add1 = srv3; + vec_u8_t srv10add1 = srv7; + vec_u8_t srv11add1 = srv7; + vec_u8_t srv12add1= srv7; + vec_u8_t srv13add1 = srv7; + vec_u8_t srv14add1 = srv10; + vec_u8_t srv15add1 = srv10; + //0,0,0,0,3,3,3,7,7,7,7,10,10,10,14,14,14,14,17,17,17,21,21,21,21,24,24,24,24, + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv16add1_1 = srv16add1_0; + vec_u8_t srv16add1_2 = srv16add1_0; + vec_u8_t srv16add1_3 = srv16_0; + vec_u8_t srv16add1_4 = srv16_0; + vec_u8_t srv16add1_5 = srv16_0; + vec_u8_t srv16add1_6 = srv16_0; + vec_u8_t srv16add1_7 = srv16_3; + vec_u8_t srv16add1_8 = srv16_3; + vec_u8_t srv16add1_9 = srv16_3; + vec_u8_t srv16add1_10 = srv16_7; + vec_u8_t srv16add1_11 = srv16_7; + vec_u8_t srv16add1_12= srv16_7; + vec_u8_t srv16add1_13 = srv16_7; + vec_u8_t srv16add1_14 = srv16_10; + vec_u8_t srv16add1_15 = srv16_10; + + vec_u8_t srv16add1 = srv10; + vec_u8_t srv17add1 = srv14; + vec_u8_t srv18add1 = srv14; + vec_u8_t srv19add1 = srv14; + vec_u8_t srv20add1 = srv14; + vec_u8_t srv21add1 = srv17; + vec_u8_t srv22add1 = srv17; + vec_u8_t srv23add1 = srv17; + vec_u8_t srv24add1 = srv21; + vec_u8_t 
srv25add1 = srv21; + vec_u8_t srv26add1 = srv21; + vec_u8_t srv27add1 = srv21; + vec_u8_t srv28add1 = srv24; + vec_u8_t srv29add1 = srv24; + vec_u8_t srv30add1 = srv24; + vec_u8_t srv31add1 = srv24; + + vec_u8_t srv16add1_16 = srv16_10; + vec_u8_t srv16add1_17 = srv16_14; + vec_u8_t srv16add1_18 = srv16_14; + vec_u8_t srv16add1_19 = srv16_14; + vec_u8_t srv16add1_20 = srv16_14; + vec_u8_t srv16add1_21 = srv16_17; + vec_u8_t srv16add1_22 = srv16_17; + vec_u8_t srv16add1_23 = srv16_17; + vec_u8_t srv16add1_24 = srv16_21; + vec_u8_t srv16add1_25 = srv16_21; + vec_u8_t srv16add1_26 = srv16_21; + vec_u8_t srv16add1_27 = srv16_21; + vec_u8_t srv16add1_28 = srv16_24; + vec_u8_t srv16add1_29 = srv16_24; + vec_u8_t srv16add1_30 = srv16_24; + vec_u8_t srv16add1_31 = srv16_24; + +vec_u8_t vfrac16_0 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_1 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_2 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_3 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_4 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_5 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_6 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_9 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_10 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_11 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_12 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_13 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_14 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_17 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_18 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_19 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_20 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_21 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_22 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_25 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_26 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_27 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_28 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_29 = (vec_u8_t){18, 18, 
18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_30 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, 
vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, 
vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void one_ang_pred_altivec<4, 12>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 
5, 5, 5, 5}; + vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, }; + vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, }; + + //vec_u8_t srv = vec_xl(0, srcPix0); + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + + vec_u8_t vfrac4 = (vec_u8_t){27, 27, 27, 27, 22, 22, 22, 22, 17, 17, 17, 17, 12, 12, 12, 12}; + vec_u8_t vfrac4_32 = (vec_u8_t){5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 12>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask2={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, }; +vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; +vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + +/* + vec_u8_t srv_left=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_8={0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + 
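+ /* Scalar sketch of what the one_line() steps below compute.  `frac`, `off`
+  * and the loop-style expression are illustrative names only, not identifiers
+  * used by this patch; the constant tables are consistent with
+  * f32[y] == 32 - f[y], so one output sample is effectively
+  *
+  *   dst[y * dstStride + x] =
+  *       (pixel)(((32 - frac[y]) * ref[off + x] + frac[y] * ref[off + x + 1] + 16) >> 5);
+  *
+  * The vector form multiplies the even and odd byte lanes separately with
+  * vec_mule/vec_mulo, adds the rounding constant 16, shifts right by 5 and
+  * repacks the 16-bit intermediates into bytes (see the fully expanded
+  * 4x4 case above for that sequence).
+  */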
+vec_u8_t vfrac8_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac8_1 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_2 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac8_3 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<16, 12>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +/*vec_u8_t mask1={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask2={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask3={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask4={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask5={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };*/ +vec_u8_t mask6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +/*vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask8={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ +vec_u8_t mask12={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +/*vec_u8_t mask13={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/ + +vec_u8_t maskadd1_0={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +/*vec_u8_t maskadd1_1={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_2={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_3={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_4={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_5={0x3, 
0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t maskadd1_6={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_7={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_8={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_9={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_10={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_11={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t maskadd1_12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left=vec_xl(32, srcPix0); + vec_u8_t srv_right=vec_xl(0, srcPix0); + vec_u8_t refmask_16={0xd, 0x6, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(14, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(33, srcPix0); + vec_u8_t refmask_16={0xd, 0x6, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(46, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = srv0; + vec_u8_t srv4 = srv0; + vec_u8_t srv5 = srv0; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = srv6; + vec_u8_t srv8 = srv6; + vec_u8_t srv9 = srv6; + vec_u8_t srv10 = srv6; + vec_u8_t srv11 = srv6; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = srv12; + vec_u8_t srv15 = srv12; + + vec_u8_t srv0_add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1_add1 = srv0_add1; + vec_u8_t srv2_add1 = srv0_add1; + vec_u8_t srv3_add1 = srv0_add1; + vec_u8_t srv4_add1 = srv0_add1; + vec_u8_t srv5_add1 = srv0_add1; + vec_u8_t srv6_add1 = srv0; + vec_u8_t srv7_add1 = srv0; + vec_u8_t srv8_add1 = srv0; + vec_u8_t srv9_add1 = srv0; + vec_u8_t srv10_add1 = srv0; + vec_u8_t srv11_add1 = srv0; + vec_u8_t srv12_add1= srv6; + vec_u8_t srv13_add1 = srv6; + vec_u8_t srv14_add1 = srv6; + vec_u8_t srv15_add1 = srv6; +vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 
24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv0_add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv1, srv1_add1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv2, srv2_add1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv3, srv3_add1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv4, srv4_add1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv5, srv5_add1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv6, srv6_add1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv7, srv7_add1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv8, srv8_add1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv9, srv9_add1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv10, srv10_add1, vfrac16_32_10, vfrac16_10, vout_10); + one_line(srv11, srv11_add1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv12, srv12_add1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv13, srv13_add1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv14, 
srv14_add1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv15, srv15_add1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 12>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +vec_u8_t mask0={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +/*vec_u8_t mask1={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask2={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask3={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask4={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, }; +vec_u8_t mask5={0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, };*/ +vec_u8_t mask6={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +/*vec_u8_t mask7={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask8={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask9={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask10={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, }; +vec_u8_t mask11={0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, };*/ +vec_u8_t mask12={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +/*vec_u8_t mask13={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask14={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask15={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; + +vec_u8_t mask16={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask17={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; +vec_u8_t mask18={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, };*/ +vec_u8_t mask19={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +/*vec_u8_t mask20={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask21={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask22={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask23={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask24={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t mask25={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 
0xd, 0xe, 0xf, }; +vec_u8_t mask26={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask27={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask28={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask29={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask30={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask31={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/ + +vec_u8_t maskadd1_0={0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, 0x12, 0x13, 0x14, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + +/* + vec_u8_t srv_left0 = vec_xl(64, srcPix0); + vec_u8_t srv_left1 = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32_0 ={0x1a, 0x13, 0xd, 0x6, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}; + vec_u8_t refmask_32_1 ={0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(12, srcPix0); + vec_u8_t s2 = vec_xl(28, srcPix0); +*/ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x1a, 0x13, 0xd, 0x6, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x2, 0x3, 0x4, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(76, srcPix0); + vec_u8_t s2 = vec_xl(92, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv1 = srv0; + vec_u8_t srv2 = srv0; + vec_u8_t srv3 = srv0; + vec_u8_t srv4 = srv0; + vec_u8_t srv5 = srv0; + vec_u8_t srv6 = vec_perm(s0, s1, mask6); + vec_u8_t srv7 = srv6; + vec_u8_t srv8 = srv6; + vec_u8_t srv9 = srv6; + vec_u8_t srv10 = srv6; + vec_u8_t srv11 = srv6; + vec_u8_t srv12= vec_perm(s0, s1, mask12); + vec_u8_t srv13 = srv12; + vec_u8_t srv14 = srv12; + vec_u8_t srv15 = srv12; + + //0,0,0,3,3,3,3,7,7,7,10,10,10,10,14,14,14,17,17,17,17,21,21,21,24,24,24,24,s0,s0,s0,s0 + + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv16_1 = srv16_0; + vec_u8_t srv16_2 = srv16_0; + vec_u8_t srv16_3 = srv16_0; + vec_u8_t srv16_4 = srv16_0; + vec_u8_t srv16_5 = srv16_0; + vec_u8_t srv16_6 = vec_perm(s1, s2, mask6); + vec_u8_t srv16_7 = srv16_6; + vec_u8_t srv16_8 = srv16_6; + vec_u8_t srv16_9 = srv16_6; + vec_u8_t srv16_10 = srv16_6; + vec_u8_t srv16_11 = srv16_6; + vec_u8_t srv16_12= vec_perm(s1, s2, mask12); + vec_u8_t srv16_13 = srv16_12; + vec_u8_t srv16_14 = srv16_12; + vec_u8_t srv16_15 = srv16_12; + + vec_u8_t srv16 = srv12; + vec_u8_t srv17 = srv12; + vec_u8_t srv18 = srv12; + vec_u8_t srv19 = vec_perm(s0, s1, mask19); + vec_u8_t srv20 = srv19; + vec_u8_t srv21 = srv19; + vec_u8_t srv22 = srv19; + vec_u8_t srv23 = srv19; + vec_u8_t srv24 = srv19; + vec_u8_t srv25 = s0; + vec_u8_t srv26 = s0; + vec_u8_t srv27 = s0; + vec_u8_t srv28 = s0; + vec_u8_t srv29 = s0; + vec_u8_t srv30 = s0; + vec_u8_t srv31 = s0; + + vec_u8_t srv16_16 = srv16_12; + vec_u8_t srv16_17 = srv16_12; + vec_u8_t srv16_18 = srv16_12; + vec_u8_t srv16_19 = 
vec_perm(s1, s2, mask19); + vec_u8_t srv16_20 = srv16_19; + vec_u8_t srv16_21 = srv16_19; + vec_u8_t srv16_22 = srv16_19; + vec_u8_t srv16_23 = srv16_19; + vec_u8_t srv16_24 = srv16_19; + vec_u8_t srv16_25 = s1; + vec_u8_t srv16_26 = s1; + vec_u8_t srv16_27 = s1; + vec_u8_t srv16_28 = s1; + vec_u8_t srv16_29 = s1; + vec_u8_t srv16_30 = s1; + vec_u8_t srv16_31 = s1; + + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv1add1 = srv0add1; + vec_u8_t srv2add1 = srv0add1; + vec_u8_t srv3add1 = srv0add1; + vec_u8_t srv4add1 = srv0add1; + vec_u8_t srv5add1 = srv0add1; + vec_u8_t srv6add1 = srv0; + vec_u8_t srv7add1 = srv0; + vec_u8_t srv8add1 = srv0; + vec_u8_t srv9add1 = srv0; + vec_u8_t srv10add1 = srv0; + vec_u8_t srv11add1 = srv0; + vec_u8_t srv12add1= srv6; + vec_u8_t srv13add1 = srv6; + vec_u8_t srv14add1 = srv6; + vec_u8_t srv15add1 = srv6; + + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + vec_u8_t srv16add1_1 = srv16add1_0; + vec_u8_t srv16add1_2 = srv16add1_0; + vec_u8_t srv16add1_3 = srv16add1_0; + vec_u8_t srv16add1_4 = srv16add1_0; + vec_u8_t srv16add1_5 = srv16add1_0; + vec_u8_t srv16add1_6 = srv16_0; + vec_u8_t srv16add1_7 = srv16_0; + vec_u8_t srv16add1_8 = srv16_0; + vec_u8_t srv16add1_9 = srv16_0; + vec_u8_t srv16add1_10 = srv16_0; + vec_u8_t srv16add1_11 = srv16_0; + vec_u8_t srv16add1_12= srv16_6; + vec_u8_t srv16add1_13 = srv16_6; + vec_u8_t srv16add1_14 = srv16_6; + vec_u8_t srv16add1_15 = srv16_6; + + vec_u8_t srv16add1 = srv6; + vec_u8_t srv17add1 = srv6; + vec_u8_t srv18add1 = srv6; + vec_u8_t srv19add1 = srv12; + vec_u8_t srv20add1 = srv12; + vec_u8_t srv21add1 = srv12; + vec_u8_t srv22add1 = srv12; + vec_u8_t srv23add1 = srv12; + vec_u8_t srv24add1 = srv12; + vec_u8_t srv25add1 = srv19; + vec_u8_t srv26add1 = srv19; + vec_u8_t srv27add1 = srv19; + vec_u8_t srv28add1 = srv19; + vec_u8_t srv29add1 = srv19; + vec_u8_t srv30add1 = srv19; + vec_u8_t srv31add1 = srv19; + + vec_u8_t srv16add1_16 = srv16_6; + vec_u8_t srv16add1_17 = srv16_6; + vec_u8_t srv16add1_18 = srv16_6; + vec_u8_t srv16add1_19 = srv16_12; + vec_u8_t srv16add1_20 = srv16_12; + vec_u8_t srv16add1_21 = srv16_12; + vec_u8_t srv16add1_22 = srv16_12; + vec_u8_t srv16add1_23 = srv16_12; + vec_u8_t srv16add1_24 = srv16_12; + vec_u8_t srv16add1_25 = srv16_19; + vec_u8_t srv16add1_26 = srv16_19; + vec_u8_t srv16add1_27 = srv16_19; + vec_u8_t srv16add1_28 = srv16_19; + vec_u8_t srv16add1_29 = srv16_19; + vec_u8_t srv16add1_30 = srv16_19; + vec_u8_t srv16add1_31 = srv16_19; + +vec_u8_t vfrac16_0 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_1 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_2 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_3 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_4 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_5 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_6 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_7 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_8 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_9 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_10 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; 
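+/* Note on this weight table: each vfrac16_N vector splats the interpolation
+ * fraction for row N across all 16 lanes, and the matching vfrac16_32_N
+ * vector holds 32 minus that fraction (27/5, 22/10, 17/15, ...).  The values
+ * appear to follow the HEVC angular table for mode 12 (intraPredAngle = -5),
+ * i.e. roughly:
+ *
+ *   // illustrative sketch only, not code used by the patch
+ *   int fact(int y)   { return ((y + 1) * -5) & 31; }  // 27, 22, 17, 12, 7, 2, 29, ...
+ *   int fact32(int y) { return 32 - fact(y); }         // 5, 10, 15, 20, 25, 30, 3, ...
+ */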
+vec_u8_t vfrac16_11 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_12 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_13 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_14 = (vec_u8_t){21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_16 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_17 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_18 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_19 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_20 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_21 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_22 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_23 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_24 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_25 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_26 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_27 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_28 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_29 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_30 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_31 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_16 = (vec_u8_t){21, 21, 
21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21}; +vec_u8_t vfrac16_32_17 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_18 = (vec_u8_t){31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31}; +vec_u8_t vfrac16_32_19 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_20 = (vec_u8_t){9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9}; +vec_u8_t vfrac16_32_21 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_22 = (vec_u8_t){19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19}; +vec_u8_t vfrac16_32_23 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_24 = (vec_u8_t){29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29}; +vec_u8_t vfrac16_32_25 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_26 = (vec_u8_t){7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7}; +vec_u8_t vfrac16_32_27 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_28 = (vec_u8_t){17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17}; +vec_u8_t vfrac16_32_29 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_30 = (vec_u8_t){27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27}; +vec_u8_t vfrac16_32_31 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv1, srv1add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_1, srv16add1_1, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv2, srv2add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_2, srv16add1_2, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv3, srv3add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_3, srv16add1_3, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv4, srv4add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_4, srv16add1_4, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv5, srv5add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_5, srv16add1_5, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv6, srv6add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_6, srv16add1_6, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv7, srv7add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_7, srv16add1_7, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv8, srv8add1, vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_8, srv16add1_8, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv9, srv9add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_9, srv16add1_9, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv10, srv10add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_10, srv16add1_10, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv11, srv11add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_11, 
srv16add1_11, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv12, srv12add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_12, srv16add1_12, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv13, srv13add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_13, srv16add1_13, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv14, srv14add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_14, srv16add1_14, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv15, srv15add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_15, srv16add1_15, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(srv16, srv16add1, vfrac16_32_16, vfrac16_16, vout_0); + one_line(srv16_16, srv16add1_16, vfrac16_32_16, vfrac16_16, vout_1); + + one_line(srv17, srv17add1, vfrac16_32_17, vfrac16_17, vout_2); + one_line(srv16_17, srv16add1_17, vfrac16_32_17, vfrac16_17, vout_3); + + one_line(srv18, srv18add1, vfrac16_32_18, vfrac16_18, vout_4); + one_line(srv16_18, srv16add1_18, vfrac16_32_18, vfrac16_18, vout_5); + + one_line(srv19, srv19add1, vfrac16_32_19, vfrac16_19, vout_6); + one_line(srv16_19, srv16add1_19, vfrac16_32_19, vfrac16_19, vout_7); + + one_line(srv20, srv20add1, vfrac16_32_20, vfrac16_20, vout_8); + one_line(srv16_20, srv16add1_20, vfrac16_32_20, vfrac16_20, vout_9); + + one_line(srv21, srv21add1, vfrac16_32_21, vfrac16_21, vout_10); + one_line(srv16_21, srv16add1_21, vfrac16_32_21, vfrac16_21, vout_11); + + one_line(srv22, srv22add1, vfrac16_32_22, vfrac16_22, vout_12); + one_line(srv16_22, srv16add1_22, vfrac16_32_22, vfrac16_22, vout_13); + + one_line(srv23, srv23add1, vfrac16_32_23, vfrac16_23, vout_14); + one_line(srv16_23, srv16add1_23, vfrac16_32_23, vfrac16_23, vout_15); + + one_line(srv24, srv24add1, vfrac16_32_24, vfrac16_24, vout_16); + one_line(srv16_24, srv16add1_24, vfrac16_32_24, vfrac16_24, vout_17); + + one_line(srv25, srv25add1, vfrac16_32_25, vfrac16_25, vout_18); + one_line(srv16_25, srv16add1_25, vfrac16_32_25, vfrac16_25, vout_19); + + one_line(srv26, srv26add1, vfrac16_32_26, vfrac16_26, vout_20); + one_line(srv16_26, srv16add1_26, vfrac16_32_26, vfrac16_26, vout_21); + + one_line(srv27, srv27add1, vfrac16_32_27, vfrac16_27, vout_22); + one_line(srv16_27, srv16add1_27, vfrac16_32_27, vfrac16_27, vout_23); + + one_line(srv28, srv28add1, vfrac16_32_28, vfrac16_28, vout_24); + one_line(srv16_28, srv16add1_28, vfrac16_32_28, vfrac16_28, vout_25); + + one_line(srv29, srv29add1, vfrac16_32_29, vfrac16_29, vout_26); + one_line(srv16_29, 
srv16add1_29, vfrac16_32_29, vfrac16_29, vout_27); + + one_line(srv30, srv30add1, vfrac16_32_30, vfrac16_30, vout_28); + one_line(srv16_30, srv16add1_30, vfrac16_32_30, vfrac16_30, vout_29); + + one_line(srv31, srv31add1, vfrac16_32_31, vfrac16_31, vout_30); + one_line(srv16_31, srv16add1_31, vfrac16_32_31, vfrac16_31, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void one_ang_pred_altivec<4, 11>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, 0x0, 0x1, 0x2, 0x3, }; + vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, 0x1, 0x2, 0x3, 0x4, }; +/* + vec_u8_t srv=vec_xl(0, srcPix0); +*/ + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(9, srcPix0); + vec_u8_t refmask_4={0x00, 0x10, 0x11, 0x12, 0x13, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_4); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t vfrac4 = (vec_u8_t){30, 30, 30, 30, 28, 28, 28, 28, 26, 26, 26, 26, 24, 24, 24, 24}; + vec_u8_t vfrac4_32 = (vec_u8_t){2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8}; + + vec_u16_t vmle0 = vec_mule(srv0, vfrac4_32); + vec_u16_t vmlo0 = vec_mulo(srv0, vfrac4_32); + vec_u16_t vmle1 = vec_mule(srv1, vfrac4); + vec_u16_t vmlo1 = vec_mulo(srv1, vfrac4); + vec_u16_t vsume = vec_add(vec_add(vmle0, vmle1), u16_16); + vec_u16_t ve = vec_sra(vsume, u16_5); + vec_u16_t vsumo = vec_add(vec_add(vmlo0, vmlo1), u16_16); + vec_u16_t vo = vec_sra(vsumo, u16_5); + vec_u8_t vout = vec_pack(vec_mergeh(ve, vo), vec_mergel(ve, vo)); + + vec_xst(vout, 0, dst); + +#ifdef DEBUG + for (int y = 0; y < 4; y++) + { + for (int x = 0; x < 4; x++) + { + printf("%d ",dst[y * 4 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<8, 11>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; + vec_u8_t mask1={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + vec_u8_t mask2={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 
0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; + vec_u8_t mask3={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + vec_u8_t mask4={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; + vec_u8_t mask5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, }; + vec_u8_t mask7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, }; + + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + vec_u8_t vout_0, vout_1, vout_2, vout_3; + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + + vec_u8_t srv_left=vec_xl(0, srcPix0); + vec_u8_t srv_right=vec_xl(17, srcPix0); + vec_u8_t refmask_8={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; + vec_u8_t srv = vec_perm(srv_left, srv_right, refmask_8); + + vec_u8_t srv0 = vec_perm(srv, srv, mask0); + vec_u8_t srv1 = vec_perm(srv, srv, mask1); + vec_u8_t srv2 = vec_perm(srv, srv, mask2); + vec_u8_t srv3 = vec_perm(srv, srv, mask3); + vec_u8_t srv4 = vec_perm(srv, srv, mask4); + vec_u8_t srv5 = vec_perm(srv, srv, mask5); + vec_u8_t srv6 = vec_perm(srv, srv, mask6); + vec_u8_t srv7 = vec_perm(srv, srv, mask7); + +vec_u8_t vfrac8_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac8_1 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac8_2 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac8_3 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac8_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac8_32_1 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac8_32_2 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac8_32_3 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16}; + +one_line(srv0, srv1, vfrac8_32_0, vfrac8_0, vout_0); +one_line(srv2, srv3, vfrac8_32_1, vfrac8_1, vout_1); +one_line(srv4, srv5, vfrac8_32_2, vfrac8_2, vout_2); +one_line(srv6, srv7, vfrac8_32_3, vfrac8_3, vout_3); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 48, dst); + +#ifdef DEBUG + for (int y = 0; y < 8; y++) + { + for (int x = 0; x < 8; x++) + { + printf("%d ",dst[y * 8 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +template<> +void one_ang_pred_altivec<16, 11>(pixel* dst, const pixel *srcPix0, int bFilter) +{ +/*vec_u8_t mask0={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask1={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask2={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask3={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask4={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask5={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask6={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask7={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; 
+vec_u8_t mask8={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask9={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask10={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask11={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask12={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask13={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask14={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, }; +vec_u8_t mask15={0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, };*/ +vec_u8_t maskadd1_0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +/*vec_u8_t maskadd1_1={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_2={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_3={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_4={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_5={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_6={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_7={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_8={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_9={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_10={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_11={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_12={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_13={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_14={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; +vec_u8_t maskadd1_15={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, };*/ + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; + + vec_u8_t srv_left=vec_xl(0, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t srv_right=vec_xl(33, srcPix0); /* ref[offset + x], ref=srcPix0+1; offset[0-3] = 0 */ + vec_u8_t refmask_16={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_16); + vec_u8_t s1 = vec_xl(48, srcPix0); + + vec_u8_t srv0 = s0; + vec_u8_t srv1 = vec_perm(s0, s1, maskadd1_0); + +vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 
22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; +vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; +vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; +vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; +vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; +vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; +vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; +vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; +vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; +vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; +vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; +vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; +vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; +vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; +vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; +vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; +vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + + one_line(srv0, srv1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv0, srv1, vfrac16_32_1, vfrac16_1, vout_1); + one_line(srv0, srv1, vfrac16_32_2, vfrac16_2, vout_2); + one_line(srv0, srv1, vfrac16_32_3, vfrac16_3, vout_3); + one_line(srv0, srv1, vfrac16_32_4, vfrac16_4, vout_4); + one_line(srv0, srv1, vfrac16_32_5, vfrac16_5, vout_5); + one_line(srv0, srv1, vfrac16_32_6, vfrac16_6, vout_6); + one_line(srv0, srv1, vfrac16_32_7, vfrac16_7, vout_7); + one_line(srv0, srv1, vfrac16_32_8, vfrac16_8, vout_8); + one_line(srv0, srv1, vfrac16_32_9, vfrac16_9, vout_9); + one_line(srv0, srv1, vfrac16_32_10, vfrac16_10, vout_10); + 
one_line(srv0, srv1, vfrac16_32_11, vfrac16_11, vout_11); + one_line(srv0, srv1, vfrac16_32_12, vfrac16_12, vout_12); + one_line(srv0, srv1, vfrac16_32_13, vfrac16_13, vout_13); + one_line(srv0, srv1, vfrac16_32_14, vfrac16_14, vout_14); + one_line(srv0, srv1, vfrac16_32_15, vfrac16_15, vout_15); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 16*2, dst); + vec_xst(vout_3, 16*3, dst); + vec_xst(vout_4, 16*4, dst); + vec_xst(vout_5, 16*5, dst); + vec_xst(vout_6, 16*6, dst); + vec_xst(vout_7, 16*7, dst); + vec_xst(vout_8, 16*8, dst); + vec_xst(vout_9, 16*9, dst); + vec_xst(vout_10, 16*10, dst); + vec_xst(vout_11, 16*11, dst); + vec_xst(vout_12, 16*12, dst); + vec_xst(vout_13, 16*13, dst); + vec_xst(vout_14, 16*14, dst); + vec_xst(vout_15, 16*15, dst); + +#ifdef DEBUG + for (int y = 0; y < 16; y++) + { + for (int x = 0; x < 16; x++) + { + printf("%d ",dst[y * 16 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + +template<> +void one_ang_pred_altivec<32, 11>(pixel* dst, const pixel *srcPix0, int bFilter) +{ + vec_u8_t mask0={0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, }; + vec_u8_t maskadd1_0={0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf, 0x10, 0x11, }; + + vec_u16_t u16_16 = {16, 16, 16, 16, 16, 16, 16, 16}; + vec_u16_t u16_5 = {5, 5, 5, 5, 5, 5, 5, 5}; +/* + vec_u8_t srv_left = vec_xl(80, srcPix0); + vec_u8_t srv_right = vec_xl(0, srcPix0);; + vec_u8_t refmask_32 ={0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e}; + vec_u8_t s0 = vec_perm(srv_left, srv_right, refmask_32); + vec_u8_t s1 = vec_xl(15, srcPix0);; + vec_u8_t s2 = vec_xl(31, srcPix0); +*/ + vec_u8_t srv_left0=vec_xl(0, srcPix0); + vec_u8_t srv_left1=vec_xl(16, srcPix0); + vec_u8_t srv_right=vec_xl(65, srcPix0); + vec_u8_t refmask_32_0={0x10, 0x00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}; + vec_u8_t refmask_32_1={0x0, 0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d}; + vec_u8_t s0 = vec_perm( vec_perm(srv_left0, srv_left1, refmask_32_0), srv_right, refmask_32_1 ); + vec_u8_t s1 = vec_xl(79, srcPix0); + vec_u8_t s2 = vec_xl(95, srcPix0); + + vec_u8_t srv0 = vec_perm(s0, s1, mask0); + vec_u8_t srv16_0 = vec_perm(s1, s2, mask0); + vec_u8_t srv0add1 = vec_perm(s0, s1, maskadd1_0); + vec_u8_t srv16add1_0 = vec_perm(s1, s2, maskadd1_0); + + vec_u8_t vfrac16_0 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_1 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_2 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_3 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_4 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_5 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_6 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_8 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_9 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_10 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t 
vfrac16_11 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_12 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_13 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_14 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_15 = (vec_u8_t){0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + + vec_u8_t vfrac16_32_0 = (vec_u8_t){2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}; + vec_u8_t vfrac16_32_1 = (vec_u8_t){4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4}; + vec_u8_t vfrac16_32_2 = (vec_u8_t){6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6}; + vec_u8_t vfrac16_32_3 = (vec_u8_t){8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8}; + vec_u8_t vfrac16_32_4 = (vec_u8_t){10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10}; + vec_u8_t vfrac16_32_5 = (vec_u8_t){12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12}; + vec_u8_t vfrac16_32_6 = (vec_u8_t){14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14}; + vec_u8_t vfrac16_32_7 = (vec_u8_t){16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16}; + vec_u8_t vfrac16_32_8 = (vec_u8_t){18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18}; + vec_u8_t vfrac16_32_9 = (vec_u8_t){20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20}; + vec_u8_t vfrac16_32_10 = (vec_u8_t){22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22}; + vec_u8_t vfrac16_32_11 = (vec_u8_t){24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24}; + vec_u8_t vfrac16_32_12 = (vec_u8_t){26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26}; + vec_u8_t vfrac16_32_13 = (vec_u8_t){28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28}; + vec_u8_t vfrac16_32_14 = (vec_u8_t){30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30}; + vec_u8_t vfrac16_32_15 = (vec_u8_t){32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32}; + + + /* dst[y * dstStride + x] = (pixel)((f32[y]* ref[0 + x] + f[y] * ref[0 + x + 1] + 16) >> 5 */ + vec_u16_t vmle0, vmlo0, vmle1, vmlo1, vsume, ve, vsumo, vo; + vec_u8_t vout_0, vout_1, vout_2, vout_3, vout_4, vout_5, vout_6, vout_7; + vec_u8_t vout_8, vout_9, vout_10, vout_11, vout_12, vout_13, vout_14, vout_15; + vec_u8_t vout_16, vout_17, vout_18, vout_19, vout_20, vout_21, vout_22, vout_23; + vec_u8_t vout_24, vout_25, vout_26, vout_27, vout_28, vout_29, vout_30, vout_31; + + one_line(srv0, srv0add1, vfrac16_32_0, vfrac16_0, vout_0); + one_line(srv16_0, srv16add1_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(srv0, srv0add1, vfrac16_32_1, vfrac16_1, vout_2); + one_line(srv16_0, srv16add1_0, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(srv0, srv0add1, vfrac16_32_2, vfrac16_2, vout_4); + one_line(srv16_0, srv16add1_0, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(srv0, srv0add1, vfrac16_32_3, vfrac16_3, vout_6); + one_line(srv16_0, srv16add1_0, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(srv0, srv0add1, vfrac16_32_4, vfrac16_4, vout_8); + one_line(srv16_0, srv16add1_0, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(srv0, srv0add1, vfrac16_32_5, vfrac16_5, vout_10); + one_line(srv16_0, srv16add1_0, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(srv0, srv0add1, vfrac16_32_6, vfrac16_6, vout_12); + one_line(srv16_0, srv16add1_0, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(srv0, srv0add1, vfrac16_32_7, vfrac16_7, vout_14); + one_line(srv16_0, srv16add1_0, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(srv0, srv0add1, 
vfrac16_32_8, vfrac16_8, vout_16); + one_line(srv16_0, srv16add1_0, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(srv0, srv0add1, vfrac16_32_9, vfrac16_9, vout_18); + one_line(srv16_0, srv16add1_0, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(srv0, srv0add1, vfrac16_32_10, vfrac16_10, vout_20); + one_line(srv16_0, srv16add1_0, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(srv0, srv0add1, vfrac16_32_11, vfrac16_11, vout_22); + one_line(srv16_0, srv16add1_0, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(srv0, srv0add1, vfrac16_32_12, vfrac16_12, vout_24); + one_line(srv16_0, srv16add1_0, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(srv0, srv0add1, vfrac16_32_13, vfrac16_13, vout_26); + one_line(srv16_0, srv16add1_0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(srv0, srv0add1, vfrac16_32_14, vfrac16_14, vout_28); + one_line(srv16_0, srv16add1_0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(srv0, srv0add1, vfrac16_32_15, vfrac16_15, vout_30); + one_line(srv16_0, srv16add1_0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 0, dst); + vec_xst(vout_1, 16, dst); + vec_xst(vout_2, 32, dst); + vec_xst(vout_3, 32+16, dst); + vec_xst(vout_4, 32*2, dst); + vec_xst(vout_5, 32*2+16, dst); + vec_xst(vout_6, 32*3, dst); + vec_xst(vout_7, 32*3+16, dst); + vec_xst(vout_8, 32*4, dst); + vec_xst(vout_9, 32*4+16, dst); + vec_xst(vout_10, 32*5, dst); + vec_xst(vout_11, 32*5+16, dst); + vec_xst(vout_12, 32*6, dst); + vec_xst(vout_13, 32*6+16, dst); + vec_xst(vout_14, 32*7, dst); + vec_xst(vout_15, 32*7+16, dst); + vec_xst(vout_16, 32*8, dst); + vec_xst(vout_17, 32*8+16, dst); + vec_xst(vout_18, 32*9, dst); + vec_xst(vout_19, 32*9+16, dst); + vec_xst(vout_20, 32*10, dst); + vec_xst(vout_21, 32*10+16, dst); + vec_xst(vout_22, 32*11, dst); + vec_xst(vout_23, 32*11+16, dst); + vec_xst(vout_24, 32*12, dst); + vec_xst(vout_25, 32*12+16, dst); + vec_xst(vout_26, 32*13, dst); + vec_xst(vout_27, 32*13+16, dst); + vec_xst(vout_28, 32*14, dst); + vec_xst(vout_29, 32*14+16, dst); + vec_xst(vout_30, 32*15, dst); + vec_xst(vout_31, 32*15+16, dst); + + one_line(s0, srv0, vfrac16_32_0, vfrac16_0, vout_0); + one_line(s1, srv16_0, vfrac16_32_0, vfrac16_0, vout_1); + + one_line(s0, srv0, vfrac16_32_1, vfrac16_1, vout_2); + one_line(s1, srv16_0, vfrac16_32_1, vfrac16_1, vout_3); + + one_line(s0, srv0, vfrac16_32_2, vfrac16_2, vout_4); + one_line(s1, srv16_0, vfrac16_32_2, vfrac16_2, vout_5); + + one_line(s0, srv0, vfrac16_32_3, vfrac16_3, vout_6); + one_line(s1, srv16_0, vfrac16_32_3, vfrac16_3, vout_7); + + one_line(s0, srv0, vfrac16_32_4, vfrac16_4, vout_8); + one_line(s1, srv16_0, vfrac16_32_4, vfrac16_4, vout_9); + + one_line(s0, srv0, vfrac16_32_5, vfrac16_5, vout_10); + one_line(s1, srv16_0, vfrac16_32_5, vfrac16_5, vout_11); + + one_line(s0, srv0, vfrac16_32_6, vfrac16_6, vout_12); + one_line(s1, srv16_0, vfrac16_32_6, vfrac16_6, vout_13); + + one_line(s0, srv0, vfrac16_32_7, vfrac16_7, vout_14); + one_line(s1, srv16_0, vfrac16_32_7, vfrac16_7, vout_15); + + one_line(s0, srv0, vfrac16_32_8, vfrac16_8, vout_16); + one_line(s1, srv16_0, vfrac16_32_8, vfrac16_8, vout_17); + + one_line(s0, srv0, vfrac16_32_9, vfrac16_9, vout_18); + one_line(s1, srv16_0, vfrac16_32_9, vfrac16_9, vout_19); + + one_line(s0, srv0, vfrac16_32_10, vfrac16_10, vout_20); + one_line(s1, srv16_0, vfrac16_32_10, vfrac16_10, vout_21); + + one_line(s0, srv0, vfrac16_32_11, vfrac16_11, vout_22); + one_line(s1, srv16_0, vfrac16_32_11, vfrac16_11, vout_23); + + one_line(s0, srv0, vfrac16_32_12, vfrac16_12, vout_24); + 
one_line(s1, srv16_0, vfrac16_32_12, vfrac16_12, vout_25); + + one_line(s0, srv0, vfrac16_32_13, vfrac16_13, vout_26); + one_line(s1, srv16_0, vfrac16_32_13, vfrac16_13, vout_27); + + one_line(s0, srv0, vfrac16_32_14, vfrac16_14, vout_28); + one_line(s1, srv16_0, vfrac16_32_14, vfrac16_14, vout_29); + + one_line(s0, srv0, vfrac16_32_15, vfrac16_15, vout_30); + one_line(s1, srv16_0, vfrac16_32_15, vfrac16_15, vout_31); + + vec_xst(vout_0, 32*16, dst); + vec_xst(vout_1, 32*16+16, dst); + vec_xst(vout_2, 32*17, dst); + vec_xst(vout_3, 32*17+16, dst); + vec_xst(vout_4, 32*18, dst); + vec_xst(vout_5, 32*18+16, dst); + vec_xst(vout_6, 32*19, dst); + vec_xst(vout_7, 32*19+16, dst); + vec_xst(vout_8, 32*20, dst); + vec_xst(vout_9, 32*20+16, dst); + vec_xst(vout_10, 32*21, dst); + vec_xst(vout_11, 32*21+16, dst); + vec_xst(vout_12, 32*22, dst); + vec_xst(vout_13, 32*22+16, dst); + vec_xst(vout_14, 32*23, dst); + vec_xst(vout_15, 32*23+16, dst); + vec_xst(vout_16, 32*24, dst); + vec_xst(vout_17, 32*24+16, dst); + vec_xst(vout_18, 32*25, dst); + vec_xst(vout_19, 32*25+16, dst); + vec_xst(vout_20, 32*26, dst); + vec_xst(vout_21, 32*26+16, dst); + vec_xst(vout_22, 32*27, dst); + vec_xst(vout_23, 32*27+16, dst); + vec_xst(vout_24, 32*28, dst); + vec_xst(vout_25, 32*28+16, dst); + vec_xst(vout_26, 32*29, dst); + vec_xst(vout_27, 32*29+16, dst); + vec_xst(vout_28, 32*30, dst); + vec_xst(vout_29, 32*30+16, dst); + vec_xst(vout_30, 32*31, dst); + vec_xst(vout_31, 32*31+16, dst); + + +#ifdef DEBUG + for (int y = 0; y < 32; y++) + { + for (int x = 0; x < 32; x++) + { + printf("%d ",dst[y * 32 + x] ); + } + printf("\n"); + } + printf("\n\n"); +#endif +} + + +#define ONE_ANG(log2Size, mode, dest, refPix, filtPix, bLuma)\ +{\ + const int width = 1<< log2Size;\ + pixel *srcPix0 = (g_intraFilterFlags[mode] & width ? 
filtPix : refPix);\ + pixel *dst = dest + ((mode - 2) << (log2Size * 2));\ + srcPix0 = refPix;\ + dst = dest;\ + one_ang_pred_altivec<width, mode>(dst, srcPix0, bLuma);\ +} + + +template<int log2Size> +void all_angs_pred_altivec(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma) +{ + ONE_ANG(log2Size, 2, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 3, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 4, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 5, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 6, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 7, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 8, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 9, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 10, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 11, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 12, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 13, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 14, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 15, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 16, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 17, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 18, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 19, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 20, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 21, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 22, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 23, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 24, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 25, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 26, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 27, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 28, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 29, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 30, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 31, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 32, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 33, dest, refPix, filtPix, bLuma); + ONE_ANG(log2Size, 34, dest, refPix, filtPix, bLuma); + return; +} + +void setupIntraPrimitives_altivec(EncoderPrimitives &p) +{ + for (int i = 2; i < NUM_INTRA_MODE; i++) + { + p.cu[BLOCK_4x4].intra_pred[i] = intra_pred_ang_altivec<4>; + p.cu[BLOCK_8x8].intra_pred[i] = intra_pred_ang_altivec<8>; + p.cu[BLOCK_16x16].intra_pred[i] = intra_pred_ang_altivec<16>; + p.cu[BLOCK_32x32].intra_pred[i] = intra_pred_ang_altivec<32>; + } + + p.cu[BLOCK_4x4].intra_pred_allangs = all_angs_pred_altivec<2>; + p.cu[BLOCK_8x8].intra_pred_allangs = all_angs_pred_altivec<3>; + p.cu[BLOCK_16x16].intra_pred_allangs = all_angs_pred_altivec<4>; + p.cu[BLOCK_32x32].intra_pred_allangs = all_angs_pred_altivec<5>; +} + +} +
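For context on the angular routines above: the in-code comment gives the interpolation as dst[y * dstStride + x] = (pixel)((f32[y] * ref[x] + f[y] * ref[x + 1] + 16) >> 5), and the paired vfrac/vfrac_32 constants (30/2, 28/4, ...) show that f32[y] is simply 32 - f[y]. Below is a minimal scalar sketch of that two-tap weighted average; the helper name and the way frac is passed in are illustrative assumptions, not code from the patch.

    // Scalar sketch of the interpolation the one_ang_pred_altivec<> kernels vectorize.
    // frac corresponds to f[y]; (32 - frac) corresponds to f32[y] in the comment above.
    static inline unsigned char angular_interp(const unsigned char* ref, int x, int frac)
    {
        return (unsigned char)(((32 - frac) * ref[x] + frac * ref[x + 1] + 16) >> 5);
    }

When frac is 0 (as in the vfrac16_15 row above) the expression reduces to ref[x] itself, which is why the last fraction vector in the 16x16 and 32x32 kernels is all zeros.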
View file
x265_2.2.tar.gz/source/common/ppc/ipfilter_altivec.cpp
Added
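The newly added file below provides AltiVec versions of the luma/chroma interpolation filters. As a reading aid, here is a minimal scalar reference for the N = 8 vertical pixel-to-pixel case that interp_vert_pp_altivec vectorizes, reconstructed from the "ORIGINAL" comments in the file (8-tap weighted column sum, add the rounding offset, shift, then clamp to [0, maxVal]); the function name and parameter list here are illustrative assumptions, not code from the patch.

    // Scalar sketch of one output pixel of the 8-tap vertical filter.
    static inline unsigned char vert_filter_8tap_pp(const unsigned char* src, long srcStride,
                                                    const short* c, int offset, int shift,
                                                    unsigned short maxVal)
    {
        int sum = 0;
        for (int k = 0; k < 8; k++)            // sum += src[col + k*srcStride] * c[k]
            sum += src[k * srcStride] * c[k];
        int val = (sum + offset) >> shift;     // round and shift back to pixel range
        if (val < 0)      val = 0;             // clamp low
        if (val > maxVal) val = maxVal;        // clamp high
        return (unsigned char)val;
    }

The AltiVec code below computes sixteen of these column sums per iteration (and, in the hand-tuned version, two rows at a time), but the arithmetic per output pixel is the same.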
@@ -0,0 +1,1522 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Roger Moussalli <rmoussal@us.ibm.com> + * Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include <iostream> +#include "common.h" +#include "primitives.h" +#include "ppccommon.h" + +using namespace X265_NS; + +// ORIGINAL : for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];} +#define multiply_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const pixel * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \ +{ \ + vector unsigned char v_pixel ; \ + vector signed short v_pixel_16_h, v_pixel_16_l ; \ + const vector signed short v_mask_unisgned_8_to_16 = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; \ +\ + /* load the pixels */ \ + v_pixel = vec_xl(src_offset, src) ; \ +\ + /* unpack the 8-bit pixels to 16-bit values (and undo the sign extension) */ \ + v_pixel_16_h = vec_unpackh((vector signed char)v_pixel) ; \ + v_pixel_16_l = vec_unpackl((vector signed char)v_pixel) ; \ + v_pixel_16_h = vec_and(v_pixel_16_h, v_mask_unisgned_8_to_16) ; \ + v_pixel_16_l = vec_and(v_pixel_16_l, v_mask_unisgned_8_to_16) ; \ +\ + /* multiply the pixels by the coefficient */ \ + v_sum_0 = vec_mule(v_pixel_16_h, v_coeff) ; \ + v_sum_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \ + v_sum_2 = vec_mule(v_pixel_16_l, v_coeff) ; \ + v_sum_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \ +} // end multiply_pixel_coeff() + + +// ORIGINAL : for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];} +#define multiply_accumulate_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const pixel * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \ +{ \ + vector unsigned char v_pixel ; \ + vector signed short v_pixel_16_h, v_pixel_16_l ; \ + vector int v_product_int_0, v_product_int_1, v_product_int_2, v_product_int_3 ; \ + const vector signed short v_mask_unisgned_8_to_16 = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; \ +\ + /* ORIGINAL : for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];} */ \ + /* load the pixels */ \ + v_pixel = vec_xl(src_offset, src) ; \ +\ + /* unpack the 8-bit pixels to 16-bit values (and undo the sign extension) */ \ + v_pixel_16_h = vec_unpackh((vector signed char)v_pixel) ; \ + v_pixel_16_l = vec_unpackl((vector signed char)v_pixel) ; \ + v_pixel_16_h = vec_and(v_pixel_16_h, 
v_mask_unisgned_8_to_16) ; \ + v_pixel_16_l = vec_and(v_pixel_16_l, v_mask_unisgned_8_to_16) ; \ +\ + /* multiply the pixels by the coefficient */ \ + v_product_int_0 = vec_mule(v_pixel_16_h, v_coeff) ; \ + v_product_int_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \ + v_product_int_2 = vec_mule(v_pixel_16_l, v_coeff) ; \ + v_product_int_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \ +\ + /* accumulate the results with the sum vectors */ \ + v_sum_0 = vec_add(v_sum_0, v_product_int_0) ; \ + v_sum_1 = vec_add(v_sum_1, v_product_int_1) ; \ + v_sum_2 = vec_add(v_sum_2, v_product_int_2) ; \ + v_sum_3 = vec_add(v_sum_3, v_product_int_3) ; \ +} // end multiply_accumulate_pixel_coeff() + + + +#if 0 +//ORIGINAL +// Works with the following values: +// N = 8 +// width >= 16 (multiple of 16) +// any height +template<int N, int width, int height> +void interp_vert_pp_altivec(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +{ + + + const int16_t* c = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx]; + const int shift = IF_FILTER_PREC; + const int offset = 1 << (shift - 1); + const uint16_t maxVal = (1 << X265_DEPTH) - 1; + + src -= (N / 2 - 1) * srcStride; + + + // Vector to hold replicated shift amount + const vector unsigned int v_shift = {shift, shift, shift, shift} ; + + // Vector to hold replicated offset + const vector int v_offset = {offset, offset, offset, offset} ; + + // Vector to hold replicated maxVal + const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ; + + + // Vector to hold replicated coefficients (one coefficient replicated per vector) + vector signed short v_coeff_0, v_coeff_1, v_coeff_2, v_coeff_3, v_coeff_4, v_coeff_5, v_coeff_6, v_coeff_7 ; + vector signed short v_coefficients = vec_xl(0, c) ; // load all coefficients into one vector + + // Replicate the coefficients into respective vectors + v_coeff_0 = vec_splat(v_coefficients, 0) ; + v_coeff_1 = vec_splat(v_coefficients, 1) ; + v_coeff_2 = vec_splat(v_coefficients, 2) ; + v_coeff_3 = vec_splat(v_coefficients, 3) ; + v_coeff_4 = vec_splat(v_coefficients, 4) ; + v_coeff_5 = vec_splat(v_coefficients, 5) ; + v_coeff_6 = vec_splat(v_coefficients, 6) ; + v_coeff_7 = vec_splat(v_coefficients, 7) ; + + + + int row, ocol, col; + for (row = 0; row < height; row++) + { + for (ocol = 0; ocol < width; ocol+=16) + { + + + // int sum[16] ; + // int16_t val[16] ; + + // --> for(col=0; col<16; col++) {sum[col] = src[ocol+col + 1 * srcStride] * c[0];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 2 * srcStride] * c[2];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 3 * srcStride] * c[3];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 4 * srcStride] * c[4];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 5 * srcStride] * c[5];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 6 * srcStride] * c[6];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 7 * srcStride] * c[7];} + + + vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ; + vector signed short v_val_0, v_val_1 ; + + + + multiply_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol, v_coeff_0) ; + multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 1 * srcStride, v_coeff_1) ; + multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 2 * srcStride, v_coeff_2) ; + 
multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 3 * srcStride, v_coeff_3) ; + multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 4 * srcStride, v_coeff_4) ; + multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 5 * srcStride, v_coeff_5) ; + multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 6 * srcStride, v_coeff_6) ; + multiply_accumulate_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol + 7 * srcStride, v_coeff_7) ; + + + + + + // --> for(col=0; col<16; col++) {val[col] = (int16_t)((sum[col] + offset) >> shift);} + // Add offset + v_sum_0 = vec_add(v_sum_0, v_offset) ; + v_sum_1 = vec_add(v_sum_1, v_offset) ; + v_sum_2 = vec_add(v_sum_2, v_offset) ; + v_sum_3 = vec_add(v_sum_3, v_offset) ; + // Shift right by "shift" + v_sum_0 = vec_sra(v_sum_0, v_shift) ; + v_sum_1 = vec_sra(v_sum_1, v_shift) ; + v_sum_2 = vec_sra(v_sum_2, v_shift) ; + v_sum_3 = vec_sra(v_sum_3, v_shift) ; + + // Pack into 16-bit numbers + v_val_0 = vec_pack(v_sum_0, v_sum_2) ; + v_val_1 = vec_pack(v_sum_1, v_sum_3) ; + + + + // --> for(col=0; col<16; col++) {val[col] = (val[col] < 0) ? 0 : val[col];} + vector bool short v_comp_zero_0, v_comp_zero_1 ; + vector signed short v_max_masked_0, v_max_masked_1 ; + vector signed short zeros16 = {0,0,0,0,0,0,0,0} ; + // Compute less than 0 + v_comp_zero_0 = vec_cmplt(v_val_0, zeros16) ; + v_comp_zero_1 = vec_cmplt(v_val_1, zeros16) ; + // Keep values that are greater or equal to 0 + v_val_0 = vec_andc(v_val_0, v_comp_zero_0) ; + v_val_1 = vec_andc(v_val_1, v_comp_zero_1) ; + + + + // --> for(col=0; col<16; col++) {val[col] = (val[col] > maxVal) ? maxVal : val[col];} + vector bool short v_comp_max_0, v_comp_max_1 ; + // Compute greater than max + v_comp_max_0 = vec_cmpgt(v_val_0, v_maxVal) ; + v_comp_max_1 = vec_cmpgt(v_val_1, v_maxVal) ; + // Replace values greater than maxVal with maxVal + v_val_0 = vec_sel(v_val_0, v_maxVal, v_comp_max_0) ; + v_val_1 = vec_sel(v_val_1, v_maxVal, v_comp_max_1) ; + + + + // --> for(col=0; col<16; col++) {dst[ocol+col] = (pixel)val[col];} + // Pack the vals into 8-bit numbers + // but also re-ordering them - side effect of mule and mulo + vector unsigned char v_result ; + vector unsigned char v_perm_index = {0x00, 0x10, 0x02, 0x12, 0x04, 0x14, 0x06, 0x16, 0x08 ,0x18, 0x0A, 0x1A, 0x0C, 0x1C, 0x0E, 0x1E} ; + v_result = (vector unsigned char)vec_perm(v_val_0, v_val_1, v_perm_index) ; + // Store the results back to dst[] + vec_xst(v_result, ocol, (unsigned char *)dst) ; + } + + src += srcStride; + dst += dstStride; + } +} // end interp_vert_pp_altivec() +#else +// Works with the following values: +// N = 8 +// width >= 16 (multiple of 16) +// any height +template<int N, int width, int height> +void interp_vert_pp_altivec(const pixel* __restrict__ src, intptr_t srcStride, pixel* __restrict__ dst, intptr_t dstStride, int coeffIdx) +{ + const int16_t* __restrict__ c = (N == 4) ? 
g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx]; + int shift = IF_FILTER_PREC; + int offset = 1 << (shift - 1); + uint16_t maxVal = (1 << X265_DEPTH) - 1; + + src -= (N / 2 - 1) * srcStride; + + vector signed short vcoeff0 = vec_splats(c[0]); + vector signed short vcoeff1 = vec_splats(c[1]); + vector signed short vcoeff2 = vec_splats(c[2]); + vector signed short vcoeff3 = vec_splats(c[3]); + vector signed short vcoeff4 = vec_splats(c[4]); + vector signed short vcoeff5 = vec_splats(c[5]); + vector signed short vcoeff6 = vec_splats(c[6]); + vector signed short vcoeff7 = vec_splats(c[7]); + vector signed short voffset = vec_splats((short)offset); + vector signed short vshift = vec_splats((short)shift); + vector signed short vmaxVal = vec_splats((short)maxVal); + vector signed short vzero_s16 = vec_splats( (signed short)0u);; + vector signed int vzero_s32 = vec_splats( (signed int)0u); + vector unsigned char vzero_u8 = vec_splats( (unsigned char)0u ); + vector unsigned char vchar_to_short_maskH = {24, 0, 25, 0, 26, 0, 27, 0, 28, 0, 29, 0, 30, 0, 31, 0}; + vector unsigned char vchar_to_short_maskL = {16, 0, 17, 0 ,18, 0, 19, 0, 20, 0, 21, 0, 22, 0, 23, 0}; + + vector signed short vsrcH, vsrcL, vsumH, vsumL; + vector unsigned char vsrc; + + vector signed short vsrc2H, vsrc2L, vsum2H, vsum2L; + vector unsigned char vsrc2; + + const pixel* __restrict__ src2 = src+srcStride; + pixel* __restrict__ dst2 = dst+dstStride; + + int row, col; + for (row = 0; row < height; row+=2) + { + for (col = 0; col < width; col+=16) + { + vsrc = vec_xl(0, (unsigned char*)&src[col + 0*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH = vsrcH * vcoeff0; + vsumL = vsrcL * vcoeff0; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 1*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff1; + vsumL += vsrcL * vcoeff1; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 2*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff2; + vsumL += vsrcL * vcoeff2; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 3*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff3; + vsumL += vsrcL * vcoeff3; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 4*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff4; + vsumL += vsrcL * vcoeff4; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 5*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff5; + vsumL += vsrcL * vcoeff5; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 6*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff6; + vsumL += vsrcL * vcoeff6; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 
7*srcStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff7; + vsumL += vsrcL * vcoeff7; + + vector short vvalH = (vsumH + voffset) >> vshift; + vvalH = vec_max( vvalH, vzero_s16 ); + vvalH = vec_min( vvalH, vmaxVal ); + + vector short vvalL = (vsumL + voffset) >> vshift; + vvalL = vec_max( vvalL, vzero_s16 ); + vvalL = vec_min( vvalL, vmaxVal ); + + vector signed char vdst = vec_pack( vvalL, vvalH ); + vec_xst( vdst, 0, (signed char*)&dst[col] ); + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 0*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H = vsrc2H * vcoeff0; + vsum2L = vsrc2L * vcoeff0; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 1*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff1; + vsum2L += vsrc2L * vcoeff1; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 2*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff2; + vsum2L += vsrc2L * vcoeff2; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 3*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff3; + vsum2L += vsrc2L * vcoeff3; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 4*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff4; + vsum2L += vsrc2L * vcoeff4; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 5*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff5; + vsum2L += vsrc2L * vcoeff5; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 6*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff6; + vsum2L += vsrc2L * vcoeff6; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 7*srcStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff7; + vsum2L += vsrc2L * vcoeff7; + + vector short vval2H = (vsum2H + voffset) >> vshift; + vval2H = vec_max( vval2H, vzero_s16 ); + vval2H = vec_min( vval2H, vmaxVal ); + + vector short vval2L = (vsum2L + voffset) >> vshift; + vval2L = vec_max( vval2L, vzero_s16 ); + vval2L = vec_min( vval2L, vmaxVal ); + + vector signed char vdst2 = vec_pack( vval2L, vval2H ); + vec_xst( vdst2, 0, (signed char*)&dst2[col] ); + } + + src += 2*srcStride; + dst += 2*dstStride; + src2 += 2*srcStride; + dst2 += 2*dstStride; + } +} +#endif + + +// ORIGINAL : for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];} +#define multiply_sp_pixel_coeff(/*vector 
int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const int16_t * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \ +{ \ + vector signed short v_pixel_16_h, v_pixel_16_l ; \ +\ + /* load the pixels */ \ + v_pixel_16_h = vec_xl(src_offset, src) ; \ + v_pixel_16_l = vec_xl(src_offset + 16, src) ; \ +\ + /* multiply the pixels by the coefficient */ \ + v_sum_0 = vec_mule(v_pixel_16_h, v_coeff) ; \ + v_sum_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \ + v_sum_2 = vec_mule(v_pixel_16_l, v_coeff) ; \ + v_sum_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \ +\ +} // end multiply_pixel_coeff() + + +// ORIGINAL : for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];} +#define multiply_accumulate_sp_pixel_coeff(/*vector int*/ v_sum_0, /*vector int*/ v_sum_1, /*vector int*/ v_sum_2, /*vector int*/ v_sum_3, /*const pixel * */ src, /*int*/ src_offset, /*vector signed short*/ v_coeff) \ +{ \ + vector signed short v_pixel_16_h, v_pixel_16_l ; \ + vector int v_product_int_0, v_product_int_1, v_product_int_2, v_product_int_3 ; \ +\ + /* ORIGINAL : for(col=0; col<16; col++) {sum[col] = src[ocol+col + 0 * srcStride] * c[0];} */ \ +\ + /* load the pixels */ \ + v_pixel_16_h = vec_xl(src_offset, src) ; \ + v_pixel_16_l = vec_xl(src_offset + 16, src) ; \ +\ + /* multiply the pixels by the coefficient */ \ + v_product_int_0 = vec_mule(v_pixel_16_h, v_coeff) ; \ + v_product_int_1 = vec_mulo(v_pixel_16_h, v_coeff) ; \ + v_product_int_2 = vec_mule(v_pixel_16_l, v_coeff) ; \ + v_product_int_3 = vec_mulo(v_pixel_16_l, v_coeff) ; \ +\ + /* accumulate the results with the sum vectors */ \ + v_sum_0 = vec_add(v_sum_0, v_product_int_0) ; \ + v_sum_1 = vec_add(v_sum_1, v_product_int_1) ; \ + v_sum_2 = vec_add(v_sum_2, v_product_int_2) ; \ + v_sum_3 = vec_add(v_sum_3, v_product_int_3) ; \ +\ +} // end multiply_accumulate_pixel_coeff() + + +// Works with the following values: +// N = 8 +// width >= 16 (multiple of 16) +// any height +template <int N, int width, int height> +void filterVertical_sp_altivec(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +{ + int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + unsigned int shift = IF_FILTER_PREC + headRoom; + int offset = (1 << (shift - 1)) + (IF_INTERNAL_OFFS << IF_FILTER_PREC); + const uint16_t maxVal = (1 << X265_DEPTH) - 1; + const int16_t* coeff = (N == 8 ? 
g_lumaFilter[coeffIdx] : g_chromaFilter[coeffIdx]); + + src -= (N / 2 - 1) * srcStride; + + + // Vector to hold replicated shift amount + const vector unsigned int v_shift = {shift, shift, shift, shift} ; + + // Vector to hold replicated offset + const vector int v_offset = {offset, offset, offset, offset} ; + + // Vector to hold replicated maxVal + const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ; + + + // Vector to hold replicated coefficients (one coefficient replicated per vector) + vector signed short v_coeff_0, v_coeff_1, v_coeff_2, v_coeff_3, v_coeff_4, v_coeff_5, v_coeff_6, v_coeff_7 ; + vector signed short v_coefficients = vec_xl(0, coeff) ; // load all coefficients into one vector + + // Replicate the coefficients into respective vectors + v_coeff_0 = vec_splat(v_coefficients, 0) ; + v_coeff_1 = vec_splat(v_coefficients, 1) ; + v_coeff_2 = vec_splat(v_coefficients, 2) ; + v_coeff_3 = vec_splat(v_coefficients, 3) ; + v_coeff_4 = vec_splat(v_coefficients, 4) ; + v_coeff_5 = vec_splat(v_coefficients, 5) ; + v_coeff_6 = vec_splat(v_coefficients, 6) ; + v_coeff_7 = vec_splat(v_coefficients, 7) ; + + + + int row, ocol, col; + for (row = 0; row < height; row++) + { + for (ocol = 0; ocol < width; ocol+= 16 ) + { + + // int sum[16] ; + // int16_t val[16] ; + + // --> for(col=0; col<16; col++) {sum[col] = src[ocol+col + 1 * srcStride] * c[0];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 1 * srcStride] * c[1];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 2 * srcStride] * c[2];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 3 * srcStride] * c[3];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 4 * srcStride] * c[4];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 5 * srcStride] * c[5];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 6 * srcStride] * c[6];} + // --> for(col=0; col<16; col++) {sum[col] += src[ocol+col + 7 * srcStride] * c[7];} + + + vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ; + vector signed short v_val_0, v_val_1 ; + + + // Added a factor of 2 to the offset since this is a BYTE offset, and each input pixel is of size 2Bytes + multiply_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, ocol * 2, v_coeff_0) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 1 * srcStride) * 2, v_coeff_1) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 2 * srcStride) * 2, v_coeff_2) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 3 * srcStride) * 2, v_coeff_3) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 4 * srcStride) * 2, v_coeff_4) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 5 * srcStride) * 2, v_coeff_5) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 6 * srcStride) * 2, v_coeff_6) ; + multiply_accumulate_sp_pixel_coeff(v_sum_0, v_sum_1, v_sum_2, v_sum_3, src, (ocol + 7 * srcStride) * 2, v_coeff_7) ; + + + + + + // --> for(col=0; col<16; col++) {val[col] = (int16_t)((sum[col] + offset) >> shift);} + // Add offset + v_sum_0 = vec_add(v_sum_0, v_offset) ; + v_sum_1 = vec_add(v_sum_1, v_offset) ; + v_sum_2 = vec_add(v_sum_2, v_offset) ; + v_sum_3 = vec_add(v_sum_3, v_offset) ; + // Shift right by "shift" + v_sum_0 = vec_sra(v_sum_0, v_shift) ; + v_sum_1 = vec_sra(v_sum_1, v_shift) ; + 
v_sum_2 = vec_sra(v_sum_2, v_shift) ; + v_sum_3 = vec_sra(v_sum_3, v_shift) ; + + // Pack into 16-bit numbers + v_val_0 = vec_pack(v_sum_0, v_sum_2) ; + v_val_1 = vec_pack(v_sum_1, v_sum_3) ; + + + + // --> for(col=0; col<16; col++) {val[col] = (val[col] < 0) ? 0 : val[col];} + vector bool short v_comp_zero_0, v_comp_zero_1 ; + vector signed short v_max_masked_0, v_max_masked_1 ; + vector signed short zeros16 = {0,0,0,0,0,0,0,0} ; + // Compute less than 0 + v_comp_zero_0 = vec_cmplt(v_val_0, zeros16) ; + v_comp_zero_1 = vec_cmplt(v_val_1, zeros16) ; + // Keep values that are greater or equal to 0 + v_val_0 = vec_andc(v_val_0, v_comp_zero_0) ; + v_val_1 = vec_andc(v_val_1, v_comp_zero_1) ; + + + + // --> for(col=0; col<16; col++) {val[col] = (val[col] > maxVal) ? maxVal : val[col];} + vector bool short v_comp_max_0, v_comp_max_1 ; + // Compute greater than max + v_comp_max_0 = vec_cmpgt(v_val_0, v_maxVal) ; + v_comp_max_1 = vec_cmpgt(v_val_1, v_maxVal) ; + // Replace values greater than maxVal with maxVal + v_val_0 = vec_sel(v_val_0, v_maxVal, v_comp_max_0) ; + v_val_1 = vec_sel(v_val_1, v_maxVal, v_comp_max_1) ; + + + + // --> for(col=0; col<16; col++) {dst[ocol+col] = (pixel)val[col];} + // Pack the vals into 8-bit numbers + // but also re-ordering them - side effect of mule and mulo + vector unsigned char v_result ; + vector unsigned char v_perm_index = {0x00, 0x10, 0x02, 0x12, 0x04, 0x14, 0x06, 0x16, 0x08 ,0x18, 0x0A, 0x1A, 0x0C, 0x1C, 0x0E, 0x1E} ; + v_result = (vector unsigned char)vec_perm(v_val_0, v_val_1, v_perm_index) ; + // Store the results back to dst[] + vec_xst(v_result, ocol, (unsigned char *)dst) ; + } + + src += srcStride; + dst += dstStride; + } +} // end filterVertical_sp_altivec() + + + + + +// Works with the following values: +// N = 8 +// width >= 32 (multiple of 32) +// any height +template <int N, int width, int height> +void interp_horiz_ps_altivec(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +{ + + const int16_t* coeff = (N == 4) ? 
g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx]; + int headRoom = IF_INTERNAL_PREC - X265_DEPTH; + unsigned int shift = IF_FILTER_PREC - headRoom; + int offset = -IF_INTERNAL_OFFS << shift; + int blkheight = height; + + src -= N / 2 - 1; + + if (isRowExt) + { + src -= (N / 2 - 1) * srcStride; + blkheight += N - 1; + } + + + vector signed short v_coeff ; + v_coeff = vec_xl(0, coeff) ; + + + vector unsigned char v_pixel_char_0, v_pixel_char_1, v_pixel_char_2 ; + vector signed short v_pixel_short_0, v_pixel_short_1, v_pixel_short_2, v_pixel_short_3, v_pixel_short_4 ; + const vector signed short v_mask_unisgned_char_to_short = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; \ + const vector signed int v_zeros_int = {0, 0, 0, 0} ; + const vector signed short v_zeros_short = {0, 0, 0, 0, 0, 0, 0, 0} ; + + vector signed int v_product_0_0, v_product_0_1 ; + vector signed int v_product_1_0, v_product_1_1 ; + vector signed int v_product_2_0, v_product_2_1 ; + vector signed int v_product_3_0, v_product_3_1 ; + + vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ; + + vector signed int v_sums_temp_col0, v_sums_temp_col1, v_sums_temp_col2, v_sums_temp_col3 ; + vector signed int v_sums_col0_0, v_sums_col0_1 ; + vector signed int v_sums_col1_0, v_sums_col1_1 ; + vector signed int v_sums_col2_0, v_sums_col2_1 ; + vector signed int v_sums_col3_0, v_sums_col3_1 ; + + + const vector signed int v_offset = {offset, offset, offset, offset}; + const vector unsigned int v_shift = {shift, shift, shift, shift} ; + + + vector unsigned char v_sums_shamt = {0x20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ; + + + + pixel *next_src ; + int16_t *next_dst ; + + int row, col; + for (row = 0; row < blkheight; row++) + { + next_src = (pixel *)src + srcStride ; + next_dst = (int16_t *)dst + dstStride ; + + for(int col_iter=0; col_iter<width; col_iter+=32) + { + // Load a full row of pixels (32 + 7) + v_pixel_char_0 = vec_xl(0, src) ; + v_pixel_char_1 = vec_xl(16, src) ; + v_pixel_char_2 = vec_xl(32, src) ; + + + v_sums_temp_col0 = v_zeros_int ; + v_sums_temp_col1 = v_zeros_int ; + v_sums_temp_col2 = v_zeros_int ; + v_sums_temp_col3 = v_zeros_int ; + + + // Expand the loaded pixels into shorts + v_pixel_short_0 = vec_unpackh((vector signed char)v_pixel_char_0) ; + v_pixel_short_1 = vec_unpackl((vector signed char)v_pixel_char_0) ; + v_pixel_short_2 = vec_unpackh((vector signed char)v_pixel_char_1) ; + v_pixel_short_3 = vec_unpackl((vector signed char)v_pixel_char_1) ; + v_pixel_short_4 = vec_unpackh((vector signed char)v_pixel_char_2) ; + + v_pixel_short_0 = vec_and(v_pixel_short_0, v_mask_unisgned_char_to_short) ; + v_pixel_short_1 = vec_and(v_pixel_short_1, v_mask_unisgned_char_to_short) ; + v_pixel_short_2 = vec_and(v_pixel_short_2, v_mask_unisgned_char_to_short) ; + v_pixel_short_3 = vec_and(v_pixel_short_3, v_mask_unisgned_char_to_short) ; + v_pixel_short_4 = vec_and(v_pixel_short_4, v_mask_unisgned_char_to_short) ; + + + + // Four colum sets are processed below + // One colum per set per iteration + for(col=0; col < 8; col++) + { + + // Multiply the pixels by the coefficients + v_product_0_0 = vec_mule(v_pixel_short_0, v_coeff) ; + v_product_0_1 = vec_mulo(v_pixel_short_0, v_coeff) ; + + v_product_1_0 = vec_mule(v_pixel_short_1, v_coeff) ; + v_product_1_1 = vec_mulo(v_pixel_short_1, v_coeff) ; + + v_product_2_0 = vec_mule(v_pixel_short_2, v_coeff) ; + v_product_2_1 = vec_mulo(v_pixel_short_2, v_coeff) ; + + v_product_3_0 = vec_mule(v_pixel_short_3, v_coeff) ; + v_product_3_1 = 
vec_mulo(v_pixel_short_3, v_coeff) ; + + + // Sum up the multiplication results + v_sum_0 = vec_add(v_product_0_0, v_product_0_1) ; + v_sum_0 = vec_sums(v_sum_0, v_zeros_int) ; + + v_sum_1 = vec_add(v_product_1_0, v_product_1_1) ; + v_sum_1 = vec_sums(v_sum_1, v_zeros_int) ; + + v_sum_2 = vec_add(v_product_2_0, v_product_2_1) ; + v_sum_2 = vec_sums(v_sum_2, v_zeros_int) ; + + v_sum_3 = vec_add(v_product_3_0, v_product_3_1) ; + v_sum_3 = vec_sums(v_sum_3, v_zeros_int) ; + + + // Insert the sum results into respective vectors + v_sums_temp_col0 = vec_sro(v_sums_temp_col0, v_sums_shamt) ; + v_sums_temp_col0 = vec_or(v_sum_0, v_sums_temp_col0) ; + + v_sums_temp_col1 = vec_sro(v_sums_temp_col1, v_sums_shamt) ; + v_sums_temp_col1 = vec_or(v_sum_1, v_sums_temp_col1) ; + + v_sums_temp_col2 = vec_sro(v_sums_temp_col2, v_sums_shamt) ; + v_sums_temp_col2 = vec_or(v_sum_2, v_sums_temp_col2) ; + + v_sums_temp_col3 = vec_sro(v_sums_temp_col3, v_sums_shamt) ; + v_sums_temp_col3 = vec_or(v_sum_3, v_sums_temp_col3) ; + + + if(col == 3) + { + v_sums_col0_0 = v_sums_temp_col0 ; + v_sums_col1_0 = v_sums_temp_col1 ; + v_sums_col2_0 = v_sums_temp_col2 ; + v_sums_col3_0 = v_sums_temp_col3 ; + + v_sums_temp_col0 = v_zeros_int ; + v_sums_temp_col1 = v_zeros_int ; + v_sums_temp_col2 = v_zeros_int ; + v_sums_temp_col3 = v_zeros_int ; + } + + + // Shift the pixels by 1 (short pixel) + v_pixel_short_0 = vec_sld(v_pixel_short_1, v_pixel_short_0, 14) ; + v_pixel_short_1 = vec_sld(v_pixel_short_2, v_pixel_short_1, 14) ; + v_pixel_short_2 = vec_sld(v_pixel_short_3, v_pixel_short_2, 14) ; + v_pixel_short_3 = vec_sld(v_pixel_short_4, v_pixel_short_3, 14) ; + const vector unsigned char v_shift_right_two_bytes_shamt = {0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ; + v_pixel_short_4 = vec_sro(v_pixel_short_4, v_shift_right_two_bytes_shamt) ; + } + + // Copy the sums result to the second vector (per colum) + v_sums_col0_1 = v_sums_temp_col0 ; + v_sums_col1_1 = v_sums_temp_col1 ; + v_sums_col2_1 = v_sums_temp_col2 ; + v_sums_col3_1 = v_sums_temp_col3 ; + + + + // Post processing and eventually 2 stores + // Original code: + // int16_t val = (int16_t)((sum + offset) >> shift); + // dst[col] = val; + + + v_sums_col0_0 = vec_sra(vec_add(v_sums_col0_0, v_offset), v_shift) ; + v_sums_col0_1 = vec_sra(vec_add(v_sums_col0_1, v_offset), v_shift) ; + v_sums_col1_0 = vec_sra(vec_add(v_sums_col1_0, v_offset), v_shift) ; + v_sums_col1_1 = vec_sra(vec_add(v_sums_col1_1, v_offset), v_shift) ; + v_sums_col2_0 = vec_sra(vec_add(v_sums_col2_0, v_offset), v_shift) ; + v_sums_col2_1 = vec_sra(vec_add(v_sums_col2_1, v_offset), v_shift) ; + v_sums_col3_0 = vec_sra(vec_add(v_sums_col3_0, v_offset), v_shift) ; + v_sums_col3_1 = vec_sra(vec_add(v_sums_col3_1, v_offset), v_shift) ; + + + vector signed short v_val_col0, v_val_col1, v_val_col2, v_val_col3 ; + v_val_col0 = vec_pack(v_sums_col0_0, v_sums_col0_1) ; + v_val_col1 = vec_pack(v_sums_col1_0, v_sums_col1_1) ; + v_val_col2 = vec_pack(v_sums_col2_0, v_sums_col2_1) ; + v_val_col3 = vec_pack(v_sums_col3_0, v_sums_col3_1) ; + + + + // Store results + vec_xst(v_val_col0, 0, dst) ; + vec_xst(v_val_col1, 16, dst) ; + vec_xst(v_val_col2, 32, dst) ; + vec_xst(v_val_col3, 48, dst) ; + + src += 32 ; + dst += 32 ; + + } // end for col_iter + + src = next_src ; + dst = next_dst ; + } +} // interp_horiz_ps_altivec () + + + +// Works with the following values: +// N = 8 +// width >= 32 (multiple of 32) +// any height +template <int N, int width, int height> +void interp_hv_pp_altivec(const pixel* src, 
intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY) +{ + + short immedVals[(64 + 8) * (64 + 8)]; + + interp_horiz_ps_altivec<N, width, height>(src, srcStride, immedVals, width, idxX, 1); + + //!!filterVertical_sp_c<N>(immedVals + 3 * width, width, dst, dstStride, width, height, idxY); + filterVertical_sp_altivec<N,width,height>(immedVals + 3 * width, width, dst, dstStride, idxY); +} + +//ORIGINAL +#if 0 +// Works with the following values: +// N = 8 +// width >= 32 (multiple of 32) +// any height +template <int N, int width, int height> +void interp_horiz_pp_altivec(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +{ + + const int16_t* coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx]; + int headRoom = IF_FILTER_PREC; + int offset = (1 << (headRoom - 1)); + uint16_t maxVal = (1 << X265_DEPTH) - 1; + int cStride = 1; + + src -= (N / 2 - 1) * cStride; + + + vector signed short v_coeff ; + v_coeff = vec_xl(0, coeff) ; + + + vector unsigned char v_pixel_char_0, v_pixel_char_1, v_pixel_char_2 ; + vector signed short v_pixel_short_0, v_pixel_short_1, v_pixel_short_2, v_pixel_short_3, v_pixel_short_4 ; + const vector signed short v_mask_unisgned_char_to_short = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; \ + const vector signed int v_zeros_int = {0, 0, 0, 0} ; + const vector signed short v_zeros_short = {0, 0, 0, 0, 0, 0, 0, 0} ; + + vector signed int v_product_0_0, v_product_0_1 ; + vector signed int v_product_1_0, v_product_1_1 ; + vector signed int v_product_2_0, v_product_2_1 ; + vector signed int v_product_3_0, v_product_3_1 ; + + vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ; + + vector signed int v_sums_temp_col0, v_sums_temp_col1, v_sums_temp_col2, v_sums_temp_col3 ; + vector signed int v_sums_col0_0, v_sums_col0_1 ; + vector signed int v_sums_col1_0, v_sums_col1_1 ; + vector signed int v_sums_col2_0, v_sums_col2_1 ; + vector signed int v_sums_col3_0, v_sums_col3_1 ; + + + const vector signed int v_offset = {offset, offset, offset, offset}; + const vector unsigned int v_headRoom = {headRoom, headRoom, headRoom, headRoom} ; + + + vector unsigned char v_sums_shamt = {0x20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ; + + + pixel *next_src ; + pixel *next_dst ; + + int row, col; + for (row = 0; row < height; row++) + { + next_src = (pixel *)src + srcStride ; + next_dst = (pixel *)dst + dstStride ; + + for(int col_iter=0; col_iter<width; col_iter+=32) + { + + // Load a full row of pixels (32 + 7) + v_pixel_char_0 = vec_xl(0, src) ; + v_pixel_char_1 = vec_xl(16, src) ; + v_pixel_char_2 = vec_xl(32, src) ; + + + v_sums_temp_col0 = v_zeros_int ; + v_sums_temp_col1 = v_zeros_int ; + v_sums_temp_col2 = v_zeros_int ; + v_sums_temp_col3 = v_zeros_int ; + + + // Expand the loaded pixels into shorts + v_pixel_short_0 = vec_unpackh((vector signed char)v_pixel_char_0) ; + v_pixel_short_1 = vec_unpackl((vector signed char)v_pixel_char_0) ; + v_pixel_short_2 = vec_unpackh((vector signed char)v_pixel_char_1) ; + v_pixel_short_3 = vec_unpackl((vector signed char)v_pixel_char_1) ; + v_pixel_short_4 = vec_unpackh((vector signed char)v_pixel_char_2) ; + + v_pixel_short_0 = vec_and(v_pixel_short_0, v_mask_unisgned_char_to_short) ; + v_pixel_short_1 = vec_and(v_pixel_short_1, v_mask_unisgned_char_to_short) ; + v_pixel_short_2 = vec_and(v_pixel_short_2, v_mask_unisgned_char_to_short) ; + v_pixel_short_3 = vec_and(v_pixel_short_3, v_mask_unisgned_char_to_short) ; + v_pixel_short_4 = vec_and(v_pixel_short_4, 
v_mask_unisgned_char_to_short) ; + + + + // Four colum sets are processed below + // One colum per set per iteration + for(col=0; col < 8; col++) + { + + // Multiply the pixels by the coefficients + v_product_0_0 = vec_mule(v_pixel_short_0, v_coeff) ; + v_product_0_1 = vec_mulo(v_pixel_short_0, v_coeff) ; + + v_product_1_0 = vec_mule(v_pixel_short_1, v_coeff) ; + v_product_1_1 = vec_mulo(v_pixel_short_1, v_coeff) ; + + v_product_2_0 = vec_mule(v_pixel_short_2, v_coeff) ; + v_product_2_1 = vec_mulo(v_pixel_short_2, v_coeff) ; + + v_product_3_0 = vec_mule(v_pixel_short_3, v_coeff) ; + v_product_3_1 = vec_mulo(v_pixel_short_3, v_coeff) ; + + + // Sum up the multiplication results + v_sum_0 = vec_add(v_product_0_0, v_product_0_1) ; + v_sum_0 = vec_sums(v_sum_0, v_zeros_int) ; + + v_sum_1 = vec_add(v_product_1_0, v_product_1_1) ; + v_sum_1 = vec_sums(v_sum_1, v_zeros_int) ; + + v_sum_2 = vec_add(v_product_2_0, v_product_2_1) ; + v_sum_2 = vec_sums(v_sum_2, v_zeros_int) ; + + v_sum_3 = vec_add(v_product_3_0, v_product_3_1) ; + v_sum_3 = vec_sums(v_sum_3, v_zeros_int) ; + + + // Insert the sum results into respective vectors + v_sums_temp_col0 = vec_sro(v_sums_temp_col0, v_sums_shamt) ; + v_sums_temp_col0 = vec_or(v_sum_0, v_sums_temp_col0) ; + + v_sums_temp_col1 = vec_sro(v_sums_temp_col1, v_sums_shamt) ; + v_sums_temp_col1 = vec_or(v_sum_1, v_sums_temp_col1) ; + + v_sums_temp_col2 = vec_sro(v_sums_temp_col2, v_sums_shamt) ; + v_sums_temp_col2 = vec_or(v_sum_2, v_sums_temp_col2) ; + + v_sums_temp_col3 = vec_sro(v_sums_temp_col3, v_sums_shamt) ; + v_sums_temp_col3 = vec_or(v_sum_3, v_sums_temp_col3) ; + + + if(col == 3) + { + v_sums_col0_0 = v_sums_temp_col0 ; + v_sums_col1_0 = v_sums_temp_col1 ; + v_sums_col2_0 = v_sums_temp_col2 ; + v_sums_col3_0 = v_sums_temp_col3 ; + + v_sums_temp_col0 = v_zeros_int ; + v_sums_temp_col1 = v_zeros_int ; + v_sums_temp_col2 = v_zeros_int ; + v_sums_temp_col3 = v_zeros_int ; + } + + + // Shift the pixels by 1 (short pixel) + v_pixel_short_0 = vec_sld(v_pixel_short_1, v_pixel_short_0, 14) ; + v_pixel_short_1 = vec_sld(v_pixel_short_2, v_pixel_short_1, 14) ; + v_pixel_short_2 = vec_sld(v_pixel_short_3, v_pixel_short_2, 14) ; + v_pixel_short_3 = vec_sld(v_pixel_short_4, v_pixel_short_3, 14) ; + const vector unsigned char v_shift_right_two_bytes_shamt = {0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ; + v_pixel_short_4 = vec_sro(v_pixel_short_4, v_shift_right_two_bytes_shamt) ; + } + + // Copy the sums result to the second vector (per colum) + v_sums_col0_1 = v_sums_temp_col0 ; + v_sums_col1_1 = v_sums_temp_col1 ; + v_sums_col2_1 = v_sums_temp_col2 ; + v_sums_col3_1 = v_sums_temp_col3 ; + + + + // Post processing and eventually 2 stores + // Original code: + // int16_t val = (int16_t)((sum + offset) >> headRoom); + // if (val < 0) val = 0; + // if (val > maxVal) val = maxVal; + // dst[col] = (pixel)val; + + + v_sums_col0_0 = vec_sra(vec_add(v_sums_col0_0, v_offset), v_headRoom) ; + v_sums_col0_1 = vec_sra(vec_add(v_sums_col0_1, v_offset), v_headRoom) ; + v_sums_col1_0 = vec_sra(vec_add(v_sums_col1_0, v_offset), v_headRoom) ; + v_sums_col1_1 = vec_sra(vec_add(v_sums_col1_1, v_offset), v_headRoom) ; + v_sums_col2_0 = vec_sra(vec_add(v_sums_col2_0, v_offset), v_headRoom) ; + v_sums_col2_1 = vec_sra(vec_add(v_sums_col2_1, v_offset), v_headRoom) ; + v_sums_col3_0 = vec_sra(vec_add(v_sums_col3_0, v_offset), v_headRoom) ; + v_sums_col3_1 = vec_sra(vec_add(v_sums_col3_1, v_offset), v_headRoom) ; + + + vector signed short v_val_col0, v_val_col1, v_val_col2, v_val_col3 ; 
+ v_val_col0 = vec_pack(v_sums_col0_0, v_sums_col0_1) ; + v_val_col1 = vec_pack(v_sums_col1_0, v_sums_col1_1) ; + v_val_col2 = vec_pack(v_sums_col2_0, v_sums_col2_1) ; + v_val_col3 = vec_pack(v_sums_col3_0, v_sums_col3_1) ; + + + // if (val < 0) val = 0; + vector bool short v_comp_zero_col0, v_comp_zero_col1, v_comp_zero_col2, v_comp_zero_col3 ; + // Compute less than 0 + v_comp_zero_col0 = vec_cmplt(v_val_col0, v_zeros_short) ; + v_comp_zero_col1 = vec_cmplt(v_val_col1, v_zeros_short) ; + v_comp_zero_col2 = vec_cmplt(v_val_col2, v_zeros_short) ; + v_comp_zero_col3 = vec_cmplt(v_val_col3, v_zeros_short) ; + // Keep values that are greater or equal to 0 + v_val_col0 = vec_andc(v_val_col0, v_comp_zero_col0) ; + v_val_col1 = vec_andc(v_val_col1, v_comp_zero_col1) ; + v_val_col2 = vec_andc(v_val_col2, v_comp_zero_col2) ; + v_val_col3 = vec_andc(v_val_col3, v_comp_zero_col3) ; + + + // if (val > maxVal) val = maxVal; + vector bool short v_comp_max_col0, v_comp_max_col1, v_comp_max_col2, v_comp_max_col3 ; + const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ; + // Compute greater than max + v_comp_max_col0 = vec_cmpgt(v_val_col0, v_maxVal) ; + v_comp_max_col1 = vec_cmpgt(v_val_col1, v_maxVal) ; + v_comp_max_col2 = vec_cmpgt(v_val_col2, v_maxVal) ; + v_comp_max_col3 = vec_cmpgt(v_val_col3, v_maxVal) ; + // Replace values greater than maxVal with maxVal + v_val_col0 = vec_sel(v_val_col0, v_maxVal, v_comp_max_col0) ; + v_val_col1 = vec_sel(v_val_col1, v_maxVal, v_comp_max_col1) ; + v_val_col2 = vec_sel(v_val_col2, v_maxVal, v_comp_max_col2) ; + v_val_col3 = vec_sel(v_val_col3, v_maxVal, v_comp_max_col3) ; + + // (pixel)val + vector unsigned char v_final_result_0, v_final_result_1 ; + v_final_result_0 = vec_pack((vector unsigned short)v_val_col0, (vector unsigned short)v_val_col1) ; + v_final_result_1 = vec_pack((vector unsigned short)v_val_col2, (vector unsigned short)v_val_col3) ; + + + + // Store results + vec_xst(v_final_result_0, 0, dst) ; + vec_xst(v_final_result_1, 16, dst) ; + + + src += 32 ; + dst += 32 ; + + } // end for col_iter + + + src = next_src ; + dst = next_dst ; + } +} // interp_horiz_pp_altivec() +#else +template<int N, int width, int height> +void interp_horiz_pp_altivec(const pixel* __restrict__ src, intptr_t srcStride, pixel* __restrict__ dst, intptr_t dstStride, int coeffIdx) +{ + const int16_t* __restrict__ coeff = (N == 4) ? 
g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx]; + int headRoom = IF_FILTER_PREC; + int offset = (1 << (headRoom - 1)); + uint16_t maxVal = (1 << X265_DEPTH) - 1; + int cStride = 1; + + src -= (N / 2 - 1) * cStride; + + vector signed short vcoeff0 = vec_splats(coeff[0]); + vector signed short vcoeff1 = vec_splats(coeff[1]); + vector signed short vcoeff2 = vec_splats(coeff[2]); + vector signed short vcoeff3 = vec_splats(coeff[3]); + vector signed short vcoeff4 = vec_splats(coeff[4]); + vector signed short vcoeff5 = vec_splats(coeff[5]); + vector signed short vcoeff6 = vec_splats(coeff[6]); + vector signed short vcoeff7 = vec_splats(coeff[7]); + vector signed short voffset = vec_splats((short)offset); + vector signed short vheadRoom = vec_splats((short)headRoom); + vector signed short vmaxVal = vec_splats((short)maxVal); + vector signed short vzero_s16 = vec_splats( (signed short)0u);; + vector signed int vzero_s32 = vec_splats( (signed int)0u); + vector unsigned char vzero_u8 = vec_splats( (unsigned char)0u ); + + vector signed short vsrcH, vsrcL, vsumH, vsumL; + vector unsigned char vsrc; + + vector signed short vsrc2H, vsrc2L, vsum2H, vsum2L; + vector unsigned char vsrc2; + + vector unsigned char vchar_to_short_maskH = {24, 0, 25, 0, 26, 0, 27, 0, 28, 0, 29, 0, 30, 0, 31, 0}; + vector unsigned char vchar_to_short_maskL = {16, 0, 17, 0 ,18, 0, 19, 0, 20, 0, 21, 0, 22, 0, 23, 0}; + + const pixel* __restrict__ src2 = src+srcStride; + pixel* __restrict__ dst2 = dst+dstStride; + + int row, col; + for (row = 0; row < height; row+=2) + { + for (col = 0; col < width; col+=16) + { + vsrc = vec_xl(0, (unsigned char*)&src[col + 0*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + + vsumH = vsrcH * vcoeff0; + vsumL = vsrcL * vcoeff0; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 1*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff1; + vsumL += vsrcL * vcoeff1; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 2*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff2; + vsumL += vsrcL * vcoeff2; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 3*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff3; + vsumL += vsrcL * vcoeff3; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 4*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff4; + vsumL += vsrcL * vcoeff4; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 5*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff5; + vsumL += vsrcL * vcoeff5; + + vsrc = vec_xl(0, (unsigned char*)&src[col + 6*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff6; + vsumL += vsrcL * vcoeff6; + 
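+            // Last tap (coeff[7]) follows; the two half-row sums then get the rounding offset added,
+            // are shifted right by headRoom, clamped to [0, maxVal] via vec_max/vec_min, and packed
+            // back down to 16 bytes before the store to dst.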
+ vsrc = vec_xl(0, (unsigned char*)&src[col + 7*cStride]); + vsrcH = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskH ); + vsrcL = (vector signed short)vec_perm( vzero_u8, vsrc, vchar_to_short_maskL ); + vsumH += vsrcH * vcoeff7; + vsumL += vsrcL * vcoeff7; + + vector short vvalH = (vsumH + voffset) >> vheadRoom; + vvalH = vec_max( vvalH, vzero_s16 ); + vvalH = vec_min( vvalH, vmaxVal ); + + vector short vvalL = (vsumL + voffset) >> vheadRoom; + vvalL = vec_max( vvalL, vzero_s16 ); + vvalL = vec_min( vvalL, vmaxVal ); + + vector signed char vdst = vec_pack( vvalL, vvalH ); + vec_xst( vdst, 0, (signed char*)&dst[col] ); + + + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 0*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + + vsum2H = vsrc2H * vcoeff0; + vsum2L = vsrc2L * vcoeff0; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 1*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff1; + vsum2L += vsrc2L * vcoeff1; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 2*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff2; + vsum2L += vsrc2L * vcoeff2; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 3*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff3; + vsum2L += vsrc2L * vcoeff3; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 4*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff4; + vsum2L += vsrc2L * vcoeff4; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 5*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff5; + vsum2L += vsrc2L * vcoeff5; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 6*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff6; + vsum2L += vsrc2L * vcoeff6; + + vsrc2 = vec_xl(0, (unsigned char*)&src2[col + 7*cStride]); + vsrc2H = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskH ); + vsrc2L = (vector signed short)vec_perm( vzero_u8, vsrc2, vchar_to_short_maskL ); + vsum2H += vsrc2H * vcoeff7; + vsum2L += vsrc2L * vcoeff7; + + vector short vval2H = (vsum2H + voffset) >> vheadRoom; + vval2H = vec_max( vval2H, vzero_s16 ); + vval2H = vec_min( vval2H, vmaxVal ); + + vector short vval2L = (vsum2L + voffset) >> vheadRoom; + vval2L = vec_max( vval2L, vzero_s16 ); + vval2L = vec_min( vval2L, vmaxVal ); + + vector signed char vdst2 = vec_pack( vval2L, vval2H ); + vec_xst( vdst2, 0, (signed char*)&dst2[col] ); + } + + src += 2*srcStride; + dst += 2*dstStride; + + src2 += 2*srcStride; + dst2 += 2*dstStride; + } +} +#endif + + +// Works with the following values: +// N = 8 +// width >= 32 (multiple of 32) +// any 
height +//template <int N, int width, int height> +//void interp_horiz_pp_altivec(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx) +//{ +// +// const int16_t* coeff = (N == 4) ? g_chromaFilter[coeffIdx] : g_lumaFilter[coeffIdx]; +// int headRoom = IF_FILTER_PREC; +// int offset = (1 << (headRoom - 1)); +// uint16_t maxVal = (1 << X265_DEPTH) - 1; +// int cStride = 1; +// +// src -= (N / 2 - 1) * cStride; +// +// +// vector signed short v_coeff ; +// v_coeff = vec_xl(0, coeff) ; +// +// +// vector unsigned char v_pixel_char_0, v_pixel_char_1, v_pixel_char_2 ; +// vector signed short v_pixel_short_0, v_pixel_short_1, v_pixel_short_2, v_pixel_short_3, v_pixel_short_4 ; +// const vector signed short v_mask_unisgned_char_to_short = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; +// const vector signed int v_zeros_int = {0, 0, 0, 0} ; +// const vector signed short v_zeros_short = {0, 0, 0, 0, 0, 0, 0, 0} ; +// +// vector signed int v_product_0_0, v_product_0_1 ; +// vector signed int v_product_1_0, v_product_1_1 ; +// vector signed int v_product_2_0, v_product_2_1 ; +// vector signed int v_product_3_0, v_product_3_1 ; +// +// vector signed int v_sum_0, v_sum_1, v_sum_2, v_sum_3 ; +// +// vector signed int v_sums_temp_col0, v_sums_temp_col1, v_sums_temp_col2, v_sums_temp_col3 ; +// vector signed int v_sums_col0_0, v_sums_col0_1 ; +// vector signed int v_sums_col1_0, v_sums_col1_1 ; +// vector signed int v_sums_col2_0, v_sums_col2_1 ; +// vector signed int v_sums_col3_0, v_sums_col3_1 ; +// +// +// const vector signed int v_offset = {offset, offset, offset, offset}; +// const vector unsigned int v_headRoom = {headRoom, headRoom, headRoom, headRoom} ; +// +// +// vector unsigned char v_sums_shamt = {0x20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ; +// +// +// pixel *next_src ; +// pixel *next_dst ; +// +// int row, col; +// for (row = 0; row < height; row++) +// { +// next_src = (pixel *)src + srcStride ; +// next_dst = (pixel *)dst + dstStride ; +// +// for(int col_iter=0; col_iter<width; col_iter+=32) +// { +// +// // Load a full row of pixels (32 + 7) +// v_pixel_char_0 = vec_xl(0, src) ; +// v_pixel_char_1 = vec_xl(16, src) ; +// v_pixel_char_2 = vec_xl(32, src) ; +// +// +// v_sums_temp_col0 = v_zeros_int ; +// v_sums_temp_col1 = v_zeros_int ; +// v_sums_temp_col2 = v_zeros_int ; +// v_sums_temp_col3 = v_zeros_int ; +// +// +// // Expand the loaded pixels into shorts +// v_pixel_short_0 = vec_unpackh((vector signed char)v_pixel_char_0) ; +// v_pixel_short_1 = vec_unpackl((vector signed char)v_pixel_char_0) ; +// v_pixel_short_2 = vec_unpackh((vector signed char)v_pixel_char_1) ; +// v_pixel_short_3 = vec_unpackl((vector signed char)v_pixel_char_1) ; +// v_pixel_short_4 = vec_unpackh((vector signed char)v_pixel_char_2) ; +// +// v_pixel_short_0 = vec_and(v_pixel_short_0, v_mask_unisgned_char_to_short) ; +// v_pixel_short_1 = vec_and(v_pixel_short_1, v_mask_unisgned_char_to_short) ; +// v_pixel_short_2 = vec_and(v_pixel_short_2, v_mask_unisgned_char_to_short) ; +// v_pixel_short_3 = vec_and(v_pixel_short_3, v_mask_unisgned_char_to_short) ; +// v_pixel_short_4 = vec_and(v_pixel_short_4, v_mask_unisgned_char_to_short) ; +// +// +// +// // Four colum sets are processed below +// // One colum per set per iteration +// for(col=0; col < 8; col++) +// { +// +// // Multiply the pixels by the coefficients +// v_product_0_0 = vec_mule(v_pixel_short_0, v_coeff) ; +// v_product_0_1 = vec_mulo(v_pixel_short_0, v_coeff) ; +// +// v_product_1_0 = 
vec_mule(v_pixel_short_1, v_coeff) ; +// v_product_1_1 = vec_mulo(v_pixel_short_1, v_coeff) ; +// +// v_product_2_0 = vec_mule(v_pixel_short_2, v_coeff) ; +// v_product_2_1 = vec_mulo(v_pixel_short_2, v_coeff) ; +// +// v_product_3_0 = vec_mule(v_pixel_short_3, v_coeff) ; +// v_product_3_1 = vec_mulo(v_pixel_short_3, v_coeff) ; +// +// +// // Sum up the multiplication results +// v_sum_0 = vec_add(v_product_0_0, v_product_0_1) ; +// v_sum_0 = vec_sums(v_sum_0, v_zeros_int) ; +// +// v_sum_1 = vec_add(v_product_1_0, v_product_1_1) ; +// v_sum_1 = vec_sums(v_sum_1, v_zeros_int) ; +// +// v_sum_2 = vec_add(v_product_2_0, v_product_2_1) ; +// v_sum_2 = vec_sums(v_sum_2, v_zeros_int) ; +// +// v_sum_3 = vec_add(v_product_3_0, v_product_3_1) ; +// v_sum_3 = vec_sums(v_sum_3, v_zeros_int) ; +// +// +// // Insert the sum results into respective vectors +// v_sums_temp_col0 = vec_sro(v_sums_temp_col0, v_sums_shamt) ; +// v_sums_temp_col0 = vec_or(v_sum_0, v_sums_temp_col0) ; +// +// v_sums_temp_col1 = vec_sro(v_sums_temp_col1, v_sums_shamt) ; +// v_sums_temp_col1 = vec_or(v_sum_1, v_sums_temp_col1) ; +// +// v_sums_temp_col2 = vec_sro(v_sums_temp_col2, v_sums_shamt) ; +// v_sums_temp_col2 = vec_or(v_sum_2, v_sums_temp_col2) ; +// +// v_sums_temp_col3 = vec_sro(v_sums_temp_col3, v_sums_shamt) ; +// v_sums_temp_col3 = vec_or(v_sum_3, v_sums_temp_col3) ; +// +// +// if(col == 3) +// { +// v_sums_col0_0 = v_sums_temp_col0 ; +// v_sums_col1_0 = v_sums_temp_col1 ; +// v_sums_col2_0 = v_sums_temp_col2 ; +// v_sums_col3_0 = v_sums_temp_col3 ; +// +// v_sums_temp_col0 = v_zeros_int ; +// v_sums_temp_col1 = v_zeros_int ; +// v_sums_temp_col2 = v_zeros_int ; +// v_sums_temp_col3 = v_zeros_int ; +// } +// +// +// // Shift the pixels by 1 (short pixel) +// v_pixel_short_0 = vec_sld(v_pixel_short_1, v_pixel_short_0, 14) ; +// v_pixel_short_1 = vec_sld(v_pixel_short_2, v_pixel_short_1, 14) ; +// v_pixel_short_2 = vec_sld(v_pixel_short_3, v_pixel_short_2, 14) ; +// v_pixel_short_3 = vec_sld(v_pixel_short_4, v_pixel_short_3, 14) ; +// const vector unsigned char v_shift_right_two_bytes_shamt = {0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ; +// v_pixel_short_4 = vec_sro(v_pixel_short_4, v_shift_right_two_bytes_shamt) ; +// } +// +// // Copy the sums result to the second vector (per colum) +// v_sums_col0_1 = v_sums_temp_col0 ; +// v_sums_col1_1 = v_sums_temp_col1 ; +// v_sums_col2_1 = v_sums_temp_col2 ; +// v_sums_col3_1 = v_sums_temp_col3 ; +// +// +// +// // Post processing and eventually 2 stores +// // Original code: +// // int16_t val = (int16_t)((sum + offset) >> headRoom); +// // if (val < 0) val = 0; +// // if (val > maxVal) val = maxVal; +// // dst[col] = (pixel)val; +// +// +// v_sums_col0_0 = vec_sra(vec_add(v_sums_col0_0, v_offset), v_headRoom) ; +// v_sums_col0_1 = vec_sra(vec_add(v_sums_col0_1, v_offset), v_headRoom) ; +// v_sums_col1_0 = vec_sra(vec_add(v_sums_col1_0, v_offset), v_headRoom) ; +// v_sums_col1_1 = vec_sra(vec_add(v_sums_col1_1, v_offset), v_headRoom) ; +// v_sums_col2_0 = vec_sra(vec_add(v_sums_col2_0, v_offset), v_headRoom) ; +// v_sums_col2_1 = vec_sra(vec_add(v_sums_col2_1, v_offset), v_headRoom) ; +// v_sums_col3_0 = vec_sra(vec_add(v_sums_col3_0, v_offset), v_headRoom) ; +// v_sums_col3_1 = vec_sra(vec_add(v_sums_col3_1, v_offset), v_headRoom) ; +// +// +// vector signed short v_val_col0, v_val_col1, v_val_col2, v_val_col3 ; +// v_val_col0 = vec_pack(v_sums_col0_0, v_sums_col0_1) ; +// v_val_col1 = vec_pack(v_sums_col1_0, v_sums_col1_1) ; +// v_val_col2 = 
vec_pack(v_sums_col2_0, v_sums_col2_1) ; +// v_val_col3 = vec_pack(v_sums_col3_0, v_sums_col3_1) ; +// +// +// // if (val < 0) val = 0; +// vector bool short v_comp_zero_col0, v_comp_zero_col1, v_comp_zero_col2, v_comp_zero_col3 ; +// // Compute less than 0 +// v_comp_zero_col0 = vec_cmplt(v_val_col0, v_zeros_short) ; +// v_comp_zero_col1 = vec_cmplt(v_val_col1, v_zeros_short) ; +// v_comp_zero_col2 = vec_cmplt(v_val_col2, v_zeros_short) ; +// v_comp_zero_col3 = vec_cmplt(v_val_col3, v_zeros_short) ; +// // Keep values that are greater or equal to 0 +// v_val_col0 = vec_andc(v_val_col0, v_comp_zero_col0) ; +// v_val_col1 = vec_andc(v_val_col1, v_comp_zero_col1) ; +// v_val_col2 = vec_andc(v_val_col2, v_comp_zero_col2) ; +// v_val_col3 = vec_andc(v_val_col3, v_comp_zero_col3) ; +// +// +// // if (val > maxVal) val = maxVal; +// vector bool short v_comp_max_col0, v_comp_max_col1, v_comp_max_col2, v_comp_max_col3 ; +// const vector signed short v_maxVal = {maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal, maxVal} ; +// // Compute greater than max +// v_comp_max_col0 = vec_cmpgt(v_val_col0, v_maxVal) ; +// v_comp_max_col1 = vec_cmpgt(v_val_col1, v_maxVal) ; +// v_comp_max_col2 = vec_cmpgt(v_val_col2, v_maxVal) ; +// v_comp_max_col3 = vec_cmpgt(v_val_col3, v_maxVal) ; +// // Replace values greater than maxVal with maxVal +// v_val_col0 = vec_sel(v_val_col0, v_maxVal, v_comp_max_col0) ; +// v_val_col1 = vec_sel(v_val_col1, v_maxVal, v_comp_max_col1) ; +// v_val_col2 = vec_sel(v_val_col2, v_maxVal, v_comp_max_col2) ; +// v_val_col3 = vec_sel(v_val_col3, v_maxVal, v_comp_max_col3) ; +// +// // (pixel)val +// vector unsigned char v_final_result_0, v_final_result_1 ; +// v_final_result_0 = vec_pack((vector unsigned short)v_val_col0, (vector unsigned short)v_val_col1) ; +// v_final_result_1 = vec_pack((vector unsigned short)v_val_col2, (vector unsigned short)v_val_col3) ; +// +// +// +// // Store results +// vec_xst(v_final_result_0, 0, dst) ; +// vec_xst(v_final_result_1, 16, dst) ; +// +// +// src += 32 ; +// dst += 32 ; +// +// } // end for col_iter +// +// +// src = next_src ; +// dst = next_dst ; +// } +//} // interp_horiz_pp_altivec() + + +namespace X265_NS { + +void setupFilterPrimitives_altivec(EncoderPrimitives& p) +{ + // interp_vert_pp_c + p.pu[LUMA_16x16].luma_vpp = interp_vert_pp_altivec<8, 16, 16> ; + p.pu[LUMA_32x8].luma_vpp = interp_vert_pp_altivec<8, 32, 8> ; + p.pu[LUMA_16x12].luma_vpp = interp_vert_pp_altivec<8, 16, 12> ; + p.pu[LUMA_16x4].luma_vpp = interp_vert_pp_altivec<8, 16, 4> ; + p.pu[LUMA_32x32].luma_vpp = interp_vert_pp_altivec<8, 32, 32> ; + p.pu[LUMA_32x16].luma_vpp = interp_vert_pp_altivec<8, 32, 16> ; + p.pu[LUMA_16x32].luma_vpp = interp_vert_pp_altivec<8, 16, 32> ; + p.pu[LUMA_32x24].luma_vpp = interp_vert_pp_altivec<8, 32, 24> ; + p.pu[LUMA_32x8].luma_vpp = interp_vert_pp_altivec<8, 32, 8> ; + p.pu[LUMA_64x64].luma_vpp = interp_vert_pp_altivec<8, 64, 64> ; + p.pu[LUMA_64x32].luma_vpp = interp_vert_pp_altivec<8, 64, 32> ; + p.pu[LUMA_32x64].luma_vpp = interp_vert_pp_altivec<8, 32, 64> ; + p.pu[LUMA_64x48].luma_vpp = interp_vert_pp_altivec<8, 64, 48> ; + p.pu[LUMA_48x64].luma_vpp = interp_vert_pp_altivec<8, 48, 64> ; + p.pu[LUMA_64x16].luma_vpp = interp_vert_pp_altivec<8, 64, 16> ; + p.pu[LUMA_16x64].luma_vpp = interp_vert_pp_altivec<8, 16, 64> ; + + // interp_hv_pp_c + p.pu[LUMA_32x32].luma_hvpp = interp_hv_pp_altivec<8, 32, 32> ; + p.pu[LUMA_32x16].luma_hvpp = interp_hv_pp_altivec<8, 32, 16> ; + p.pu[LUMA_32x24].luma_hvpp = interp_hv_pp_altivec<8, 32, 24> ; + 
p.pu[LUMA_32x8].luma_hvpp = interp_hv_pp_altivec<8, 32, 8> ; + p.pu[LUMA_64x64].luma_hvpp = interp_hv_pp_altivec<8, 64, 64> ; + p.pu[LUMA_64x32].luma_hvpp = interp_hv_pp_altivec<8, 64, 32> ; + p.pu[LUMA_32x64].luma_hvpp = interp_hv_pp_altivec<8, 32, 64> ; + p.pu[LUMA_64x48].luma_hvpp = interp_hv_pp_altivec<8, 64, 48> ; + p.pu[LUMA_64x16].luma_hvpp = interp_hv_pp_altivec<8, 64, 16> ; + + // interp_horiz_pp_c + p.pu[LUMA_32x32].luma_hpp = interp_horiz_pp_altivec<8, 32, 32> ; + p.pu[LUMA_32x16].luma_hpp = interp_horiz_pp_altivec<8, 32, 16> ; + p.pu[LUMA_32x24].luma_hpp = interp_horiz_pp_altivec<8, 32, 24> ; + p.pu[LUMA_32x8].luma_hpp = interp_horiz_pp_altivec<8, 32, 8> ; + p.pu[LUMA_64x64].luma_hpp = interp_horiz_pp_altivec<8, 64, 64> ; + p.pu[LUMA_64x32].luma_hpp = interp_horiz_pp_altivec<8, 64, 32> ; + p.pu[LUMA_32x64].luma_hpp = interp_horiz_pp_altivec<8, 32, 64> ; + p.pu[LUMA_64x48].luma_hpp = interp_horiz_pp_altivec<8, 64, 48> ; + p.pu[LUMA_64x16].luma_hpp = interp_horiz_pp_altivec<8, 64, 16> ; +} + +} // end namespace X265_NS
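For reference, the scalar loop these AltiVec luma filter kernels replace is the one quoted in the "Original code" comments above. A minimal sketch of that per-pixel logic, reusing the coeff, offset, headRoom and maxVal values set up at the top of interp_horiz_pp_altivec (this is an illustration of the arithmetic, not the upstream source verbatim):

    // Scalar reference for the horizontal pixel-to-pixel filter (sketch only).
    for (int row = 0; row < height; row++)
    {
        for (int col = 0; col < width; col++)
        {
            int sum = 0;
            for (int tap = 0; tap < N; tap++)          // N = 8 taps for luma
                sum += src[col + tap] * coeff[tap];

            int16_t val = (int16_t)((sum + offset) >> headRoom);
            if (val < 0)      val = 0;
            if (val > maxVal) val = maxVal;
            dst[col] = (pixel)val;
        }
        src += srcStride;
        dst += dstStride;
    }

The vectorized routine above performs the same arithmetic for 16 output pixels and two rows per iteration: the eight-tap inner loop becomes eight multiply-accumulates against splatted coefficients, and the clamp becomes vec_max/vec_min against zero and maxVal before packing to bytes.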
View file
x265_2.2.tar.gz/source/common/ppc/pixel_altivec.cpp
Added
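The file added below provides AltiVec/VSX implementations of x265's SAD (sum of absolute differences) primitives, including the _x3/_x4 variants that score three or four reference blocks against one source block in a single pass. As a minimal scalar sketch of what each kernel computes (the name sad_ref here is illustrative, not an upstream symbol; lx and ly are the block width and height):

    // Scalar sum-of-absolute-differences reference (sketch only).
    template<int lx, int ly>
    int sad_ref(const pixel* pix1, intptr_t stride_pix1,
                const pixel* pix2, intptr_t stride_pix2)
    {
        int sum = 0;
        for (int y = 0; y < ly; y++)
        {
            for (int x = 0; x < lx; x++)
                sum += abs(pix1[x] - pix2[x]);
            pix1 += stride_pix1;
            pix2 += stride_pix2;
        }
        return sum;
    }

The vector code below computes |a - b| on unsigned bytes as vec_sub(vec_max(a, b), vec_min(a, b)) (the //@@RM comments note that vec_abs on a signed subtraction would give the wrong result and be slower), accumulates with vec_sum4s, and reduces the partial sums per block width through the sum_columns_altivec specializations.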
@@ -0,0 +1,4321 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Steve Borho <steve@borho.org> + * Mandar Gurav <mandar@multicorewareinc.com> + * Mahesh Pittala <mahesh@multicorewareinc.com> + * Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#include "common.h" +#include "primitives.h" +#include "x265.h" +#include "ppccommon.h" + +#include <cstdlib> // abs() + +//using namespace X265_NS; + +namespace X265_NS { +// place functions in anonymous namespace (file static) + + /* Null vector */ +#define LOAD_ZERO const vec_u8_t zerov = vec_splat_u8( 0 ) + +#define zero_u8v (vec_u8_t) zerov +#define zero_s8v (vec_s8_t) zerov +#define zero_u16v (vec_u16_t) zerov +#define zero_s16v (vec_s16_t) zerov +#define zero_u32v (vec_u32_t) zerov +#define zero_s32v (vec_s32_t) zerov + + /* 8 <-> 16 bits conversions */ +#ifdef WORDS_BIGENDIAN +#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( zero_u8v, (vec_u8_t) v ) +#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( zero_u8v, (vec_u8_t) v ) +#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( zero_u8v, (vec_u8_t) v ) +#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( zero_u8v, (vec_u8_t) v ) +#else +#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( (vec_u8_t) v, zero_u8v ) +#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( (vec_u8_t) v, zero_u8v ) +#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( (vec_u8_t) v, zero_u8v ) +#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( (vec_u8_t) v, zero_u8v ) +#endif + +#define vec_u8_to_u16(v) vec_u8_to_u16_h(v) +#define vec_u8_to_s16(v) vec_u8_to_s16_h(v) + +#if defined(__GNUC__) +#define ALIGN_VAR_8(T, var) T var __attribute__((aligned(8))) +#define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16))) +#define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32))) +#elif defined(_MSC_VER) +#define ALIGN_VAR_8(T, var) __declspec(align(8)) T var +#define ALIGN_VAR_16(T, var) __declspec(align(16)) T var +#define ALIGN_VAR_32(T, var) __declspec(align(32)) T var +#endif // if defined(__GNUC__) + +typedef uint8_t pixel; +typedef uint32_t sum2_t ; +typedef uint16_t sum_t ; +#define BITS_PER_SUM (8 * sizeof(sum_t)) + +/*********************************************************************** + * SAD routines - altivec implementation + **********************************************************************/ +template<int lx, int ly> +void inline sum_columns_altivec(vec_s32_t sumv, int* sum){} + +template<int lx, int ly> +int inline sad16_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, 
intptr_t stride_pix2) +{ + assert(lx <=16); + LOAD_ZERO; + vec_u8_t pix1v, pix2v; + vec_u8_t absv = zero_u8v; + vec_s32_t sumv = zero_s32v; + ALIGN_VAR_16(int, sum ); + + for( int y = 0; y < ly; y++ ) + { + pix1v = /*vec_vsx_ld*/vec_xl( 0, pix1); + pix2v = /*vec_vsx_ld*/vec_xl( 0, pix2); + //print_vec_u8("pix1v", &pix1v); + //print_vec_u8("pix2v", &pix2v); + + absv = (vector unsigned char)vec_sub(vec_max(pix1v, pix2v), vec_min(pix1v, pix2v)); + //print_vec_u8("abs sub", &absv); + + sumv = (vec_s32_t) vec_sum4s( absv, (vec_u32_t) sumv); + //print_vec_i("vec_sum4s 0", &sumv); + + pix1 += stride_pix1; + pix2 += stride_pix2; + } + + sum_columns_altivec<lx, ly>(sumv, &sum); + //printf("<%d %d>%d\n", lx, ly, sum); + return sum; +} + +template<int lx, int ly> //to be implemented later +int sad16_altivec(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2) +{ + int sum = 0; + return sum; +} + +template<int lx, int ly>//to be implemented later +int sad_altivec(const int16_t* pix1, intptr_t stride_pix1, const int16_t* pix2, intptr_t stride_pix2) +{ + int sum = 0; + return sum; +} + +template<> +void inline sum_columns_altivec<16, 4>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 8>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 12>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 16>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 24>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 32>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 48>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<16, 64>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 3", &sumv); + vec_ste( sumv, 0, sum ); +} + + +template<> +void inline sum_columns_altivec<8, 4>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sum2s( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 1 ); + //print_vec_i("vec_splat 1", &sumv); + vec_ste( sumv, 0, sum ); 
+} + +template<> +void inline sum_columns_altivec<8, 8>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sum2s( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 1 ); + //print_vec_i("vec_splat 1", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<8, 16>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sum2s( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 1 ); + //print_vec_i("vec_splat 1", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<8, 32>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_sum2s( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 1 ); + //print_vec_i("vec_splat 1", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<4, 4>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_splat( sumv, 0 ); + //print_vec_i("vec_splat 0", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<4, 8>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_splat( sumv, 0 ); + //print_vec_i("vec_splat 0", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<4, 16>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + sumv = vec_splat( sumv, 0 ); + //print_vec_i("vec_splat 0", &sumv); + vec_ste( sumv, 0, sum ); +} + +template<> +void inline sum_columns_altivec<12, 16>(vec_s32_t sumv, int* sum) +{ + LOAD_ZERO; + vec_s32_t sum1v= vec_splat( sumv, 3); + sumv = vec_sums( sumv, zero_s32v ); + //print_vec_i("vec_sums", &sumv); + sumv = vec_splat( sumv, 3 ); + //print_vec_i("vec_splat 1", &sumv); + sumv = vec_sub(sumv, sum1v); + vec_ste( sumv, 0, sum ); +} + +template<int lx, int ly> +int inline sad_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2){ return 0; } + +template<> +int inline sad_altivec<24, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 32>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<8, 32>(pix1+16, stride_pix1, pix2+16, stride_pix2); + //printf("<24 32>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<32, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 8>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 8>(pix1+16, stride_pix1, pix2+16, stride_pix2); + //printf("<32 8>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<32, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 16>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 16>(pix1+16, stride_pix1, pix2+16, stride_pix2); + //printf("<32 16>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<32, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 24>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 24>(pix1+16, stride_pix1, pix2+16, stride_pix2); + //printf("<32 24>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<32, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 32>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 32>(pix1+16, stride_pix1, pix2+16, 
stride_pix2); + //printf("<32 32>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<32, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 64>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 64>(pix1+16, stride_pix1, pix2+16, stride_pix2); + //printf("<32 64>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<48, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 64>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 64>(pix1+16, stride_pix1, pix2+16, stride_pix2) + + sad16_altivec<16, 64>(pix1+32, stride_pix1, pix2+32, stride_pix2); + //printf("<48 64>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<64, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 16>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 16>(pix1+16, stride_pix1, pix2+16, stride_pix2) + + sad16_altivec<16, 16>(pix1+32, stride_pix1, pix2+32, stride_pix2) + + sad16_altivec<16, 16>(pix1+48, stride_pix1, pix2+48, stride_pix2); + //printf("<64 16>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<64, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 32>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 32>(pix1+16, stride_pix1, pix2+16, stride_pix2) + + sad16_altivec<16, 32>(pix1+32, stride_pix1, pix2+32, stride_pix2) + + sad16_altivec<16, 32>(pix1+48, stride_pix1, pix2+48, stride_pix2); + //printf("<64 32>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<64, 48>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 48>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 48>(pix1+16, stride_pix1, pix2+16, stride_pix2) + + sad16_altivec<16, 48>(pix1+32, stride_pix1, pix2+32, stride_pix2) + + sad16_altivec<16, 48>(pix1+48, stride_pix1, pix2+48, stride_pix2); + //printf("<64 48>%d\n", sum); + return sum; +} + +template<> +int inline sad_altivec<64, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16(int, sum ); + sum = sad16_altivec<16, 64>(pix1, stride_pix1, pix2, stride_pix2) + + sad16_altivec<16, 64>(pix1+16, stride_pix1, pix2+16, stride_pix2) + + sad16_altivec<16, 64>(pix1+32, stride_pix1, pix2+32, stride_pix2) + + sad16_altivec<16, 64>(pix1+48, stride_pix1, pix2+48, stride_pix2); + //printf("<64 64>%d\n", sum); + return sum; +} + +/*********************************************************************** + * SAD_X3 routines - altivec implementation + **********************************************************************/ +template<int lx, int ly> +void inline sad16_x3_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + res[0] = 0; + res[1] = 0; + res[2] = 0; + assert(lx <=16); + LOAD_ZERO; + vec_u8_t pix1v, pix2v, pix3v, pix4v; + vec_u8_t absv1_2 = zero_u8v; + vec_u8_t absv1_3 = zero_u8v; + vec_u8_t absv1_4 = zero_u8v; + vec_s32_t sumv0 = zero_s32v; + vec_s32_t sumv1 = zero_s32v; + vec_s32_t sumv2 = zero_s32v; + + for( int y = 0; y < ly; y++ ) + { + pix1v = vec_xl( 0, pix1); //@@RM vec_vsx_ld( 0, pix1); + pix2v = vec_xl( 0, 
pix2); //@@RM vec_vsx_ld( 0, pix2); + pix3v = vec_xl( 0, pix3); //@@RM vec_vsx_ld( 0, pix3); + pix4v = vec_xl( 0, pix4); //@@RM vec_vsx_ld( 0, pix4); + + //@@RM : using vec_abs has 2 drawbacks here: + //@@RM first, it produces the incorrect result (unpack should be used first) + //@@RM second, it is slower than sub(max,min), as noted in freescale's documentation + //@@RM absv = (vector unsigned char)vec_abs((vector signed char)vec_sub(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v)); + absv1_2 = (vector unsigned char)vec_sub(vec_max(pix1v, pix2v), vec_min(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v)); + sumv0 = (vec_s32_t) vec_sum4s( absv1_2, (vec_u32_t) sumv0); + + absv1_3 = (vector unsigned char)vec_sub(vec_max(pix1v, pix3v), vec_min(pix1v, pix3v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v)); + sumv1 = (vec_s32_t) vec_sum4s( absv1_3, (vec_u32_t) sumv1); + + absv1_4 = (vector unsigned char)vec_sub(vec_max(pix1v, pix4v), vec_min(pix1v, pix4v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v)); + sumv2 = (vec_s32_t) vec_sum4s( absv1_4, (vec_u32_t) sumv2); + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + } + + sum_columns_altivec<lx, ly>(sumv0, res+0); + sum_columns_altivec<lx, ly>(sumv1, res+1); + sum_columns_altivec<lx, ly>(sumv2, res+2); + //printf("<%d %d>%d %d %d\n", lx, ly, res[0], res[1], res[2]); +} + +template<int lx, int ly> +void inline sad_x3_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res){} + +template<> +void inline sad_x3_altivec<24, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[3]; + sad16_x3_altivec<16, 32>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<8, 32>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + //printf("<24 32>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<32, 8>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[3]; + sad16_x3_altivec<16, 8>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 8>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + //printf("<32 8>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<32, 16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[3]; + sad16_x3_altivec<16, 16>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 16>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + //printf("<32 16>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<32, 24>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[3]; + sad16_x3_altivec<16, 24>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 24>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + //printf("<32 24>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void sad_x3_altivec<32, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + + const int lx 
= 32 ; + const int ly = 32 ; + + vector unsigned int v_zeros = {0, 0, 0, 0} ; + + vector signed short v_results_0 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_1 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_2 = {0, 0, 0, 0, 0, 0, 0, 0} ; + + + vector signed int v_results_int_0 ; + vector signed int v_results_int_1 ; + vector signed int v_results_int_2 ; + + vector unsigned char v_pix1 ; + vector unsigned char v_pix2 ; + vector unsigned char v_pix3 ; + vector unsigned char v_pix4 ; + + vector unsigned char v_abs_diff_0 ; + vector unsigned char v_abs_diff_1 ; + vector unsigned char v_abs_diff_2 ; + + vector signed short v_unpack_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; + + vector signed short v_short_0_0 , v_short_0_1 ; + vector signed short v_short_1_0 , v_short_1_1 ; + vector signed short v_short_2_0 , v_short_2_1 ; + + vector signed short v_sum_0 ; + vector signed short v_sum_1 ; + vector signed short v_sum_2 ; + + + + res[0] = 0; + res[1] = 0; + res[2] = 0; + for (int y = 0; y < ly; y++) + { + for (int x = 0; x < lx; x+=16) + { + v_pix1 = vec_xl(x, pix1) ; + + // for(int ii=0; ii<16; ii++) { res[0] += abs(pix1[x + ii] - pix2[x + ii]); } + v_pix2 = vec_xl(x, pix2) ; + v_abs_diff_0 = vec_sub(vec_max(v_pix1, v_pix2), vec_min(v_pix1, v_pix2)) ; + v_short_0_0 = vec_unpackh((vector signed char)v_abs_diff_0) ; + v_short_0_0 = vec_and(v_short_0_0, v_unpack_mask) ; + v_short_0_1 = vec_unpackl((vector signed char)v_abs_diff_0) ; + v_short_0_1 = vec_and(v_short_0_1, v_unpack_mask) ; + v_sum_0 = vec_add(v_short_0_0, v_short_0_1) ; + v_results_0 = vec_add(v_results_0, v_sum_0) ; + + // for(int ii=0; ii<16; ii++) { res[1] += abs(pix1[x + ii] - pix3[x + ii]); } + v_pix3 = vec_xl(x, pix3) ; + v_abs_diff_1 = vec_sub(vec_max(v_pix1, v_pix3), vec_min(v_pix1, v_pix3)) ; + v_short_1_0 = vec_unpackh((vector signed char)v_abs_diff_1) ; + v_short_1_0 = vec_and(v_short_1_0, v_unpack_mask) ; + v_short_1_1 = vec_unpackl((vector signed char)v_abs_diff_1) ; + v_short_1_1 = vec_and(v_short_1_1, v_unpack_mask) ; + v_sum_1 = vec_add(v_short_1_0, v_short_1_1) ; + v_results_1 = vec_add(v_results_1, v_sum_1) ; + + + // for(int ii=0; ii<16; ii++) { res[2] += abs(pix1[x + ii] - pix4[x + ii]); } + v_pix4 = vec_xl(x, pix4) ; + v_abs_diff_2 = vec_sub(vec_max(v_pix1, v_pix4), vec_min(v_pix1, v_pix4)) ; + v_short_2_0 = vec_unpackh((vector signed char)v_abs_diff_2) ; + v_short_2_0 = vec_and(v_short_2_0, v_unpack_mask) ; + v_short_2_1 = vec_unpackl((vector signed char)v_abs_diff_2) ; + v_short_2_1 = vec_and(v_short_2_1, v_unpack_mask) ; + v_sum_2 = vec_add(v_short_2_0, v_short_2_1) ; + v_results_2 = vec_add(v_results_2, v_sum_2) ; + + } + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + } + + + v_results_int_0 = vec_sum4s((vector signed short)v_results_0, (vector signed int)v_zeros) ; + v_results_int_0 = vec_sums(v_results_int_0, (vector signed int)v_zeros) ; + res[0] = v_results_int_0[3] ; + + + v_results_int_1 = vec_sum4s((vector signed short)v_results_1, (vector signed int)v_zeros) ; + v_results_int_1 = vec_sums(v_results_int_1, (vector signed int)v_zeros) ; + res[1] = v_results_int_1[3] ; + + + v_results_int_2 = vec_sum4s((vector signed short)v_results_2, (vector signed int)v_zeros) ; + v_results_int_2 = vec_sums(v_results_int_2, (vector signed int)v_zeros) ; + res[2] = v_results_int_2[3] ; + + //printf("<32 32>%d %d %d\n", res[0], res[1], res[2]); + +} // end sad_x3_altivec + +template<> +void inline sad_x3_altivec<32, 
64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[3]; + sad16_x3_altivec<16, 64>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + //printf("<32 64>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<48, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[6]; + sad16_x3_altivec<16, 64>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3); + sad16_x3_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, res); + res[0] = sum[0]+sum[3]+res[0]; + res[1] = sum[1]+sum[4]+res[1]; + res[2] = sum[2]+sum[5]+res[2]; + //printf("<48 64>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<64, 16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[9]; + sad16_x3_altivec<16, 16>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 16>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3); + sad16_x3_altivec<16, 16>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6); + sad16_x3_altivec<16, 16>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res); + res[0] = sum[0]+sum[3]+sum[6]+res[0]; + res[1] = sum[1]+sum[4]+sum[7]+res[1]; + res[2] = sum[2]+sum[5]+sum[8]+res[2]; + //printf("<64 16>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<64, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[9]; + sad16_x3_altivec<16, 32>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 32>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3); + sad16_x3_altivec<16, 32>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6); + sad16_x3_altivec<16, 32>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res); + res[0] = sum[0]+sum[3]+sum[6]+res[0]; + res[1] = sum[1]+sum[4]+sum[7]+res[1]; + res[2] = sum[2]+sum[5]+sum[8]+res[2]; + //printf("<64 32>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<64, 48>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[9]; + sad16_x3_altivec<16, 48>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 48>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3); + sad16_x3_altivec<16, 48>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6); + sad16_x3_altivec<16, 48>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res); + res[0] = sum[0]+sum[3]+sum[6]+res[0]; + res[1] = sum[1]+sum[4]+sum[7]+res[1]; + res[2] = sum[2]+sum[5]+sum[8]+res[2]; + //printf("<64 48>%d %d %d\n", res[0], res[1], res[2]); +} + +template<> +void inline sad_x3_altivec<64, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res) +{ + int32_t sum[9]; + sad16_x3_altivec<16, 64>(pix1, pix2, pix3, pix4, frefstride, sum); + sad16_x3_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, frefstride, sum+3); + sad16_x3_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, frefstride, sum+6); + sad16_x3_altivec<16, 64>(pix1+48, pix2+48, pix3+48, pix4+48, frefstride, res); + res[0] = sum[0]+sum[3]+sum[6]+res[0]; + 
res[1] = sum[1]+sum[4]+sum[7]+res[1]; + res[2] = sum[2]+sum[5]+sum[8]+res[2]; + //printf("<64 64>%d %d %d\n", res[0], res[1], res[2]); +} + +/*********************************************************************** + * SAD_X4 routines - altivec implementation + **********************************************************************/ +template<int lx, int ly> +void inline sad16_x4_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + res[0] = 0; + res[1] = 0; + res[2] = 0; + assert(lx <=16); + LOAD_ZERO; + vec_u8_t pix1v, pix2v, pix3v, pix4v, pix5v; + vec_u8_t absv1_2 = zero_u8v; + vec_u8_t absv1_3 = zero_u8v; + vec_u8_t absv1_4 = zero_u8v; + vec_u8_t absv1_5 = zero_u8v; + vec_s32_t sumv0 = zero_s32v; + vec_s32_t sumv1 = zero_s32v; + vec_s32_t sumv2 = zero_s32v; + vec_s32_t sumv3 = zero_s32v; + + for( int y = 0; y < ly; y++ ) + { + pix1v = vec_xl( 0, pix1); //@@RM vec_vsx_ld( 0, pix1); + pix2v = vec_xl( 0, pix2); //@@RM vec_vsx_ld( 0, pix2); + pix3v = vec_xl( 0, pix3); //@@RM vec_vsx_ld( 0, pix3); + pix4v = vec_xl( 0, pix4); //@@RM vec_vsx_ld( 0, pix4); + pix5v = vec_xl( 0, pix5); //@@RM vec_vsx_ld( 0, pix4); + + //@@RM : using vec_abs has 2 drawbacks here: + //@@RM first, it produces the incorrect result (unpack should be used first) + //@@RM second, it is slower than sub(max,min), as noted in freescale's documentation + //@@RM absv = (vector unsigned char)vec_abs((vector signed char)vec_sub(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v)); + absv1_2 = (vector unsigned char)vec_sub(vec_max(pix1v, pix2v), vec_min(pix1v, pix2v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix2v)); + sumv0 = (vec_s32_t) vec_sum4s( absv1_2, (vec_u32_t) sumv0); + + absv1_3 = (vector unsigned char)vec_sub(vec_max(pix1v, pix3v), vec_min(pix1v, pix3v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v)); + sumv1 = (vec_s32_t) vec_sum4s( absv1_3, (vec_u32_t) sumv1); + + absv1_4 = (vector unsigned char)vec_sub(vec_max(pix1v, pix4v), vec_min(pix1v, pix4v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v)); + sumv2 = (vec_s32_t) vec_sum4s( absv1_4, (vec_u32_t) sumv2); + + absv1_5 = (vector unsigned char)vec_sub(vec_max(pix1v, pix5v), vec_min(pix1v, pix5v)); //@@RM vec_abs((vec_s8_t)vec_sub(pix1v, pix3v)); + sumv3 = (vec_s32_t) vec_sum4s( absv1_5, (vec_u32_t) sumv3); + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + pix5 += frefstride; + } + + sum_columns_altivec<lx, ly>(sumv0, res+0); + sum_columns_altivec<lx, ly>(sumv1, res+1); + sum_columns_altivec<lx, ly>(sumv2, res+2); + sum_columns_altivec<lx, ly>(sumv3, res+3); + //printf("<%d %d>%d %d %d %d\n", lx, ly, res[0], res[1], res[2], res[3]); +} + +template<int lx, int ly> +void inline sad_x4_altivec(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res){} + + +template<> +void inline sad_x4_altivec<24, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + int32_t sum[4]; + sad16_x4_altivec<16, 32>(pix1, pix2, pix3, pix4, pix5, frefstride, sum); + sad16_x4_altivec<8, 32>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + res[3] += sum[3]; + //printf("<24 32>%d %d %d %d\n", res[0], res[1], res[2], res[3]); +} + +template<> +void inline sad_x4_altivec<32, 8>(const pixel* pix1, const pixel* pix2, const 
pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + int32_t sum[4]; + sad16_x4_altivec<16, 8>(pix1, pix2, pix3, pix4, pix5, frefstride, sum); + sad16_x4_altivec<16, 8>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + res[3] += sum[3]; + //printf("<32 8>%d %d %d %d\n", res[0], res[1], res[2], res[3]); +} + +template<> +void sad_x4_altivec<32,16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + + const int lx = 32 ; + const int ly = 16 ; + + vector unsigned int v_zeros = {0, 0, 0, 0} ; + + vector signed short v_results_0 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_1 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_2 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_3 = {0, 0, 0, 0, 0, 0, 0, 0} ; + + + vector signed int v_results_int_0 ; + vector signed int v_results_int_1 ; + vector signed int v_results_int_2 ; + vector signed int v_results_int_3 ; + + vector unsigned char v_pix1 ; + vector unsigned char v_pix2 ; + vector unsigned char v_pix3 ; + vector unsigned char v_pix4 ; + vector unsigned char v_pix5 ; + + vector unsigned char v_abs_diff_0 ; + vector unsigned char v_abs_diff_1 ; + vector unsigned char v_abs_diff_2 ; + vector unsigned char v_abs_diff_3 ; + + vector signed short v_unpack_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; + + vector signed short v_short_0_0 , v_short_0_1 ; + vector signed short v_short_1_0 , v_short_1_1 ; + vector signed short v_short_2_0 , v_short_2_1 ; + vector signed short v_short_3_0 , v_short_3_1 ; + + vector signed short v_sum_0 ; + vector signed short v_sum_1 ; + vector signed short v_sum_2 ; + vector signed short v_sum_3 ; + + + res[0] = 0; + res[1] = 0; + res[2] = 0; + res[3] = 0; + for (int y = 0; y < ly; y++) + { + for (int x = 0; x < lx; x+=16) + { + v_pix1 = vec_xl(x, pix1) ; + + // for(int ii=0; ii<16; ii++) { res[0] += abs(pix1[x + ii] - pix2[x + ii]); } + v_pix2 = vec_xl(x, pix2) ; + v_abs_diff_0 = vec_sub(vec_max(v_pix1, v_pix2), vec_min(v_pix1, v_pix2)) ; + v_short_0_0 = vec_unpackh((vector signed char)v_abs_diff_0) ; + v_short_0_0 = vec_and(v_short_0_0, v_unpack_mask) ; + v_short_0_1 = vec_unpackl((vector signed char)v_abs_diff_0) ; + v_short_0_1 = vec_and(v_short_0_1, v_unpack_mask) ; + v_sum_0 = vec_add(v_short_0_0, v_short_0_1) ; + v_results_0 = vec_add(v_results_0, v_sum_0) ; + + // for(int ii=0; ii<16; ii++) { res[1] += abs(pix1[x + ii] - pix3[x + ii]); } + v_pix3 = vec_xl(x, pix3) ; + v_abs_diff_1 = vec_sub(vec_max(v_pix1, v_pix3), vec_min(v_pix1, v_pix3)) ; + v_short_1_0 = vec_unpackh((vector signed char)v_abs_diff_1) ; + v_short_1_0 = vec_and(v_short_1_0, v_unpack_mask) ; + v_short_1_1 = vec_unpackl((vector signed char)v_abs_diff_1) ; + v_short_1_1 = vec_and(v_short_1_1, v_unpack_mask) ; + v_sum_1 = vec_add(v_short_1_0, v_short_1_1) ; + v_results_1 = vec_add(v_results_1, v_sum_1) ; + + + // for(int ii=0; ii<16; ii++) { res[2] += abs(pix1[x + ii] - pix4[x + ii]); } + v_pix4 = vec_xl(x, pix4) ; + v_abs_diff_2 = vec_sub(vec_max(v_pix1, v_pix4), vec_min(v_pix1, v_pix4)) ; + v_short_2_0 = vec_unpackh((vector signed char)v_abs_diff_2) ; + v_short_2_0 = vec_and(v_short_2_0, v_unpack_mask) ; + v_short_2_1 = vec_unpackl((vector signed char)v_abs_diff_2) ; + v_short_2_1 = vec_and(v_short_2_1, v_unpack_mask) ; + v_sum_2 = vec_add(v_short_2_0, v_short_2_1) ; + v_results_2 = 
vec_add(v_results_2, v_sum_2) ; + + + // for(int ii=0; ii<16; ii++) { res[3] += abs(pix1[x + ii] - pix5[x + ii]); } + v_pix5 = vec_xl(x, pix5) ; + v_abs_diff_3 = vec_sub(vec_max(v_pix1, v_pix5), vec_min(v_pix1, v_pix5)) ; + v_short_3_0 = vec_unpackh((vector signed char)v_abs_diff_3) ; + v_short_3_0 = vec_and(v_short_3_0, v_unpack_mask) ; + v_short_3_1 = vec_unpackl((vector signed char)v_abs_diff_3) ; + v_short_3_1 = vec_and(v_short_3_1, v_unpack_mask) ; + v_sum_3 = vec_add(v_short_3_0, v_short_3_1) ; + v_results_3 = vec_add(v_results_3, v_sum_3) ; + } + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + pix5 += frefstride; + } + + + v_results_int_0 = vec_sum4s((vector signed short)v_results_0, (vector signed int)v_zeros) ; + v_results_int_0 = vec_sums(v_results_int_0, (vector signed int)v_zeros) ; + res[0] = v_results_int_0[3] ; + + + v_results_int_1 = vec_sum4s((vector signed short)v_results_1, (vector signed int)v_zeros) ; + v_results_int_1 = vec_sums(v_results_int_1, (vector signed int)v_zeros) ; + res[1] = v_results_int_1[3] ; + + + v_results_int_2 = vec_sum4s((vector signed short)v_results_2, (vector signed int)v_zeros) ; + v_results_int_2 = vec_sums(v_results_int_2, (vector signed int)v_zeros) ; + res[2] = v_results_int_2[3] ; + + + v_results_int_3 = vec_sum4s((vector signed short)v_results_3, (vector signed int)v_zeros) ; + v_results_int_3 = vec_sums(v_results_int_3, (vector signed int)v_zeros) ; + res[3] = v_results_int_3[3] ; + //printf("<32 16>%d %d %d %d\n", res[0], res[1], res[2], res[3]); +} // end sad_x4_altivec + +template<> +void inline sad_x4_altivec<32, 24>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + int32_t sum[4]; + sad16_x4_altivec<16, 24>(pix1, pix2, pix3, pix4, pix5, frefstride, sum); + sad16_x4_altivec<16, 24>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res); + res[0] += sum[0]; + res[1] += sum[1]; + res[2] += sum[2]; + res[3] += sum[3]; + //printf("<32 24>%d %d %d %d\n", res[0], res[1], res[2], res[3]); +} + +template<> +void sad_x4_altivec<32,32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + + const int lx = 32 ; + const int ly = 32 ; + + vector unsigned int v_zeros = {0, 0, 0, 0} ; + + vector signed short v_results_0 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_1 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_2 = {0, 0, 0, 0, 0, 0, 0, 0} ; + vector signed short v_results_3 = {0, 0, 0, 0, 0, 0, 0, 0} ; + + + vector signed int v_results_int_0 ; + vector signed int v_results_int_1 ; + vector signed int v_results_int_2 ; + vector signed int v_results_int_3 ; + + vector unsigned char v_pix1 ; + vector unsigned char v_pix2 ; + vector unsigned char v_pix3 ; + vector unsigned char v_pix4 ; + vector unsigned char v_pix5 ; + + vector unsigned char v_abs_diff_0 ; + vector unsigned char v_abs_diff_1 ; + vector unsigned char v_abs_diff_2 ; + vector unsigned char v_abs_diff_3 ; + + vector signed short v_unpack_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; + + vector signed short v_short_0_0 , v_short_0_1 ; + vector signed short v_short_1_0 , v_short_1_1 ; + vector signed short v_short_2_0 , v_short_2_1 ; + vector signed short v_short_3_0 , v_short_3_1 ; + + vector signed short v_sum_0 ; + vector signed short v_sum_1 ; + vector signed short v_sum_2 ; + vector signed short v_sum_3 ; + + + 
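+    // Accumulate |pix1 - pix2|, |pix1 - pix3|, |pix1 - pix4| and |pix1 - pix5| over the 32x32
+    // block, 16 bytes at a time, then reduce each running total into res[0..3] after the loops.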
res[0] = 0; + res[1] = 0; + res[2] = 0; + res[3] = 0; + for (int y = 0; y < ly; y++) + { + for (int x = 0; x < lx; x+=16) + { + v_pix1 = vec_xl(x, pix1) ; + + // for(int ii=0; ii<16; ii++) { res[0] += abs(pix1[x + ii] - pix2[x + ii]); } + v_pix2 = vec_xl(x, pix2) ; + v_abs_diff_0 = vec_sub(vec_max(v_pix1, v_pix2), vec_min(v_pix1, v_pix2)) ; + v_short_0_0 = vec_unpackh((vector signed char)v_abs_diff_0) ; + v_short_0_0 = vec_and(v_short_0_0, v_unpack_mask) ; + v_short_0_1 = vec_unpackl((vector signed char)v_abs_diff_0) ; + v_short_0_1 = vec_and(v_short_0_1, v_unpack_mask) ; + v_sum_0 = vec_add(v_short_0_0, v_short_0_1) ; + v_results_0 = vec_add(v_results_0, v_sum_0) ; + + // for(int ii=0; ii<16; ii++) { res[1] += abs(pix1[x + ii] - pix3[x + ii]); } + v_pix3 = vec_xl(x, pix3) ; + v_abs_diff_1 = vec_sub(vec_max(v_pix1, v_pix3), vec_min(v_pix1, v_pix3)) ; + v_short_1_0 = vec_unpackh((vector signed char)v_abs_diff_1) ; + v_short_1_0 = vec_and(v_short_1_0, v_unpack_mask) ; + v_short_1_1 = vec_unpackl((vector signed char)v_abs_diff_1) ; + v_short_1_1 = vec_and(v_short_1_1, v_unpack_mask) ; + v_sum_1 = vec_add(v_short_1_0, v_short_1_1) ; + v_results_1 = vec_add(v_results_1, v_sum_1) ; + + + // for(int ii=0; ii<16; ii++) { res[2] += abs(pix1[x + ii] - pix4[x + ii]); } + v_pix4 = vec_xl(x, pix4) ; + v_abs_diff_2 = vec_sub(vec_max(v_pix1, v_pix4), vec_min(v_pix1, v_pix4)) ; + v_short_2_0 = vec_unpackh((vector signed char)v_abs_diff_2) ; + v_short_2_0 = vec_and(v_short_2_0, v_unpack_mask) ; + v_short_2_1 = vec_unpackl((vector signed char)v_abs_diff_2) ; + v_short_2_1 = vec_and(v_short_2_1, v_unpack_mask) ; + v_sum_2 = vec_add(v_short_2_0, v_short_2_1) ; + v_results_2 = vec_add(v_results_2, v_sum_2) ; + + + // for(int ii=0; ii<16; ii++) { res[3] += abs(pix1[x + ii] - pix5[x + ii]); } + v_pix5 = vec_xl(x, pix5) ; + v_abs_diff_3 = vec_sub(vec_max(v_pix1, v_pix5), vec_min(v_pix1, v_pix5)) ; + v_short_3_0 = vec_unpackh((vector signed char)v_abs_diff_3) ; + v_short_3_0 = vec_and(v_short_3_0, v_unpack_mask) ; + v_short_3_1 = vec_unpackl((vector signed char)v_abs_diff_3) ; + v_short_3_1 = vec_and(v_short_3_1, v_unpack_mask) ; + v_sum_3 = vec_add(v_short_3_0, v_short_3_1) ; + v_results_3 = vec_add(v_results_3, v_sum_3) ; + } + + pix1 += FENC_STRIDE; + pix2 += frefstride; + pix3 += frefstride; + pix4 += frefstride; + pix5 += frefstride; + } + + + v_results_int_0 = vec_sum4s((vector signed short)v_results_0, (vector signed int)v_zeros) ; + v_results_int_0 = vec_sums(v_results_int_0, (vector signed int)v_zeros) ; + res[0] = v_results_int_0[3] ; + + + v_results_int_1 = vec_sum4s((vector signed short)v_results_1, (vector signed int)v_zeros) ; + v_results_int_1 = vec_sums(v_results_int_1, (vector signed int)v_zeros) ; + res[1] = v_results_int_1[3] ; + + + v_results_int_2 = vec_sum4s((vector signed short)v_results_2, (vector signed int)v_zeros) ; + v_results_int_2 = vec_sums(v_results_int_2, (vector signed int)v_zeros) ; + res[2] = v_results_int_2[3] ; + + + v_results_int_3 = vec_sum4s((vector signed short)v_results_3, (vector signed int)v_zeros) ; + v_results_int_3 = vec_sums(v_results_int_3, (vector signed int)v_zeros) ; + res[3] = v_results_int_3[3] ; + + //printf("<32 32>%d %d %d %d\n", res[0], res[1], res[2], res[3]); +} // end sad_x4_altivec + +template<> +void inline sad_x4_altivec<32, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res) +{ + int32_t sum[4]; + sad16_x4_altivec<16, 64>(pix1, pix2, pix3, pix4, pix5, frefstride, sum); 
+    sad16_x4_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, res);
+    res[0] += sum[0];
+    res[1] += sum[1];
+    res[2] += sum[2];
+    res[3] += sum[3];
+    //printf("<32 64>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void inline sad_x4_altivec<48, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+    int32_t sum[8];
+    sad16_x4_altivec<16, 64>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+    sad16_x4_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+    sad16_x4_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, res);
+    res[0] = sum[0]+sum[4]+res[0];
+    res[1] = sum[1]+sum[5]+res[1];
+    res[2] = sum[2]+sum[6]+res[2];
+    res[3] = sum[3]+sum[7]+res[3];
+    //printf("<48 64>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void inline sad_x4_altivec<64, 16>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+    int32_t sum[12];
+    sad16_x4_altivec<16, 16>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+    sad16_x4_altivec<16, 16>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+    sad16_x4_altivec<16, 16>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+    sad16_x4_altivec<16, 16>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+    res[0] = sum[0]+sum[4]+sum[8]+res[0];
+    res[1] = sum[1]+sum[5]+sum[9]+res[1];
+    res[2] = sum[2]+sum[6]+sum[10]+res[2];
+    res[3] = sum[3]+sum[7]+sum[11]+res[3];
+    //printf("<64 16>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void inline sad_x4_altivec<64, 32>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+    int32_t sum[12];
+    sad16_x4_altivec<16, 32>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+    sad16_x4_altivec<16, 32>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+    sad16_x4_altivec<16, 32>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+    sad16_x4_altivec<16, 32>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+    res[0] = sum[0]+sum[4]+sum[8]+res[0];
+    res[1] = sum[1]+sum[5]+sum[9]+res[1];
+    res[2] = sum[2]+sum[6]+sum[10]+res[2];
+    res[3] = sum[3]+sum[7]+sum[11]+res[3];
+    //printf("<64 32>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void inline sad_x4_altivec<64, 48>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+    int32_t sum[12];
+    sad16_x4_altivec<16, 48>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+    sad16_x4_altivec<16, 48>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+    sad16_x4_altivec<16, 48>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+    sad16_x4_altivec<16, 48>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+    res[0] = sum[0]+sum[4]+sum[8]+res[0];
+    res[1] = sum[1]+sum[5]+sum[9]+res[1];
+    res[2] = sum[2]+sum[6]+sum[10]+res[2];
+    res[3] = sum[3]+sum[7]+sum[11]+res[3];
+    //printf("<64 48>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+template<>
+void inline sad_x4_altivec<64, 64>(const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res)
+{
+    int32_t sum[12];
+    sad16_x4_altivec<16, 64>(pix1, pix2, pix3, pix4, pix5, frefstride, sum);
+    sad16_x4_altivec<16, 64>(pix1+16, pix2+16, pix3+16, pix4+16, pix5+16, frefstride, sum+4);
+    sad16_x4_altivec<16, 64>(pix1+32, pix2+32, pix3+32, pix4+32, pix5+32, frefstride, sum+8);
+    sad16_x4_altivec<16, 64>(pix1+48, pix2+48, pix3+48, pix4+48, pix5+48, frefstride, res);
+    res[0] = sum[0]+sum[4]+sum[8]+res[0];
+    res[1] = sum[1]+sum[5]+sum[9]+res[1];
+    res[2] = sum[2]+sum[6]+sum[10]+res[2];
+    res[3] = sum[3]+sum[7]+sum[11]+res[3];
+    //printf("<64 64>%d %d %d %d\n", res[0], res[1], res[2], res[3]);
+}
+
+
+/***********************************************************************
+ * SATD routines - altivec implementation
+ **********************************************************************/
+#define HADAMARD4_VEC(s0, s1, s2, s3, d0, d1, d2, d3) \
+{\
+    vec_s16_t t0, t1, t2, t3;\
+    t0 = vec_add(s0, s1);\
+    t1 = vec_sub(s0, s1);\
+    t2 = vec_add(s2, s3);\
+    t3 = vec_sub(s2, s3);\
+    d0 = vec_add(t0, t2);\
+    d2 = vec_sub(t0, t2);\
+    d1 = vec_add(t1, t3);\
+    d3 = vec_sub(t1, t3);\
+}
+
+#define VEC_TRANSPOSE_4(a0,a1,a2,a3,b0,b1,b2,b3) \
+    b0 = vec_mergeh( a0, a0 ); \
+    b1 = vec_mergeh( a1, a0 ); \
+    b2 = vec_mergeh( a2, a0 ); \
+    b3 = vec_mergeh( a3, a0 ); \
+    a0 = vec_mergeh( b0, b2 ); \
+    a1 = vec_mergel( b0, b2 ); \
+    a2 = vec_mergeh( b1, b3 ); \
+    a3 = vec_mergel( b1, b3 ); \
+    b0 = vec_mergeh( a0, a2 ); \
+    b1 = vec_mergel( a0, a2 ); \
+    b2 = vec_mergeh( a1, a3 ); \
+    b3 = vec_mergel( a1, a3 )
+
+#define VEC_TRANSPOSE_8(a0,a1,a2,a3,a4,a5,a6,a7,b0,b1,b2,b3,b4,b5,b6,b7) \
+    b0 = vec_mergeh( a0, a4 ); \
+    b1 = vec_mergel( a0, a4 ); \
+    b2 = vec_mergeh( a1, a5 ); \
+    b3 = vec_mergel( a1, a5 ); \
+    b4 = vec_mergeh( a2, a6 ); \
+    b5 = vec_mergel( a2, a6 ); \
+    b6 = vec_mergeh( a3, a7 ); \
+    b7 = vec_mergel( a3, a7 ); \
+    a0 = vec_mergeh( b0, b4 ); \
+    a1 = vec_mergel( b0, b4 ); \
+    a2 = vec_mergeh( b1, b5 ); \
+    a3 = vec_mergel( b1, b5 ); \
+    a4 = vec_mergeh( b2, b6 ); \
+    a5 = vec_mergel( b2, b6 ); \
+    a6 = vec_mergeh( b3, b7 ); \
+    a7 = vec_mergel( b3, b7 ); \
+    b0 = vec_mergeh( a0, a4 ); \
+    b1 = vec_mergel( a0, a4 ); \
+    b2 = vec_mergeh( a1, a5 ); \
+    b3 = vec_mergel( a1, a5 ); \
+    b4 = vec_mergeh( a2, a6 ); \
+    b5 = vec_mergel( a2, a6 ); \
+    b6 = vec_mergeh( a3, a7 ); \
+    b7 = vec_mergel( a3, a7 )
+
+int satd_4x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+{
+    ALIGN_VAR_16( int, sum );
+
+    LOAD_ZERO;
+    vec_s16_t pix1v, pix2v;
+    vec_s16_t diff0v, diff1v, diff2v, diff3v;
+    vec_s16_t temp0v, temp1v, temp2v, temp3v;
+    vec_s32_t satdv, satdv1, satdv2, satdv3;
+
+    pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+    pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+    diff0v = vec_sub( pix1v, pix2v );
+    pix1 += stride_pix1;
+    pix2 += stride_pix2;
+
+    pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+    pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+    diff1v = vec_sub( pix1v, pix2v );
+    pix1 += stride_pix1;
+    pix2 += stride_pix2;
+
+    pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+    pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+    diff2v = vec_sub( pix1v, pix2v );
+    pix1 += stride_pix1;
+    pix2 += stride_pix2;
+
+    pix1v = vec_u8_to_s16(vec_xl(0, pix1));
+    pix2v = vec_u8_to_s16(vec_xl(0, pix2) );
+    diff3v = vec_sub( pix1v, pix2v );
+    pix1 += stride_pix1;
+    pix2 += stride_pix2;
+
+    /* Hadamar H */
+    HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+    VEC_TRANSPOSE_4( temp0v, temp1v, temp2v, temp3v, diff0v, diff1v, diff2v, diff3v );
+    /* Hadamar V */
+    HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v);
+
+#if 1
+    temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) );
satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1 = vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2 = vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3 = vec_sum4s( temp3v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv += satdv2; + + satdv = vec_sum2s( satdv, zero_s32v ); + //satdv = vec_splat( satdv, 1 ); + //vec_ste( satdv, 0, &sum ); + sum = vec_extract(satdv, 1); + //print(sum); +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + satdv = vec_sum2s( satdv, zero_s32v ); + //satdv = vec_splat( satdv, 1 ); + //vec_ste( satdv, 0, &sum ); + sum = vec_extract(satdv, 1); + //print(sum); +#endif + return sum >> 1; +} + +#define HADAMARD4_x2vec(v_out0, v_out1, v_in0, v_in1, v_perm_l0_0, v_perm_l0_1) \ +{ \ + \ + vector unsigned int v_l0_input_0, v_l0_input_1 ; \ + v_l0_input_0 = vec_perm((vector unsigned int)v_in0, (vector unsigned int)v_in1, v_perm_l0_0) ; \ + v_l0_input_1 = vec_perm((vector unsigned int)v_in0, (vector unsigned int)v_in1, v_perm_l0_1) ; \ + \ + vector unsigned int v_l0_add_result, v_l0_sub_result ; \ + v_l0_add_result = vec_add(v_l0_input_0, v_l0_input_1) ; \ + v_l0_sub_result = vec_sub(v_l0_input_0, v_l0_input_1) ; \ + \ + vector unsigned char v_perm_l1_0 = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17} ; \ + vector unsigned char v_perm_l1_1 = {0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0xF, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F} ; \ + \ + vector unsigned int v_l1_input_0, v_l1_input_1 ; \ + v_l1_input_0 = vec_perm(v_l0_add_result, v_l0_sub_result, v_perm_l1_0) ; \ + v_l1_input_1 = vec_perm(v_l0_add_result, v_l0_sub_result, v_perm_l1_1) ; \ + \ + vector unsigned int v_l1_add_result, v_l1_sub_result ; \ + v_l1_add_result = vec_add(v_l1_input_0, v_l1_input_1) ; \ + v_l1_sub_result = vec_sub(v_l1_input_0, v_l1_input_1) ; \ + \ + \ + v_out0 = v_l1_add_result ; \ + v_out1 = v_l1_sub_result ; \ +\ +\ +} + +int satd_4x8_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diff0v, diff1v, diff2v, diff3v; + vec_s16_t temp0v, temp1v, temp2v, temp3v; + vec_s32_t satdv, satdv1, satdv2, satdv3;; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + /* Hadamar H */ + HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v); + VEC_TRANSPOSE_4( temp0v, 
temp1v, temp2v, temp3v, diff0v, diff1v, diff2v, diff3v ); + /* Hadamar V */ + HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv += satdv2; +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); +#endif + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + /* Hadamar H */ + HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v); + VEC_TRANSPOSE_4( temp0v, temp1v, temp2v, temp3v, diff0v, diff1v, diff2v, diff3v ); + /* Hadamar V */ + HADAMARD4_VEC(diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv += vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1 = vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2 = vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3 = vec_sum4s( temp3v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv += satdv2; + + satdv = vec_sum2s( satdv, zero_s32v ); + sum = vec_extract(satdv, 1); +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + satdv = vec_sum2s( satdv, zero_s32v ); + satdv = vec_splat( satdv, 1 ); + vec_ste( satdv, 0, &sum ); +#endif + return sum >> 1; +} + +#if 1 +static int satd_8x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + const vector signed short v_unsigned_short_mask = {0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF, 0x00FF} ; + vector unsigned char v_pix1_ub, v_pix2_ub ; + vector signed short v_pix1_ss, v_pix2_ss ; + vector signed short v_sub ; + vector signed int v_sub_sw_0, v_sub_sw_1 ; + vector signed int v_packed_sub_0, 
v_packed_sub_1 ; + vector unsigned int v_hadamard_result_0, v_hadamard_result_1, v_hadamard_result_2, v_hadamard_result_3 ; + + // for (int i = 0; i < 4; i+=2, pix1 += 2*stride_pix1, pix2 += 2*stride_pix2) + // { + //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM); + //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM); + //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM); + //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM); + + // Load 16 elements from each pix array + v_pix1_ub = vec_xl(0, pix1) ; + v_pix2_ub = vec_xl(0, pix2) ; + + // We only care about the top 8, and in short format + v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ; + v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ; + + // Undo the sign extend of the unpacks + v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ; + v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ; + + // Peform the subtraction + v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ; + + // Unpack the sub results into ints + v_sub_sw_0 = vec_unpackh(v_sub) ; + v_sub_sw_1 = vec_unpackl(v_sub) ; + v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ; + + // Add the int sub results (compatibility with the original code) + v_packed_sub_0 = vec_add(v_sub_sw_0, v_sub_sw_1) ; + + //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM); + //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM); + //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM); + //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM); + + // Load 16 elements from each pix array + v_pix1_ub = vec_xl(stride_pix1, pix1) ; + v_pix2_ub = vec_xl(stride_pix2, pix2) ; + + // We only care about the top 8, and in short format + v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ; + v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ; + + // Undo the sign extend of the unpacks + v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ; + v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ; + + // Peform the subtraction + v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ; + + // Unpack the sub results into ints + v_sub_sw_0 = vec_unpackh(v_sub) ; + v_sub_sw_1 = vec_unpackl(v_sub) ; + v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ; + + // Add the int sub results (compatibility with the original code) + v_packed_sub_1 = vec_add(v_sub_sw_0, v_sub_sw_1) ; + + // original: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3); + // modified while vectorizing: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], v_packed_sub_0[0], v_packed_sub_0[1], v_packed_sub_0[2], v_packed_sub_0[3]); + + // original: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], a0, a1, a2, a3); + // modified while vectorizing: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], v_packed_sub_1[0], v_packed_sub_1[1], v_packed_sub_1[2], v_packed_sub_1[3]); + + // Go after two hadamard4(int) at once, fully utilizing the vector width + // Note that the hadamard4(int) provided by x264/x265 is actually two hadamard4(short) simultaneously + const vector unsigned char v_perm_l0_0 = {0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13, 0x08, 0x09, 0x0A, 0x0B, 0x18, 0x19, 0x1A, 0x1B} ; + const vector unsigned char v_perm_l0_1 = {0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17, 0x0C, 0x0D, 0x0E, 0x0F, 0x1C, 0x1D, 0x1E, 0x1F} ; + HADAMARD4_x2vec(v_hadamard_result_0, v_hadamard_result_1, 
v_packed_sub_0, v_packed_sub_1, v_perm_l0_0, v_perm_l0_1) ; + + //## + // tmp[0][0] = v_hadamard_result_0[0] ; + // tmp[0][1] = v_hadamard_result_0[2] ; + // tmp[0][2] = v_hadamard_result_1[0] ; + // tmp[0][3] = v_hadamard_result_1[2] ; + + // tmp[1][0] = v_hadamard_result_0[1] ; + // tmp[1][1] = v_hadamard_result_0[3] ; + // tmp[1][2] = v_hadamard_result_1[1] ; + // tmp[1][3] = v_hadamard_result_1[3] ; + //## + + //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM); + //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM); + //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM); + //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM); + + // Load 16 elements from each pix array + v_pix1_ub = vec_xl(2*stride_pix1, pix1) ; + v_pix2_ub = vec_xl(2*stride_pix1, pix2) ; + + // We only care about the top 8, and in short format + v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ; + v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ; + + // Undo the sign extend of the unpacks + v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ; + v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ; + + // Peform the subtraction + v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ; + + // Unpack the sub results into ints + v_sub_sw_0 = vec_unpackh(v_sub) ; + v_sub_sw_1 = vec_unpackl(v_sub) ; + v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ; + + // Add the int sub results (compatibility with the original code) + v_packed_sub_0 = vec_add(v_sub_sw_0, v_sub_sw_1) ; + + //a0 = (pix1[0] - pix2[0]) + ((sum2_t)(pix1[4] - pix2[4]) << BITS_PER_SUM); + //a1 = (pix1[1] - pix2[1]) + ((sum2_t)(pix1[5] - pix2[5]) << BITS_PER_SUM); + //a2 = (pix1[2] - pix2[2]) + ((sum2_t)(pix1[6] - pix2[6]) << BITS_PER_SUM); + //a3 = (pix1[3] - pix2[3]) + ((sum2_t)(pix1[7] - pix2[7]) << BITS_PER_SUM); + + // Load 16 elements from each pix array + v_pix1_ub = vec_xl(3*stride_pix1, pix1) ; + v_pix2_ub = vec_xl(3*stride_pix2, pix2) ; + + // We only care about the top 8, and in short format + v_pix1_ss = vec_unpackh((vector signed char)v_pix1_ub) ; + v_pix2_ss = vec_unpackh((vector signed char)v_pix2_ub) ; + + // Undo the sign extend of the unpacks + v_pix1_ss = vec_and(v_pix1_ss, v_unsigned_short_mask) ; + v_pix2_ss = vec_and(v_pix2_ss, v_unsigned_short_mask) ; + + // Peform the subtraction + v_sub = vec_sub(v_pix1_ss, v_pix2_ss) ; + + // Unpack the sub results into ints + v_sub_sw_0 = vec_unpackh(v_sub) ; + v_sub_sw_1 = vec_unpackl(v_sub) ; + v_sub_sw_1 = vec_sl(v_sub_sw_1, (vector unsigned int){16,16,16,16}) ; + + // Add the int sub results (compatibility with the original code) + v_packed_sub_1 = vec_add(v_sub_sw_0, v_sub_sw_1) ; + + + // original: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3); + // modified while vectorizing: HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], v_packed_sub_0[0], v_packed_sub_0[1], v_packed_sub_0[2], v_packed_sub_0[3]); + + // original: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], a0, a1, a2, a3); + // modified while vectorizing: HADAMARD4(tmp[i+1][0], tmp[i+1][1], tmp[i+1][2], tmp[i+1][3], v_packed_sub_1[0], v_packed_sub_1[1], v_packed_sub_1[2], v_packed_sub_1[3]); + + // Go after two hadamard4(int) at once, fully utilizing the vector width + // Note that the hadamard4(int) provided by x264/x265 is actually two hadamard4(short) simultaneously + HADAMARD4_x2vec(v_hadamard_result_2, v_hadamard_result_3, v_packed_sub_0, v_packed_sub_1, v_perm_l0_0, v_perm_l0_1) ; 
+ + //## + //## tmp[2][0] = v_hadamard_result_2[0] ; + //## tmp[2][1] = v_hadamard_result_2[2] ; + //## tmp[2][2] = v_hadamard_result_3[0] ; + //## tmp[2][3] = v_hadamard_result_3[2] ; + //## + //## tmp[3][0] = v_hadamard_result_2[1] ; + //## tmp[3][1] = v_hadamard_result_2[3] ; + //## tmp[3][2] = v_hadamard_result_3[1] ; + //## tmp[3][3] = v_hadamard_result_3[3] ; + //## + // } + // for (int i = 0; i < 4; i++) + // { + // HADAMARD4(a0, a1, a2, a3, tmp[0][0], tmp[1][0], tmp[2][0], tmp[3][0]); + // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3); + + // HADAMARD4(a0, a1, a2, a3, tmp[0][1], tmp[1][1], tmp[2][1], tmp[3][1]); + // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3); + const vector unsigned char v_lowerloop_perm_l0_0 = {0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0A, 0x0B, 0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1A, 0x1B} ; + const vector unsigned char v_lowerloop_perm_l0_1 = {0x04, 0x05, 0x06, 0x07, 0x0C, 0x0D, 0x0E, 0x0F, 0x14, 0x15, 0x16, 0x17, 0x1C, 0x1D, 0x1E, 0x1F} ; + HADAMARD4_x2vec(v_hadamard_result_0, v_hadamard_result_2, v_hadamard_result_0, v_hadamard_result_2, v_lowerloop_perm_l0_0, v_lowerloop_perm_l0_1) ; + + const vector unsigned int v_15 = {15, 15, 15, 15} ; + const vector unsigned int v_0x10001 = (vector unsigned int){ 0x10001, 0x10001, 0x10001, 0x10001 }; + const vector unsigned int v_0xffff = (vector unsigned int){ 0xffff, 0xffff, 0xffff, 0xffff }; + + + vector unsigned int v_hadamard_result_s_0 ; + v_hadamard_result_s_0 = vec_sra(v_hadamard_result_0, v_15) ; + v_hadamard_result_s_0 = vec_and(v_hadamard_result_s_0, v_0x10001) ; + asm ("vmuluwm %0,%1,%2" + : "=v" (v_hadamard_result_s_0) + : "v" (v_hadamard_result_s_0) , "v" (v_0xffff) + ) ; + v_hadamard_result_0 = vec_add(v_hadamard_result_0, v_hadamard_result_s_0) ; + v_hadamard_result_0 = vec_xor(v_hadamard_result_0, v_hadamard_result_s_0) ; + + vector unsigned int v_hadamard_result_s_2 ; + v_hadamard_result_s_2 = vec_sra(v_hadamard_result_2, v_15) ; + v_hadamard_result_s_2 = vec_and(v_hadamard_result_s_2, v_0x10001) ; + asm ("vmuluwm %0,%1,%2" + : "=v" (v_hadamard_result_s_2) + : "v" (v_hadamard_result_s_2) , "v" (v_0xffff) + ) ; + v_hadamard_result_2 = vec_add(v_hadamard_result_2, v_hadamard_result_s_2) ; + v_hadamard_result_2 = vec_xor(v_hadamard_result_2, v_hadamard_result_s_2) ; + + // HADAMARD4(a0, a1, a2, a3, tmp[0][2], tmp[1][2], tmp[2][2], tmp[3][2]); + // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3); + + // HADAMARD4(a0, a1, a2, a3, tmp[0][3], tmp[1][3], tmp[2][3], tmp[3][3]); + // sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3); + + HADAMARD4_x2vec(v_hadamard_result_1, v_hadamard_result_3, v_hadamard_result_1, v_hadamard_result_3, v_lowerloop_perm_l0_0, v_lowerloop_perm_l0_1) ; + + vector unsigned int v_hadamard_result_s_1 ; + v_hadamard_result_s_1 = vec_sra(v_hadamard_result_1, v_15) ; + v_hadamard_result_s_1 = vec_and(v_hadamard_result_s_1, v_0x10001) ; + asm ("vmuluwm %0,%1,%2" + : "=v" (v_hadamard_result_s_1) + : "v" (v_hadamard_result_s_1) , "v" (v_0xffff) + ) ; + v_hadamard_result_1 = vec_add(v_hadamard_result_1, v_hadamard_result_s_1) ; + v_hadamard_result_1 = vec_xor(v_hadamard_result_1, v_hadamard_result_s_1) ; + + vector unsigned int v_hadamard_result_s_3 ; + v_hadamard_result_s_3 = vec_sra(v_hadamard_result_3, v_15) ; + v_hadamard_result_s_3 = vec_and(v_hadamard_result_s_3, v_0x10001) ; + asm ("vmuluwm %0,%1,%2" + : "=v" (v_hadamard_result_s_3) + : "v" (v_hadamard_result_s_3) , "v" (v_0xffff) + ) ; + v_hadamard_result_3 = vec_add(v_hadamard_result_3, v_hadamard_result_s_3) ; + 
v_hadamard_result_3 = vec_xor(v_hadamard_result_3, v_hadamard_result_s_3) ; + +// } + + + vector unsigned int v_sum_0, v_sum_1 ; + vector signed int v_sum ; + + v_sum_0 = vec_add(v_hadamard_result_0, v_hadamard_result_2) ; + v_sum_1 = vec_add(v_hadamard_result_1, v_hadamard_result_3) ; + + v_sum_0 = vec_add(v_sum_0, v_sum_1) ; + + vector signed int v_zero = {0, 0, 0, 0} ; + v_sum = vec_sums((vector signed int)v_sum_0, v_zero) ; + + // return (((sum_t)sum) + (sum >> BITS_PER_SUM)) >> 1; + return (((sum_t)v_sum[3]) + (v_sum[3] >> BITS_PER_SUM)) >> 1; +} +#else +int satd_8x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v, temp4v, temp5v, temp6v, temp7v; + vec_s32_t satdv; + + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + //HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v ); + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, 
satdv ); + + satdv = vec_sums( satdv, zero_s32v ); + satdv = vec_splat( satdv, 3 ); + vec_ste( satdv, 0, &sum ); + + //print(sum); + return sum>>1; +} +#endif + +int satd_8x8_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v, temp4v, temp5v, temp6v, temp7v; + vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7; + //vec_s32_t satdv=(vec_s32_t){0,0,0,0}; + + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v ); + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; + + satdv = vec_sums( satdv, zero_s32v ); + sum = vec_extract(satdv, 3); +#else + temp0v = vec_max( temp0v, vec_sub( 
zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); + + satdv = vec_sums( satdv, zero_s32v ); + satdv = vec_splat( satdv, 3 ); + vec_ste( satdv, 0, &sum ); +#endif + return sum>>1; +} + +int satd_8x16_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v, temp4v, temp5v, temp6v, temp7v; + //vec_s32_t satdv=(vec_s32_t){0,0,0,0}; + vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7; + + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v ); + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v 
) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); +#endif + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16(vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += stride_pix1; + pix2 += stride_pix2; + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v ); + + HADAMARD4_VEC( diff0v, diff1v, diff2v, diff3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diff4v, diff5v, diff6v, diff7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv += vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( 
temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; + + satdv = vec_sums( satdv, zero_s32v ); + sum = vec_extract(satdv, 3); +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); + + satdv = vec_sums( satdv, zero_s32v ); + satdv = vec_splat( satdv, 3 ); + vec_ste( satdv, 0, &sum ); +#endif + return sum >> 1; +} + +#define VEC_DIFF_S16(p1,i1,p2,i2,dh,dl)\ +{\ + pix1v = (vec_s16_t)vec_xl(0, p1);\ + temp0v = vec_u8_to_s16_h( pix1v );\ + temp1v = vec_u8_to_s16_l( pix1v );\ + pix2v = (vec_s16_t)vec_xl(0, p2);\ + temp2v = vec_u8_to_s16_h( pix2v );\ + temp3v = vec_u8_to_s16_l( pix2v );\ + dh = vec_sub( temp0v, temp2v );\ + dl = vec_sub( temp1v, temp3v );\ + p1 += i1;\ + p2 += i2;\ +} + + +int satd_16x4_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + LOAD_ZERO; + //vec_s32_t satdv=(vec_s32_t){0,0,0,0}; + vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7; + vec_s16_t pix1v, pix2v; + vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v; + vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v; + vec_s16_t temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v; + + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v); + + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffh0v, diffh1v, diffh2v, diffh3v, + diffl0v, diffl1v, diffl2v, diffl3v); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = 
vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; + + satdv = vec_sums( satdv, zero_s32v ); + sum = vec_extract(satdv, 3); +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); + + satdv = vec_sums( satdv, zero_s32v ); + satdv = vec_splat( satdv, 3 ); + vec_ste( satdv, 0, &sum ); +#endif + return sum >> 1; +} + +int satd_16x8_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + LOAD_ZERO; + //vec_s32_t satdv=(vec_s32_t){0,0,0,0}; + vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7; + vec_s16_t pix1v, pix2v; + vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v, + diffh4v, diffh5v, diffh6v, diffh7v; + vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v; + + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh4v, diffl4v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh5v, diffl5v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh6v, diffl6v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh7v, diffl7v); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffh0v, diffh1v, diffh2v, diffh3v, + diffh4v, diffh5v, diffh6v, diffh7v ); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= 
vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); +#endif + + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v ); + + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv += vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; + + satdv = vec_sums( satdv, zero_s32v ); + sum = vec_extract(satdv, 3); +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= 
vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); + + satdv = vec_sums( satdv, zero_s32v ); + satdv = vec_splat( satdv, 3 ); + vec_ste( satdv, 0, &sum ); +#endif + return sum >> 1; +} + +int satd_16x16_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + ALIGN_VAR_16( int, sum ); + LOAD_ZERO; + //vec_s32_t satdv=(vec_s32_t){0,0,0,0}; + vec_s32_t satdv, satdv1, satdv2, satdv3, satdv4, satdv5, satdv6, satdv7; + vec_s16_t pix1v, pix2v; + vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v, + diffh4v, diffh5v, diffh6v, diffh7v; + vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v; + + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh4v, diffl4v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh5v, diffl5v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh6v, diffl6v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh7v, diffl7v); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffh0v, diffh1v, diffh2v, diffh3v, + diffh4v, diffh5v, diffh6v, diffh7v ); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + 
temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); +#endif + + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v ); + + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv += vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; + +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); +#endif + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh3v, diffl3v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh4v, diffl4v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh5v, diffl5v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh6v, diffl6v); + VEC_DIFF_S16(pix1,stride_pix1,pix2,stride_pix2,diffh7v, diffl7v); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffh0v, diffh1v, diffh2v, diffh3v, + 
diffh4v, diffh5v, diffh6v, diffh7v ); + + HADAMARD4_VEC( diffh0v, diffh1v, diffh2v, diffh3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffh4v, diffh5v, diffh6v, diffh7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv += vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); +#endif + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v ); + + VEC_TRANSPOSE_8( temp0v, temp1v, temp2v, temp3v, + temp4v, temp5v, temp6v, temp7v, + diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v ); + + HADAMARD4_VEC( diffl0v, diffl1v, diffl2v, diffl3v, temp0v, temp1v, temp2v, temp3v ); + HADAMARD4_VEC( diffl4v, diffl5v, diffl6v, diffl7v, temp4v, temp5v, temp6v, temp7v ); + +#if 1 + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv += vec_sum4s( temp0v, zero_s32v); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv1= vec_sum4s( temp1v, zero_s32v ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv2= vec_sum4s( temp2v, zero_s32v ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv3= vec_sum4s( temp3v, zero_s32v ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv4 = vec_sum4s( temp4v, zero_s32v); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv5= vec_sum4s( temp5v, zero_s32v ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv6= vec_sum4s( temp6v, zero_s32v ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv7= vec_sum4s( temp7v, zero_s32v ); + + satdv += satdv1; + satdv2 += satdv3; + satdv4 += satdv5; + satdv6 += satdv7; + + satdv += satdv2; + satdv4 += satdv6; + satdv += satdv4; + + satdv = vec_sums( satdv, zero_s32v ); + sum = 
vec_extract(satdv, 3); +#else + temp0v = vec_max( temp0v, vec_sub( zero_s16v, temp0v ) ); + satdv = vec_sum4s( temp0v, satdv); + + temp1v = vec_max( temp1v, vec_sub( zero_s16v, temp1v ) ); + satdv= vec_sum4s( temp1v, satdv ); + + temp2v = vec_max( temp2v, vec_sub( zero_s16v, temp2v ) ); + satdv= vec_sum4s( temp2v, satdv ); + + temp3v = vec_max( temp3v, vec_sub( zero_s16v, temp3v ) ); + satdv= vec_sum4s( temp3v, satdv ); + + temp4v = vec_max( temp4v, vec_sub( zero_s16v, temp4v ) ); + satdv = vec_sum4s( temp4v, satdv); + + temp5v = vec_max( temp5v, vec_sub( zero_s16v, temp5v ) ); + satdv= vec_sum4s( temp5v, satdv ); + + temp6v = vec_max( temp6v, vec_sub( zero_s16v, temp6v ) ); + satdv= vec_sum4s( temp6v, satdv ); + + temp7v = vec_max( temp7v, vec_sub( zero_s16v, temp7v ) ); + satdv= vec_sum4s( temp7v, satdv ); + + satdv = vec_sums( satdv, zero_s32v ); + satdv = vec_splat( satdv, 3 ); + vec_ste( satdv, 0, &sum ); +#endif + return sum >> 1; +} + + +template<int w, int h> +int satd_altivec(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2); + +template<> +int satd_altivec<4, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_4x4_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<4, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<4, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_4x4_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+4*stride_pix1, stride_pix1, pix2+4*stride_pix2, stride_pix2); + + return satd; +} + +template<> +int satd_altivec<4, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2); + + return satd; +} + +template<> +int satd_altivec<4, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2) + + satd_4x8_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2); + + return satd; +} + +template<> +int satd_altivec<4, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_4x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2) + + satd_4x8_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2) + + satd_4x8_altivec(pix1+24*stride_pix1, stride_pix1, pix2+24*stride_pix2, stride_pix2); + + return satd; +} + +template<> +int satd_altivec<4, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_altivec<4, 32>(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<4, 32>(pix1+32*stride_pix1, stride_pix1, pix2+32*stride_pix2, stride_pix2); + + return satd; +} + +template<> +int satd_altivec<8, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_8x4_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<8, 8>(const pixel* pix1, 
intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<8, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x4_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<8,16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<8,24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x16_altivec(pix1+8*stride_pix1, stride_pix1, pix2+8*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<8,32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<8,64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2) + + satd_8x16_altivec(pix1+32*stride_pix1, stride_pix1, pix2+32*stride_pix2, stride_pix2) + + satd_8x16_altivec(pix1+48*stride_pix1, stride_pix1, pix2+48*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_8x4_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x4_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2); + + satd_8x4_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_4x4_altivec(pix3+8, stride_pix1, pix4+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd = satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2) + + satd_8x8_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_4x8_altivec(pix3+8, stride_pix1, pix4+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd 
= satd_8x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_4x8_altivec(pix1+8, stride_pix1, pix2+8, stride_pix2) + + satd_8x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_altivec<4, 16>(pix3+8, stride_pix1, pix4+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<4, 16>(pix1+8, stride_pix1, pix2+8, stride_pix2) + + satd_8x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_altivec<4, 16>(pix3+8, stride_pix1, pix4+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<12, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + const pixel *pix7 = pix1 + 48*stride_pix1; + const pixel *pix8 = pix2 + 48*stride_pix2; + satd = satd_8x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<4, 16>(pix1+8, stride_pix1, pix2+8, stride_pix2) + + satd_8x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_altivec<4, 16>(pix3+8, stride_pix1, pix4+8, stride_pix2) + + satd_8x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_altivec<4, 16>(pix5+8, stride_pix1, pix6+8, stride_pix2) + + satd_8x16_altivec(pix7, stride_pix1, pix8, stride_pix2) + + satd_altivec<4, 16>(pix7+8, stride_pix1, pix8+8, stride_pix2); + return satd; +} + +template<> +int satd_altivec<16, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<16, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<16, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1+4*stride_pix1, stride_pix1, pix2+4*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<16, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + return satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2); +} + +template<> +int satd_altivec<16, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<16, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<16, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2) + + 
satd_16x16_altivec(pix1+32*stride_pix1, stride_pix1, pix2+32*stride_pix2, stride_pix2) + + satd_16x16_altivec(pix1+48*stride_pix1, stride_pix1, pix2+48*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x4_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x4_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_8x4_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_altivec<24, 16>(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<24, 8>(pix1+16*stride_pix1, stride_pix1, pix2+16*stride_pix2, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_8x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<24, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + const pixel *pix7 = pix1 + 48*stride_pix1; + const pixel *pix8 = pix2 + 48*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_8x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_8x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_8x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2) + + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2) + + satd_8x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2) + + 
satd_16x4_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2) + + satd_16x4_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x4_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2) + + satd_16x8_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x8_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 48>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1 + 16, stride_pix1, pix2 + 16, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3 + 16, stride_pix1, pix4 + 16, stride_pix2) + + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<32, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + const pixel *pix7 = pix1 + 48*stride_pix1; + const pixel *pix8 = pix2 + 48*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, 
stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2) + + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2) + + satd_16x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x4_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x4_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x4_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x8_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x8_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + +satd_16x4_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x4_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x4_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 8*stride_pix1; + const pixel *pix4 = pix2 + 8*stride_pix2; + satd = satd_16x8_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x8_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x8_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + +satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + +satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<48, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = 
pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + const pixel *pix7 = pix1 + 48*stride_pix1; + const pixel *pix8 = pix2 + 48*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + +satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2) + +satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2) + + satd_16x16_altivec(pix5+32, stride_pix1, pix6+32, stride_pix2) + +satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2) + + satd_16x16_altivec(pix7+16, stride_pix1,pix8+16, stride_pix2) + + satd_16x16_altivec(pix7+32, stride_pix1, pix8+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 4>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_altivec<32, 4>(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<32, 4>(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 8>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_altivec<32, 8>(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<32, 8>(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 12>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_altivec<32, 12>(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<32, 12>(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 16>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 24>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + satd = satd_altivec<32, 24>(pix1, stride_pix1, pix2, stride_pix2) + + satd_altivec<32, 24>(pix1+32, stride_pix1, pix2+32, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 32>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2) + + satd_16x16_altivec(pix3+48, stride_pix1, pix4+48, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 48>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = 
pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2) + + satd_16x16_altivec(pix3+48, stride_pix1, pix4+48, stride_pix2) + + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2) + + satd_16x16_altivec(pix5+32, stride_pix1, pix6+32, stride_pix2) + + satd_16x16_altivec(pix5+48, stride_pix1, pix6+48, stride_pix2); + return satd; +} + +template<> +int satd_altivec<64, 64>(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2) +{ + int satd = 0; + const pixel *pix3 = pix1 + 16*stride_pix1; + const pixel *pix4 = pix2 + 16*stride_pix2; + const pixel *pix5 = pix1 + 32*stride_pix1; + const pixel *pix6 = pix2 + 32*stride_pix2; + const pixel *pix7 = pix1 + 48*stride_pix1; + const pixel *pix8 = pix2 + 48*stride_pix2; + satd = satd_16x16_altivec(pix1, stride_pix1, pix2, stride_pix2) + + satd_16x16_altivec(pix1+16, stride_pix1, pix2+16, stride_pix2) + + satd_16x16_altivec(pix1+32, stride_pix1, pix2+32, stride_pix2) + + satd_16x16_altivec(pix1+48, stride_pix1, pix2+48, stride_pix2) + + satd_16x16_altivec(pix3, stride_pix1, pix4, stride_pix2) + + satd_16x16_altivec(pix3+16, stride_pix1, pix4+16, stride_pix2) + + satd_16x16_altivec(pix3+32, stride_pix1, pix4+32, stride_pix2) + + satd_16x16_altivec(pix3+48, stride_pix1, pix4+48, stride_pix2) + + satd_16x16_altivec(pix5, stride_pix1, pix6, stride_pix2) + + satd_16x16_altivec(pix5+16, stride_pix1, pix6+16, stride_pix2) + + satd_16x16_altivec(pix5+32, stride_pix1, pix6+32, stride_pix2) + + satd_16x16_altivec(pix5+48, stride_pix1, pix6+48, stride_pix2) + + satd_16x16_altivec(pix7, stride_pix1, pix8, stride_pix2) + + satd_16x16_altivec(pix7+16, stride_pix1, pix8+16, stride_pix2) + + satd_16x16_altivec(pix7+32, stride_pix1, pix8+32, stride_pix2) + + satd_16x16_altivec(pix7+48, stride_pix1, pix8+48, stride_pix2); + return satd; +} + + +/*********************************************************************** + * SA8D routines - altivec implementation + **********************************************************************/ +#define SA8D_1D_ALTIVEC( sa8d0v, sa8d1v, sa8d2v, sa8d3v, \ + sa8d4v, sa8d5v, sa8d6v, sa8d7v ) \ +{ \ + /* int a0 = SRC(0) + SRC(4) */ \ + vec_s16_t a0v = vec_add(sa8d0v, sa8d4v); \ + /* int a4 = SRC(0) - SRC(4) */ \ + vec_s16_t a4v = vec_sub(sa8d0v, sa8d4v); \ + /* int a1 = SRC(1) + SRC(5) */ \ + vec_s16_t a1v = vec_add(sa8d1v, sa8d5v); \ + /* int a5 = SRC(1) - SRC(5) */ \ + vec_s16_t a5v = vec_sub(sa8d1v, sa8d5v); \ + /* int a2 = SRC(2) + SRC(6) */ \ + vec_s16_t a2v = vec_add(sa8d2v, sa8d6v); \ + /* int a6 = SRC(2) - SRC(6) */ \ + vec_s16_t a6v = vec_sub(sa8d2v, sa8d6v); \ + /* int a3 = SRC(3) + SRC(7) */ \ + vec_s16_t a3v = vec_add(sa8d3v, sa8d7v); \ + /* int a7 = SRC(3) - SRC(7) */ \ + vec_s16_t a7v = vec_sub(sa8d3v, sa8d7v); \ + \ + /* int b0 = a0 + a2 */ \ + vec_s16_t b0v = vec_add(a0v, a2v); \ + /* int b2 = a0 - a2; */ \ + vec_s16_t b2v = vec_sub(a0v, a2v); \ + /* int b1 = a1 + a3; */ \ + 
vec_s16_t b1v = vec_add(a1v, a3v); \ + /* int b3 = a1 - a3; */ \ + vec_s16_t b3v = vec_sub(a1v, a3v); \ + /* int b4 = a4 + a6; */ \ + vec_s16_t b4v = vec_add(a4v, a6v); \ + /* int b6 = a4 - a6; */ \ + vec_s16_t b6v = vec_sub(a4v, a6v); \ + /* int b5 = a5 + a7; */ \ + vec_s16_t b5v = vec_add(a5v, a7v); \ + /* int b7 = a5 - a7; */ \ + vec_s16_t b7v = vec_sub(a5v, a7v); \ + \ + /* DST(0, b0 + b1) */ \ + sa8d0v = vec_add(b0v, b1v); \ + /* DST(1, b0 - b1) */ \ + sa8d1v = vec_sub(b0v, b1v); \ + /* DST(2, b2 + b3) */ \ + sa8d2v = vec_add(b2v, b3v); \ + /* DST(3, b2 - b3) */ \ + sa8d3v = vec_sub(b2v, b3v); \ + /* DST(4, b4 + b5) */ \ + sa8d4v = vec_add(b4v, b5v); \ + /* DST(5, b4 - b5) */ \ + sa8d5v = vec_sub(b4v, b5v); \ + /* DST(6, b6 + b7) */ \ + sa8d6v = vec_add(b6v, b7v); \ + /* DST(7, b6 - b7) */ \ + sa8d7v = vec_sub(b6v, b7v); \ +} + +inline int sa8d_8x8_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sum); + + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v; + vec_s16_t sa8d0v, sa8d1v, sa8d2v, sa8d3v, sa8d4v, sa8d5v, sa8d6v, sa8d7v; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + + SA8D_1D_ALTIVEC(diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v); + VEC_TRANSPOSE_8(diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v, + sa8d0v, sa8d1v, sa8d2v, sa8d3v, + sa8d4v, sa8d5v, sa8d6v, sa8d7v ); + SA8D_1D_ALTIVEC(sa8d0v, sa8d1v, sa8d2v, sa8d3v, + sa8d4v, sa8d5v, sa8d6v, sa8d7v ); + + /* accumulation of the absolute value of all elements of the resulting bloc */ + vec_s16_t abs0v = vec_max( sa8d0v, vec_sub( zero_s16v, sa8d0v ) ); + vec_s16_t abs1v = vec_max( sa8d1v, vec_sub( zero_s16v, sa8d1v ) ); + vec_s16_t sum01v = vec_add(abs0v, abs1v); + + vec_s16_t abs2v = vec_max( sa8d2v, vec_sub( zero_s16v, sa8d2v ) ); + vec_s16_t abs3v = vec_max( sa8d3v, vec_sub( zero_s16v, sa8d3v ) ); + vec_s16_t sum23v = vec_add(abs2v, abs3v); + + vec_s16_t abs4v = vec_max( sa8d4v, vec_sub( zero_s16v, sa8d4v ) ); + vec_s16_t abs5v = vec_max( sa8d5v, vec_sub( zero_s16v, sa8d5v ) ); + vec_s16_t sum45v = vec_add(abs4v, abs5v); + + vec_s16_t abs6v = vec_max( sa8d6v, vec_sub( zero_s16v, sa8d6v ) ); + vec_s16_t abs7v = vec_max( sa8d7v, vec_sub( zero_s16v, sa8d7v ) ); + 
vec_s16_t sum67v = vec_add(abs6v, abs7v); + + vec_s16_t sum0123v = vec_add(sum01v, sum23v); + vec_s16_t sum4567v = vec_add(sum45v, sum67v); + + vec_s32_t sumblocv; + + sumblocv = vec_sum4s(sum0123v, (vec_s32_t)zerov ); + //print_vec_s("sum0123v", &sum0123v); + //print_vec_i("sumblocv = vec_sum4s(sum0123v, 0 )", &sumblocv); + sumblocv = vec_sum4s(sum4567v, sumblocv ); + //print_vec_s("sum4567v", &sum4567v); + //print_vec_i("sumblocv = vec_sum4s(sum4567v, sumblocv )", &sumblocv); + sumblocv = vec_sums(sumblocv, (vec_s32_t)zerov ); + //print_vec_i("sumblocv=vec_sums(sumblocv,0 )", &sumblocv); + sumblocv = vec_splat(sumblocv, 3); + //print_vec_i("sumblocv = vec_splat(sumblocv, 3)", &sumblocv); + vec_ste(sumblocv, 0, &sum); + + return (sum + 2) >> 2; +} + + +int sa8d_8x8_altivec(const int16_t* pix1, intptr_t i_pix1) +{ + int sum = 0; + return ((sum+2)>>2); +} + +inline int sa8d_8x16_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sum); + ALIGN_VAR_16(int, sum1); + + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diff0v, diff1v, diff2v, diff3v, diff4v, diff5v, diff6v, diff7v; + vec_s16_t sa8d0v, sa8d1v, sa8d2v, sa8d3v, sa8d4v, sa8d5v, sa8d6v, sa8d7v; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + + SA8D_1D_ALTIVEC(diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v); + VEC_TRANSPOSE_8(diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v, + sa8d0v, sa8d1v, sa8d2v, sa8d3v, + sa8d4v, sa8d5v, sa8d6v, sa8d7v ); + SA8D_1D_ALTIVEC(sa8d0v, sa8d1v, sa8d2v, sa8d3v, + sa8d4v, sa8d5v, sa8d6v, sa8d7v ); + + /* accumulation of the absolute value of all elements of the resulting bloc */ + vec_s16_t abs0v = vec_max( sa8d0v, vec_sub( zero_s16v, sa8d0v ) ); + vec_s16_t abs1v = vec_max( sa8d1v, vec_sub( zero_s16v, sa8d1v ) ); + vec_s16_t sum01v = vec_add(abs0v, abs1v); + + vec_s16_t abs2v = vec_max( sa8d2v, vec_sub( zero_s16v, sa8d2v ) ); + vec_s16_t abs3v = vec_max( sa8d3v, vec_sub( zero_s16v, sa8d3v ) ); + vec_s16_t sum23v = vec_add(abs2v, abs3v); + + vec_s16_t abs4v = vec_max( sa8d4v, vec_sub( zero_s16v, sa8d4v ) ); + vec_s16_t abs5v = vec_max( sa8d5v, vec_sub( zero_s16v, sa8d5v ) ); + vec_s16_t sum45v = vec_add(abs4v, abs5v); + + vec_s16_t abs6v = vec_max( sa8d6v, vec_sub( zero_s16v, sa8d6v ) ); + vec_s16_t abs7v = 
vec_max( sa8d7v, vec_sub( zero_s16v, sa8d7v ) ); + vec_s16_t sum67v = vec_add(abs6v, abs7v); + + vec_s16_t sum0123v = vec_add(sum01v, sum23v); + vec_s16_t sum4567v = vec_add(sum45v, sum67v); + + vec_s32_t sumblocv, sumblocv1; + + sumblocv = vec_sum4s(sum0123v, (vec_s32_t)zerov ); + sumblocv = vec_sum4s(sum4567v, sumblocv ); + sumblocv = vec_sums(sumblocv, (vec_s32_t)zerov ); + sumblocv = vec_splat(sumblocv, 3); + vec_ste(sumblocv, 0, &sum); + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff0v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff1v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff2v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff3v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff4v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff5v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff6v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + pix1v = vec_u8_to_s16(vec_xl(0, pix1)); + pix2v = vec_u8_to_s16( vec_xl(0, pix2) ); + diff7v = vec_sub( pix1v, pix2v ); + pix1 += i_pix1; + pix2 += i_pix2; + + + SA8D_1D_ALTIVEC(diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v); + VEC_TRANSPOSE_8(diff0v, diff1v, diff2v, diff3v, + diff4v, diff5v, diff6v, diff7v, + sa8d0v, sa8d1v, sa8d2v, sa8d3v, + sa8d4v, sa8d5v, sa8d6v, sa8d7v ); + SA8D_1D_ALTIVEC(sa8d0v, sa8d1v, sa8d2v, sa8d3v, + sa8d4v, sa8d5v, sa8d6v, sa8d7v ); + + /* accumulation of the absolute value of all elements of the resulting bloc */ + abs0v = vec_max( sa8d0v, vec_sub( zero_s16v, sa8d0v ) ); + abs1v = vec_max( sa8d1v, vec_sub( zero_s16v, sa8d1v ) ); + sum01v = vec_add(abs0v, abs1v); + + abs2v = vec_max( sa8d2v, vec_sub( zero_s16v, sa8d2v ) ); + abs3v = vec_max( sa8d3v, vec_sub( zero_s16v, sa8d3v ) ); + sum23v = vec_add(abs2v, abs3v); + + abs4v = vec_max( sa8d4v, vec_sub( zero_s16v, sa8d4v ) ); + abs5v = vec_max( sa8d5v, vec_sub( zero_s16v, sa8d5v ) ); + sum45v = vec_add(abs4v, abs5v); + + abs6v = vec_max( sa8d6v, vec_sub( zero_s16v, sa8d6v ) ); + abs7v = vec_max( sa8d7v, vec_sub( zero_s16v, sa8d7v ) ); + sum67v = vec_add(abs6v, abs7v); + + sum0123v = vec_add(sum01v, sum23v); + sum4567v = vec_add(sum45v, sum67v); + + sumblocv1 = vec_sum4s(sum0123v, (vec_s32_t)zerov ); + sumblocv1 = vec_sum4s(sum4567v, sumblocv1 ); + sumblocv1 = vec_sums(sumblocv1, (vec_s32_t)zerov ); + sumblocv1 = vec_splat(sumblocv1, 3); + vec_ste(sumblocv1, 0, &sum1); + + sum = (sum + 2) >> 2; + sum1 = (sum1 + 2) >> 2; + sum += sum1; + return (sum); +} + +inline int sa8d_16x8_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sumh); + ALIGN_VAR_16(int, suml); + + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v, + diffh4v, diffh5v, diffh6v, diffh7v; + vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v; + vec_s16_t sa8dh0v, sa8dh1v, 
sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v; + vec_s16_t sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v; + + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh3v, diffl3v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh4v, diffl4v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh5v, diffl5v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh6v, diffl6v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh7v, diffl7v); + + SA8D_1D_ALTIVEC(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v); + VEC_TRANSPOSE_8(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v, + sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v ); + SA8D_1D_ALTIVEC(sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v); + + SA8D_1D_ALTIVEC(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v); + VEC_TRANSPOSE_8(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v, + sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v ); + SA8D_1D_ALTIVEC(sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v); + + /* accumulation of the absolute value of all elements of the resulting bloc */ + sa8dh0v = vec_max( sa8dh0v, vec_sub( zero_s16v, sa8dh0v ) ); + sa8dh1v = vec_max( sa8dh1v, vec_sub( zero_s16v, sa8dh1v ) ); + vec_s16_t sumh01v = vec_add(sa8dh0v, sa8dh1v); + + sa8dh2v = vec_max( sa8dh2v, vec_sub( zero_s16v, sa8dh2v ) ); + sa8dh3v = vec_max( sa8dh3v, vec_sub( zero_s16v, sa8dh3v ) ); + vec_s16_t sumh23v = vec_add(sa8dh2v, sa8dh3v); + + sa8dh4v = vec_max( sa8dh4v, vec_sub( zero_s16v, sa8dh4v ) ); + sa8dh5v = vec_max( sa8dh5v, vec_sub( zero_s16v, sa8dh5v ) ); + vec_s16_t sumh45v = vec_add(sa8dh4v, sa8dh5v); + + sa8dh6v = vec_max( sa8dh6v, vec_sub( zero_s16v, sa8dh6v ) ); + sa8dh7v = vec_max( sa8dh7v, vec_sub( zero_s16v, sa8dh7v ) ); + vec_s16_t sumh67v = vec_add(sa8dh6v, sa8dh7v); + + vec_s16_t sumh0123v = vec_add(sumh01v, sumh23v); + vec_s16_t sumh4567v = vec_add(sumh45v, sumh67v); + + vec_s32_t sumblocv_h; + + sumblocv_h = vec_sum4s(sumh0123v, (vec_s32_t)zerov ); + //print_vec_s("sum0123v", &sum0123v); + //print_vec_i("sumblocv = vec_sum4s(sum0123v, 0 )", &sumblocv); + sumblocv_h = vec_sum4s(sumh4567v, sumblocv_h ); + //print_vec_s("sum4567v", &sum4567v); + //print_vec_i("sumblocv = vec_sum4s(sum4567v, sumblocv )", &sumblocv); + sumblocv_h = vec_sums(sumblocv_h, (vec_s32_t)zerov ); + //print_vec_i("sumblocv=vec_sums(sumblocv,0 )", &sumblocv); + sumblocv_h = vec_splat(sumblocv_h, 3); + //print_vec_i("sumblocv = vec_splat(sumblocv, 3)", &sumblocv); + vec_ste(sumblocv_h, 0, &sumh); + + sa8dl0v = vec_max( sa8dl0v, vec_sub( zero_s16v, sa8dl0v ) ); + sa8dl1v = vec_max( sa8dl1v, vec_sub( zero_s16v, sa8dl1v ) ); + vec_s16_t suml01v = vec_add(sa8dl0v, sa8dl1v); + + sa8dl2v = vec_max( sa8dl2v, vec_sub( zero_s16v, sa8dl2v ) ); + sa8dl3v = vec_max( sa8dl3v, vec_sub( zero_s16v, sa8dl3v ) ); + vec_s16_t suml23v = vec_add(sa8dl2v, sa8dl3v); + + sa8dl4v = vec_max( sa8dl4v, vec_sub( zero_s16v, sa8dl4v ) ); + sa8dl5v = vec_max( sa8dl5v, vec_sub( zero_s16v, sa8dl5v ) ); + vec_s16_t suml45v = vec_add(sa8dl4v, sa8dl5v); + + sa8dl6v = vec_max( sa8dl6v, vec_sub( zero_s16v, sa8dl6v ) ); + sa8dl7v = vec_max( sa8dl7v, vec_sub( zero_s16v, sa8dl7v ) ); + vec_s16_t suml67v = vec_add(sa8dl6v, 
sa8dl7v); + + vec_s16_t suml0123v = vec_add(suml01v, suml23v); + vec_s16_t suml4567v = vec_add(suml45v, suml67v); + + vec_s32_t sumblocv_l; + + sumblocv_l = vec_sum4s(suml0123v, (vec_s32_t)zerov ); + //print_vec_s("sum0123v", &sum0123v); + //print_vec_i("sumblocv = vec_sum4s(sum0123v, 0 )", &sumblocv); + sumblocv_l = vec_sum4s(suml4567v, sumblocv_l ); + //print_vec_s("sum4567v", &sum4567v); + //print_vec_i("sumblocv = vec_sum4s(sum4567v, sumblocv )", &sumblocv); + sumblocv_l = vec_sums(sumblocv_l, (vec_s32_t)zerov ); + //print_vec_i("sumblocv=vec_sums(sumblocv,0 )", &sumblocv); + sumblocv_l = vec_splat(sumblocv_l, 3); + //print_vec_i("sumblocv = vec_splat(sumblocv, 3)", &sumblocv); + vec_ste(sumblocv_l, 0, &suml); + + sumh = (sumh + 2) >> 2; + suml= (suml + 2) >> 2; + return (sumh+suml); +} + +inline int sa8d_16x16_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sumh0); + ALIGN_VAR_16(int, suml0); + + ALIGN_VAR_16(int, sumh1); + ALIGN_VAR_16(int, suml1); + + ALIGN_VAR_16(int, sum); + + LOAD_ZERO; + vec_s16_t pix1v, pix2v; + vec_s16_t diffh0v, diffh1v, diffh2v, diffh3v, + diffh4v, diffh5v, diffh6v, diffh7v; + vec_s16_t diffl0v, diffl1v, diffl2v, diffl3v, + diffl4v, diffl5v, diffl6v, diffl7v; + vec_s16_t sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v; + vec_s16_t sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v; + vec_s16_t temp0v, temp1v, temp2v, temp3v; + + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh3v, diffl3v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh4v, diffl4v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh5v, diffl5v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh6v, diffl6v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh7v, diffl7v); + + SA8D_1D_ALTIVEC(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v); + VEC_TRANSPOSE_8(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v, + sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v ); + SA8D_1D_ALTIVEC(sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v); + + SA8D_1D_ALTIVEC(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v); + VEC_TRANSPOSE_8(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v, + sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v ); + SA8D_1D_ALTIVEC(sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v); + + /* accumulation of the absolute value of all elements of the resulting bloc */ + sa8dh0v = vec_max( sa8dh0v, vec_sub( zero_s16v, sa8dh0v ) ); + sa8dh1v = vec_max( sa8dh1v, vec_sub( zero_s16v, sa8dh1v ) ); + vec_s16_t sumh01v = vec_add(sa8dh0v, sa8dh1v); + + sa8dh2v = vec_max( sa8dh2v, vec_sub( zero_s16v, sa8dh2v ) ); + sa8dh3v = vec_max( sa8dh3v, vec_sub( zero_s16v, sa8dh3v ) ); + vec_s16_t sumh23v = vec_add(sa8dh2v, sa8dh3v); + + sa8dh4v = vec_max( sa8dh4v, vec_sub( zero_s16v, sa8dh4v ) ); + sa8dh5v = vec_max( sa8dh5v, vec_sub( zero_s16v, sa8dh5v ) ); + vec_s16_t sumh45v = vec_add(sa8dh4v, sa8dh5v); + + sa8dh6v = vec_max( sa8dh6v, vec_sub( zero_s16v, sa8dh6v ) ); + sa8dh7v = vec_max( sa8dh7v, vec_sub( zero_s16v, sa8dh7v ) ); + vec_s16_t sumh67v = vec_add(sa8dh6v, sa8dh7v); + + vec_s16_t sumh0123v = vec_add(sumh01v, sumh23v); + vec_s16_t sumh4567v = vec_add(sumh45v, sumh67v); + + vec_s32_t 
sumblocv_h0; + + sumblocv_h0 = vec_sum4s(sumh0123v, (vec_s32_t)zerov ); + sumblocv_h0 = vec_sum4s(sumh4567v, sumblocv_h0 ); + sumblocv_h0 = vec_sums(sumblocv_h0, (vec_s32_t)zerov ); + sumblocv_h0 = vec_splat(sumblocv_h0, 3); + vec_ste(sumblocv_h0, 0, &sumh0); + + sa8dl0v = vec_max( sa8dl0v, vec_sub( zero_s16v, sa8dl0v ) ); + sa8dl1v = vec_max( sa8dl1v, vec_sub( zero_s16v, sa8dl1v ) ); + vec_s16_t suml01v = vec_add(sa8dl0v, sa8dl1v); + + sa8dl2v = vec_max( sa8dl2v, vec_sub( zero_s16v, sa8dl2v ) ); + sa8dl3v = vec_max( sa8dl3v, vec_sub( zero_s16v, sa8dl3v ) ); + vec_s16_t suml23v = vec_add(sa8dl2v, sa8dl3v); + + sa8dl4v = vec_max( sa8dl4v, vec_sub( zero_s16v, sa8dl4v ) ); + sa8dl5v = vec_max( sa8dl5v, vec_sub( zero_s16v, sa8dl5v ) ); + vec_s16_t suml45v = vec_add(sa8dl4v, sa8dl5v); + + sa8dl6v = vec_max( sa8dl6v, vec_sub( zero_s16v, sa8dl6v ) ); + sa8dl7v = vec_max( sa8dl7v, vec_sub( zero_s16v, sa8dl7v ) ); + vec_s16_t suml67v = vec_add(sa8dl6v, sa8dl7v); + + vec_s16_t suml0123v = vec_add(suml01v, suml23v); + vec_s16_t suml4567v = vec_add(suml45v, suml67v); + + vec_s32_t sumblocv_l0; + + sumblocv_l0 = vec_sum4s(suml0123v, (vec_s32_t)zerov ); + sumblocv_l0 = vec_sum4s(suml4567v, sumblocv_l0 ); + sumblocv_l0 = vec_sums(sumblocv_l0, (vec_s32_t)zerov ); + sumblocv_l0 = vec_splat(sumblocv_l0, 3); + vec_ste(sumblocv_l0, 0, &suml0); + + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh0v,diffl0v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh1v, diffl1v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh2v, diffl2v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh3v, diffl3v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh4v, diffl4v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh5v, diffl5v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh6v, diffl6v); + VEC_DIFF_S16(pix1,i_pix1,pix2,i_pix2,diffh7v, diffl7v); + + SA8D_1D_ALTIVEC(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v); + VEC_TRANSPOSE_8(diffh0v, diffh1v, diffh2v, diffh3v, diffh4v, diffh5v, diffh6v, diffh7v, + sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v ); + SA8D_1D_ALTIVEC(sa8dh0v, sa8dh1v, sa8dh2v, sa8dh3v, sa8dh4v, sa8dh5v, sa8dh6v, sa8dh7v); + + SA8D_1D_ALTIVEC(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v); + VEC_TRANSPOSE_8(diffl0v, diffl1v, diffl2v, diffl3v, diffl4v, diffl5v, diffl6v, diffl7v, + sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v ); + SA8D_1D_ALTIVEC(sa8dl0v, sa8dl1v, sa8dl2v, sa8dl3v, sa8dl4v, sa8dl5v, sa8dl6v, sa8dl7v); + + /* accumulation of the absolute value of all elements of the resulting bloc */ + sa8dh0v = vec_max( sa8dh0v, vec_sub( zero_s16v, sa8dh0v ) ); + sa8dh1v = vec_max( sa8dh1v, vec_sub( zero_s16v, sa8dh1v ) ); + sumh01v = vec_add(sa8dh0v, sa8dh1v); + + sa8dh2v = vec_max( sa8dh2v, vec_sub( zero_s16v, sa8dh2v ) ); + sa8dh3v = vec_max( sa8dh3v, vec_sub( zero_s16v, sa8dh3v ) ); + sumh23v = vec_add(sa8dh2v, sa8dh3v); + + sa8dh4v = vec_max( sa8dh4v, vec_sub( zero_s16v, sa8dh4v ) ); + sa8dh5v = vec_max( sa8dh5v, vec_sub( zero_s16v, sa8dh5v ) ); + sumh45v = vec_add(sa8dh4v, sa8dh5v); + + sa8dh6v = vec_max( sa8dh6v, vec_sub( zero_s16v, sa8dh6v ) ); + sa8dh7v = vec_max( sa8dh7v, vec_sub( zero_s16v, sa8dh7v ) ); + sumh67v = vec_add(sa8dh6v, sa8dh7v); + + sumh0123v = vec_add(sumh01v, sumh23v); + sumh4567v = vec_add(sumh45v, sumh67v); + + vec_s32_t sumblocv_h1; + + sumblocv_h1 = vec_sum4s(sumh0123v, (vec_s32_t)zerov ); + sumblocv_h1 = vec_sum4s(sumh4567v, sumblocv_h1 ); + sumblocv_h1 = vec_sums(sumblocv_h1, (vec_s32_t)zerov ); + sumblocv_h1 
= vec_splat(sumblocv_h1, 3); + vec_ste(sumblocv_h1, 0, &sumh1); + + sa8dl0v = vec_max( sa8dl0v, vec_sub( zero_s16v, sa8dl0v ) ); + sa8dl1v = vec_max( sa8dl1v, vec_sub( zero_s16v, sa8dl1v ) ); + suml01v = vec_add(sa8dl0v, sa8dl1v); + + sa8dl2v = vec_max( sa8dl2v, vec_sub( zero_s16v, sa8dl2v ) ); + sa8dl3v = vec_max( sa8dl3v, vec_sub( zero_s16v, sa8dl3v ) ); + suml23v = vec_add(sa8dl2v, sa8dl3v); + + sa8dl4v = vec_max( sa8dl4v, vec_sub( zero_s16v, sa8dl4v ) ); + sa8dl5v = vec_max( sa8dl5v, vec_sub( zero_s16v, sa8dl5v ) ); + suml45v = vec_add(sa8dl4v, sa8dl5v); + + sa8dl6v = vec_max( sa8dl6v, vec_sub( zero_s16v, sa8dl6v ) ); + sa8dl7v = vec_max( sa8dl7v, vec_sub( zero_s16v, sa8dl7v ) ); + suml67v = vec_add(sa8dl6v, sa8dl7v); + + suml0123v = vec_add(suml01v, suml23v); + suml4567v = vec_add(suml45v, suml67v); + + vec_s32_t sumblocv_l1; + + sumblocv_l1 = vec_sum4s(suml0123v, (vec_s32_t)zerov ); + sumblocv_l1 = vec_sum4s(suml4567v, sumblocv_l1 ); + sumblocv_l1 = vec_sums(sumblocv_l1, (vec_s32_t)zerov ); + sumblocv_l1 = vec_splat(sumblocv_l1, 3); + vec_ste(sumblocv_l1, 0, &suml1); + + sum = (sumh0+suml0+sumh1+suml1 + 2) >>2; + return (sum ); +} + +int sa8d_16x32_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sum); + sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16*i_pix1, i_pix1, pix2+16*i_pix2, i_pix2); + return sum; +} + +int sa8d_32x32_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sum); + int offset1, offset2; + offset1 = 16*i_pix1; + offset2 = 16*i_pix2; + sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16, i_pix1, pix2+16, i_pix2) + + sa8d_16x16_altivec(pix1+offset1, i_pix1, pix2+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+16+offset1, i_pix1, pix2+16+offset2, i_pix2); + return sum; +} + +int sa8d_32x64_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sum); + int offset1, offset2; + offset1 = 16*i_pix1; + offset2 = 16*i_pix2; + sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16, i_pix1, pix2+16, i_pix2) + + sa8d_16x16_altivec(pix1+offset1, i_pix1, pix2+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+16+offset1, i_pix1, pix2+16+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+32*i_pix1, i_pix1, pix2+32*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16+32*i_pix1, i_pix1, pix2+16+32*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+48*i_pix1, i_pix1, pix2+48*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16+48*i_pix1, i_pix1, pix2+16+48*i_pix2, i_pix2); + return sum; +} + +int sa8d_64x64_altivec(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2) +{ + ALIGN_VAR_16(int, sum); + int offset1, offset2; + offset1 = 16*i_pix1; + offset2 = 16*i_pix2; + sum = sa8d_16x16_altivec(pix1, i_pix1, pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16, i_pix1, pix2+16, i_pix2) + + sa8d_16x16_altivec(pix1+32, i_pix1, pix2+32, i_pix2) + + sa8d_16x16_altivec(pix1+48, i_pix1, pix2+48, i_pix2) + + sa8d_16x16_altivec(pix1+offset1, i_pix1, pix2+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+16+offset1, i_pix1, pix2+16+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+32+offset1, i_pix1, pix2+32+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+48+offset1, i_pix1, pix2+48+offset2, i_pix2) + + sa8d_16x16_altivec(pix1+32*i_pix1, i_pix1, pix2+32*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16+32*i_pix1, i_pix1, pix2+16+32*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+32+32*i_pix1, 
i_pix1, pix2+32+32*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+48+32*i_pix1, i_pix1, pix2+48+32*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+48*i_pix1, i_pix1, pix2+48*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+16+48*i_pix1, i_pix1, pix2+16+48*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+32+48*i_pix1, i_pix1, pix2+32+48*i_pix2, i_pix2) + + sa8d_16x16_altivec(pix1+48+48*i_pix1, i_pix1, pix2+48+48*i_pix2, i_pix2); + return sum; +} + +/* Initialize entries for pixel functions defined in this file */ +void setupPixelPrimitives_altivec(EncoderPrimitives &p) +{ +#define LUMA_PU(W, H) \ + if (W<=16) { \ + p.pu[LUMA_ ## W ## x ## H].sad = sad16_altivec<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].sad_x3 = sad16_x3_altivec<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].sad_x4 = sad16_x4_altivec<W, H>; \ + } \ + else { \ + p.pu[LUMA_ ## W ## x ## H].sad = sad_altivec<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].sad_x3 = sad_x3_altivec<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].sad_x4 = sad_x4_altivec<W, H>; \ + } + + LUMA_PU(4, 4); + LUMA_PU(8, 8); + LUMA_PU(16, 16); + LUMA_PU(32, 32); + LUMA_PU(64, 64); + LUMA_PU(4, 8); + LUMA_PU(8, 4); + LUMA_PU(16, 8); + LUMA_PU(8, 16); + LUMA_PU(16, 12); + LUMA_PU(12, 16); + LUMA_PU(16, 4); + LUMA_PU(4, 16); + LUMA_PU(32, 16); + LUMA_PU(16, 32); + LUMA_PU(32, 24); + LUMA_PU(24, 32); + LUMA_PU(32, 8); + LUMA_PU(8, 32); + LUMA_PU(64, 32); + LUMA_PU(32, 64); + LUMA_PU(64, 48); + LUMA_PU(48, 64); + LUMA_PU(64, 16); + LUMA_PU(16, 64); + + p.pu[LUMA_4x4].satd = satd_4x4_altivec;//satd_4x4; + p.pu[LUMA_8x8].satd = satd_8x8_altivec;//satd8<8, 8>; + p.pu[LUMA_8x4].satd = satd_8x4_altivec;//satd_8x4; + p.pu[LUMA_4x8].satd = satd_4x8_altivec;//satd4<4, 8>; + p.pu[LUMA_16x16].satd = satd_16x16_altivec;//satd8<16, 16>; + p.pu[LUMA_16x8].satd = satd_16x8_altivec;//satd8<16, 8>; + p.pu[LUMA_8x16].satd = satd_8x16_altivec;//satd8<8, 16>; + p.pu[LUMA_16x12].satd = satd_altivec<16, 12>;//satd8<16, 12>; + p.pu[LUMA_12x16].satd = satd_altivec<12, 16>;//satd4<12, 16>; + p.pu[LUMA_16x4].satd = satd_altivec<16, 4>;//satd8<16, 4>; + p.pu[LUMA_4x16].satd = satd_altivec<4, 16>;//satd4<4, 16>; + p.pu[LUMA_32x32].satd = satd_altivec<32, 32>;//satd8<32, 32>; + p.pu[LUMA_32x16].satd = satd_altivec<32, 16>;//satd8<32, 16>; + p.pu[LUMA_16x32].satd = satd_altivec<16, 32>;//satd8<16, 32>; + p.pu[LUMA_32x24].satd = satd_altivec<32, 24>;//satd8<32, 24>; + p.pu[LUMA_24x32].satd = satd_altivec<24, 32>;//satd8<24, 32>; + p.pu[LUMA_32x8].satd = satd_altivec<32, 8>;//satd8<32, 8>; + p.pu[LUMA_8x32].satd = satd_altivec<8,32>;//satd8<8, 32>; + p.pu[LUMA_64x64].satd = satd_altivec<64, 64>;//satd8<64, 64>; + p.pu[LUMA_64x32].satd = satd_altivec<64, 32>;//satd8<64, 32>; + p.pu[LUMA_32x64].satd = satd_altivec<32, 64>;//satd8<32, 64>; + p.pu[LUMA_64x48].satd = satd_altivec<64, 48>;//satd8<64, 48>; + p.pu[LUMA_48x64].satd = satd_altivec<48, 64>;//satd8<48, 64>; + p.pu[LUMA_64x16].satd = satd_altivec<64, 16>;//satd8<64, 16>; + p.pu[LUMA_16x64].satd = satd_altivec<16, 64>;//satd8<16, 64>; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = satd_4x4_altivec;//satd_4x4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = satd_8x8_altivec;//satd8<8, 8>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = satd_16x16_altivec;//satd8<16, 16>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = satd_altivec<32, 32>;//satd8<32, 32>; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = satd_8x4_altivec;//satd_8x4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = satd_4x8_altivec;//satd4<4, 8>; + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = satd_16x8_altivec;//satd8<16, 8>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = satd_8x16_altivec;//satd8<8, 16>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = satd_altivec<32, 16>;//satd8<32, 16>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = satd_altivec<16, 32>;//satd8<16, 32>; + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = satd_altivec<16, 12>;//satd4<16, 12>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = satd_altivec<12, 16>;//satd4<12, 16>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd = satd_altivec<16, 4>;//satd4<16, 4>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = satd_altivec<4, 16>;//satd4<4, 16>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = satd_altivec<32, 24>;//satd8<32, 24>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = satd_altivec<24, 32>;//satd8<24, 32>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = satd_altivec<32, 8>;//satd8<32, 8>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = satd_altivec<8,32>;//satd8<8, 32>; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = satd_4x8_altivec;//satd4<4, 8>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = satd_8x16_altivec;//satd8<8, 16>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = satd_altivec<16, 32>;//satd8<16, 32>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = satd_altivec<32, 64>;//satd8<32, 64>; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = satd_4x4_altivec;//satd_4x4; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = satd_8x8_altivec;//satd8<8, 8>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = satd_altivec<4, 16>;//satd4<4, 16>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = satd_16x16_altivec;//satd8<16, 16>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = satd_altivec<8,32>;//satd8<8, 32>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = satd_altivec<32, 32>;//satd8<32, 32>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = satd_altivec<16, 64>;//satd8<16, 64>; + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = satd_altivec<8, 12>;//satd4<8, 12>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = satd_8x4_altivec;//satd4<8, 4>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = satd_altivec<16, 24>;//satd8<16, 24>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = satd_altivec<12, 32>;//satd4<12, 32>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = satd_16x8_altivec;//satd8<16, 8>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = satd_altivec<4, 32>;//satd4<4, 32>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = satd_altivec<32, 48>;//satd8<32, 48>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = satd_altivec<24, 64>;//satd8<24, 64>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = satd_altivec<32, 16>;//satd8<32, 16>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = satd_altivec<8,64>;//satd8<8, 64>; + + p.cu[BLOCK_4x4].sa8d = satd_4x4_altivec;//satd_4x4; + p.cu[BLOCK_8x8].sa8d = sa8d_8x8_altivec;//sa8d_8x8; + p.cu[BLOCK_16x16].sa8d = sa8d_16x16_altivec;//sa8d_16x16; + p.cu[BLOCK_32x32].sa8d = sa8d_32x32_altivec;//sa8d16<32, 32>; + p.cu[BLOCK_64x64].sa8d = sa8d_64x64_altivec;//sa8d16<64, 64>; + + p.chroma[X265_CSP_I420].cu[BLOCK_16x16].sa8d = sa8d_8x8_altivec;//sa8d8<8, 8>; + p.chroma[X265_CSP_I420].cu[BLOCK_32x32].sa8d = sa8d_16x16_altivec;//sa8d16<16, 16>; + p.chroma[X265_CSP_I420].cu[BLOCK_64x64].sa8d = sa8d_32x32_altivec;//sa8d16<32, 32>; + + 
p.chroma[X265_CSP_I422].cu[BLOCK_16x16].sa8d = sa8d_8x16_altivec;//sa8d8<8, 16>; + p.chroma[X265_CSP_I422].cu[BLOCK_32x32].sa8d = sa8d_16x32_altivec;//sa8d16<16, 32>; + p.chroma[X265_CSP_I422].cu[BLOCK_64x64].sa8d = sa8d_32x64_altivec;//sa8d16<32, 64>; + +} +}
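The sa8d_32x32, sa8d_32x64 and sa8d_64x64 routines in the hunk above are plain compositions of the 16x16 kernel over 16-pixel tiles of the two planes. A minimal scalar sketch of that tiling pattern follows; blockCost16x16 is a hypothetical stand-in for sa8d_16x16_altivec, and SAD is used only to keep the example short.

    // Tiling sketch: a 16x16 cost kernel summed over every 16x16 tile of a
    // larger block, mirroring the unrolled sums in sa8d_32x32/64x64 above.
    #include <cstdint>
    #include <cstdlib>

    typedef uint8_t pixel;

    static int blockCost16x16(const pixel* pix1, intptr_t stride1,
                              const pixel* pix2, intptr_t stride2)
    {
        int sum = 0;                      // SAD stand-in for sa8d_16x16_altivec
        for (int y = 0; y < 16; y++)
            for (int x = 0; x < 16; x++)
                sum += abs((int)pix1[y * stride1 + x] - (int)pix2[y * stride2 + x]);
        return sum;
    }

    static int blockCostTiled(const pixel* pix1, intptr_t stride1,
                              const pixel* pix2, intptr_t stride2,
                              int width, int height)   // width, height: multiples of 16
    {
        int sum = 0;
        for (int y = 0; y < height; y += 16)
            for (int x = 0; x < width; x += 16)
                sum += blockCost16x16(pix1 + y * stride1 + x, stride1,
                                      pix2 + y * stride2 + x, stride2);
        return sum;
    }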
View file
x265_2.2.tar.gz/source/common/ppc/ppccommon.h
Added
@@ -0,0 +1,91 @@ +/***************************************************************************** + * Copyright (C) 2013 x265 project + * + * Authors: Min Chen <min.chen@multicorewareinc.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA. + * + * This program is also available under a commercial proprietary license. + * For more information, contact us at license @ x265.com. + *****************************************************************************/ + +#ifndef X265_PPCCOMMON_H +#define X265_PPCCOMMON_H + + +#if HAVE_ALTIVEC +#include <altivec.h> + +#define vec_u8_t vector unsigned char +#define vec_s8_t vector signed char +#define vec_u16_t vector unsigned short +#define vec_s16_t vector signed short +#define vec_u32_t vector unsigned int +#define vec_s32_t vector signed int + +//copy from x264 +#define LOAD_ZERO const vec_u8_t zerov = vec_splat_u8( 0 ) + +#define zero_u8v (vec_u8_t) zerov +#define zero_s8v (vec_s8_t) zerov +#define zero_u16v (vec_u16_t) zerov +#define zero_s16v (vec_s16_t) zerov +#define zero_u32v (vec_u32_t) zerov +#define zero_s32v (vec_s32_t) zerov + +/*********************************************************************** + * 8 <-> 16 bits conversions + **********************************************************************/ +#ifdef WORDS_BIGENDIAN +#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( zero_u8v, (vec_u8_t) v ) +#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( zero_u8v, (vec_u8_t) v ) +#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( zero_u8v, (vec_u8_t) v ) +#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( zero_u8v, (vec_u8_t) v ) +#else +#define vec_u8_to_u16_h(v) (vec_u16_t) vec_mergeh( (vec_u8_t) v, zero_u8v ) +#define vec_u8_to_u16_l(v) (vec_u16_t) vec_mergel( (vec_u8_t) v, zero_u8v ) +#define vec_u8_to_s16_h(v) (vec_s16_t) vec_mergeh( (vec_u8_t) v, zero_u8v ) +#define vec_u8_to_s16_l(v) (vec_s16_t) vec_mergel( (vec_u8_t) v, zero_u8v ) +#endif + +#define vec_u8_to_u16(v) vec_u8_to_u16_h(v) +#define vec_u8_to_s16(v) vec_u8_to_s16_h(v) + +#ifdef WORDS_BIGENDIAN +#define vec_u16_to_u32_h(v) (vec_u32_t) vec_mergeh( zero_u16v, (vec_u16_t) v ) +#define vec_u16_to_u32_l(v) (vec_u32_t) vec_mergel( zero_u16v, (vec_u16_t) v ) +#define vec_u16_to_s32_h(v) (vec_s32_t) vec_mergeh( zero_u16v, (vec_u16_t) v ) +#define vec_u16_to_s32_l(v) (vec_s32_t) vec_mergel( zero_u16v, (vec_u16_t) v ) +#else +#define vec_u16_to_u32_h(v) (vec_u32_t) vec_mergeh( (vec_u16_t) v, zero_u16v ) +#define vec_u16_to_u32_l(v) (vec_u32_t) vec_mergel( (vec_u16_t) v, zero_u16v ) +#define vec_u16_to_s32_h(v) (vec_s32_t) vec_mergeh( (vec_u16_t) v, zero_u16v ) +#define vec_u16_to_s32_l(v) (vec_s32_t) vec_mergel( (vec_u16_t) v, zero_u16v ) +#endif + +#define vec_u16_to_u32(v) vec_u16_to_u32_h(v) +#define vec_u16_to_s32(v) vec_u16_to_s32_h(v) + +#define vec_u32_to_u16(v) vec_pack( v, zero_u32v ) +#define 
vec_s32_to_u16(v) vec_packsu( v, zero_s32v ) + +#define BITS_PER_SUM (8 * sizeof(sum_t)) + +#endif /* HAVE_ALTIVEC */ + +#endif /* X265_PPCCOMMON_H */ + + +
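The vec_u8_to_u16_h/_l macros above widen vector lanes by interleaving a byte vector with a zero vector; the swapped vec_mergeh/vec_mergel argument order under WORDS_BIGENDIAN is what keeps the result identical on both layouts. A scalar model of what the 8-to-16-bit pair computes, illustrative only and not part of the patch:

    // Scalar reference: zero-extend the first (high) or last (low) eight
    // bytes of a 16-byte vector into eight 16-bit lanes.
    #include <cstdint>

    static void u8_to_u16_h(const uint8_t in[16], uint16_t out[8])
    {
        for (int i = 0; i < 8; i++)
            out[i] = in[i];        // bytes 0..7, zero-extended
    }

    static void u8_to_u16_l(const uint8_t in[16], uint16_t out[8])
    {
        for (int i = 0; i < 8; i++)
            out[i] = in[i + 8];    // bytes 8..15, zero-extended
    }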
View file
x265_2.1.tar.gz/source/common/primitives.cpp -> x265_2.2.tar.gz/source/common/primitives.cpp
Changed
@@ -243,6 +243,15 @@ #endif setupAssemblyPrimitives(primitives, param->cpuid); #endif +#if HAVE_ALTIVEC + if (param->cpuid & X265_CPU_ALTIVEC) + { + setupPixelPrimitives_altivec(primitives); // pixel_altivec.cpp, overwrite the initialization for altivec optimizated functions + setupDCTPrimitives_altivec(primitives); // dct_altivec.cpp, overwrite the initialization for altivec optimizated functions + setupFilterPrimitives_altivec(primitives); // ipfilter.cpp, overwrite the initialization for altivec optimizated functions + setupIntraPrimitives_altivec(primitives); // intrapred_altivec.cpp, overwrite the initialization for altivec optimizated functions + } +#endif setupAliasPrimitives(primitives); }
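The hunk above overrides the generic primitive table with the AltiVec setups only when the detected CPU flags include X265_CPU_ALTIVEC. A minimal sketch of that dispatch pattern, with all names invented for illustration:

    // Function-pointer table filled with a portable baseline first, then
    // selectively overwritten when the platform-specific CPU flag is present.
    #include <cstdint>

    struct Primitives { int (*cost)(const uint8_t*, const uint8_t*, int); };

    static int cost_c(const uint8_t*, const uint8_t*, int)       { return 0; }  // placeholder
    static int cost_altivec(const uint8_t*, const uint8_t*, int) { return 0; }  // placeholder

    enum { CPU_ALTIVEC_FLAG = 1 << 0 };  // hypothetical flag bit

    static void setupPrimitives(Primitives& p, uint32_t cpuMask)
    {
        p.cost = cost_c;                  // portable baseline
        if (cpuMask & CPU_ALTIVEC_FLAG)
            p.cost = cost_altivec;        // override, as primitives.cpp does above
    }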
View file
x265_2.1.tar.gz/source/common/primitives.h -> x265_2.2.tar.gz/source/common/primitives.h
Changed
@@ -115,6 +115,7 @@ typedef sse_t (*pixel_sse_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned typedef sse_t (*pixel_sse_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride); typedef sse_t (*pixel_ssd_s_t)(const int16_t* fenc, intptr_t fencstride); +typedef int(*pixelcmp_ads_t)(int encDC[], uint32_t *sums, int delta, uint16_t *costMvX, int16_t *mvs, int width, int thresh); typedef void (*pixelcmp_x4_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); typedef void (*pixelcmp_x3_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); typedef void (*blockfill_s_t)(int16_t* dst, intptr_t dstride, int16_t val); @@ -217,6 +218,7 @@ pixelcmp_t sad; // Sum of Absolute Differences pixelcmp_x3_t sad_x3; // Sum of Absolute Differences, 3 mv offsets at once pixelcmp_x4_t sad_x4; // Sum of Absolute Differences, 4 mv offsets at once + pixelcmp_ads_t ads; // Absolute Differences sum pixelcmp_t satd; // Sum of Absolute Transformed Differences (4x4 Hadamard) filter_pp_t luma_hpp; // 8-tap luma motion compensation interpolation filters @@ -402,6 +404,22 @@ return part; } +/* Computes the size of the LumaPU for a given LumaPU enum */ +inline void sizesFromPartition(int part, int *width, int *height) +{ + X265_CHECK(part >= 0 && part <= 24, "Invalid part %d \n", part); + extern const uint8_t lumaPartitionMapTable[]; + int index = 0; + for (int i = 0; i < 256;i++) + if (part == lumaPartitionMapTable[i]) + { + index = i; + break; + } + *width = 4 * ((index >> 4) + 1); + *height = 4 * ((index % 16) + 1); +} + inline int partitionFromLog2Size(int log2Size) { X265_CHECK(2 <= log2Size && log2Size <= 6, "Invalid block size\n"); @@ -412,6 +430,12 @@ void setupInstrinsicPrimitives(EncoderPrimitives &p, int cpuMask); void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask); void setupAliasPrimitives(EncoderPrimitives &p); +#if HAVE_ALTIVEC +void setupPixelPrimitives_altivec(EncoderPrimitives &p); +void setupDCTPrimitives_altivec(EncoderPrimitives &p); +void setupFilterPrimitives_altivec(EncoderPrimitives &p); +void setupIntraPrimitives_altivec(EncoderPrimitives &p); +#endif } #if !EXPORT_C_API
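The new sizesFromPartition() helper above inverts the LumaPU lookup: it scans lumaPartitionMapTable for the enum value and decodes the table index back into a width/height pair. A small round-trip sketch of that index packing; packSizeIndex is written here only to illustrate the inverse of the decode shown in the patch.

    // index = ((width/4 - 1) << 4) | (height/4 - 1), so
    // width = 4*((index >> 4) + 1) and height = 4*((index % 16) + 1).
    #include <cassert>

    static int packSizeIndex(int width, int height)                  // illustrative inverse
    {
        return (((width >> 2) - 1) << 4) | ((height >> 2) - 1);
    }

    static void unpackSizeIndex(int index, int* width, int* height)  // decode as in the patch
    {
        *width  = 4 * ((index >> 4) + 1);
        *height = 4 * ((index % 16) + 1);
    }

    int main()
    {
        int w, h;
        unpackSizeIndex(packSizeIndex(32, 24), &w, &h);
        assert(w == 32 && h == 24);   // round-trips for all PU dimensions 4..64
        return 0;
    }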
View file
x265_2.1.tar.gz/source/common/scalinglist.cpp -> x265_2.2.tar.gz/source/common/scalinglist.cpp
Changed
@@ -29,64 +29,6 @@ // file-anonymous namespace /* Strings for scaling list file parsing */ -const char MatrixType[4][6][20] = -{ - { - "INTRA4X4_LUMA", - "INTRA4X4_CHROMAU", - "INTRA4X4_CHROMAV", - "INTER4X4_LUMA", - "INTER4X4_CHROMAU", - "INTER4X4_CHROMAV" - }, - { - "INTRA8X8_LUMA", - "INTRA8X8_CHROMAU", - "INTRA8X8_CHROMAV", - "INTER8X8_LUMA", - "INTER8X8_CHROMAU", - "INTER8X8_CHROMAV" - }, - { - "INTRA16X16_LUMA", - "INTRA16X16_CHROMAU", - "INTRA16X16_CHROMAV", - "INTER16X16_LUMA", - "INTER16X16_CHROMAU", - "INTER16X16_CHROMAV" - }, - { - "INTRA32X32_LUMA", - "", - "", - "INTER32X32_LUMA", - "", - "", - }, -}; -const char MatrixType_DC[4][12][22] = -{ - { - }, - { - }, - { - "INTRA16X16_LUMA_DC", - "INTRA16X16_CHROMAU_DC", - "INTRA16X16_CHROMAV_DC", - "INTER16X16_LUMA_DC", - "INTER16X16_CHROMAU_DC", - "INTER16X16_CHROMAV_DC" - }, - { - "INTRA32X32_LUMA_DC", - "", - "", - "INTER32X32_LUMA_DC", - "", - "", - }, -}; static int quantTSDefault4x4[16] = { @@ -124,6 +66,64 @@ namespace X265_NS { // private namespace + const char ScalingList::MatrixType[4][6][20] = + { + { + "INTRA4X4_LUMA", + "INTRA4X4_CHROMAU", + "INTRA4X4_CHROMAV", + "INTER4X4_LUMA", + "INTER4X4_CHROMAU", + "INTER4X4_CHROMAV" + }, + { + "INTRA8X8_LUMA", + "INTRA8X8_CHROMAU", + "INTRA8X8_CHROMAV", + "INTER8X8_LUMA", + "INTER8X8_CHROMAU", + "INTER8X8_CHROMAV" + }, + { + "INTRA16X16_LUMA", + "INTRA16X16_CHROMAU", + "INTRA16X16_CHROMAV", + "INTER16X16_LUMA", + "INTER16X16_CHROMAU", + "INTER16X16_CHROMAV" + }, + { + "INTRA32X32_LUMA", + "", + "", + "INTER32X32_LUMA", + "", + "", + }, + }; + const char ScalingList::MatrixType_DC[4][12][22] = + { + { + }, + { + }, + { + "INTRA16X16_LUMA_DC", + "INTRA16X16_CHROMAU_DC", + "INTRA16X16_CHROMAV_DC", + "INTER16X16_LUMA_DC", + "INTER16X16_CHROMAU_DC", + "INTER16X16_CHROMAV_DC" + }, + { + "INTRA32X32_LUMA_DC", + "", + "", + "INTER32X32_LUMA_DC", + "", + "", + }, + }; const int ScalingList::s_numCoefPerSize[NUM_SIZES] = { 16, 64, 256, 1024 }; const int32_t ScalingList::s_quantScales[NUM_REM] = { 26214, 23302, 20560, 18396, 16384, 14564 }; @@ -312,6 +312,22 @@ m_scalingListDC[sizeIdc][listIdc] = data; } } + if (sizeIdc == 3) + { + for (int listIdc = 1; listIdc < NUM_LISTS; listIdc++) + { + if (listIdc % 3 != 0) + { + src = m_scalingListCoef[sizeIdc][listIdc]; + const int *srcNextSmallerSize = m_scalingListCoef[sizeIdc - 1][listIdc]; + for (int i = 0; i < size; i++) + { + src[i] = srcNextSmallerSize[i]; + } + m_scalingListDC[sizeIdc][listIdc] = m_scalingListDC[sizeIdc - 1][listIdc]; + } + } + } } fclose(fp);
View file
x265_2.1.tar.gz/source/common/scalinglist.h -> x265_2.2.tar.gz/source/common/scalinglist.h
Changed
@@ -42,6 +42,8 @@ static const int s_numCoefPerSize[NUM_SIZES]; static const int32_t s_invQuantScales[NUM_REM]; static const int32_t s_quantScales[NUM_REM]; + static const char MatrixType[4][6][20]; + static const char MatrixType_DC[4][12][22]; int32_t m_scalingListDC[NUM_SIZES][NUM_LISTS]; // the DC value of the matrix coefficient for 16x16 int32_t* m_scalingListCoef[NUM_SIZES][NUM_LISTS]; // quantization matrix
View file
x265_2.1.tar.gz/source/common/slice.h -> x265_2.2.tar.gz/source/common/slice.h
Changed
@@ -239,11 +239,16 @@ uint32_t maxLatencyIncrease; int numReorderPics; + RPS spsrps[MAX_NUM_SHORT_TERM_RPS]; + int spsrpsNum; + int numGOPBegin; + bool bUseSAO; // use param bool bUseAMP; // use param bool bUseStrongIntraSmoothing; // use param bool bTemporalMVPEnabled; - bool bDiscardOptionalVUI; + bool bEmitVUITimingInfo; + bool bEmitVUIHRDInfo; Window conformanceWindow; VUI vuiParameters; @@ -282,6 +287,8 @@ bool bDeblockingFilterControlPresent; bool bPicDisableDeblockingFilter; + + int numRefIdxDefault[2]; }; struct WeightParam @@ -334,6 +341,7 @@ int m_sliceQp; int m_poc; int m_lastIDR; + int m_rpsIdx; uint32_t m_colRefIdx; // never modified @@ -347,6 +355,10 @@ bool m_sLFaseFlag; // loop filter boundary flag bool m_colFromL0Flag; // collocated picture from List0 or List1 flag + int m_iPPSQpMinus26; + int numRefIdxDefault[2]; + int m_iNumRPSInSPS; + Slice() { m_lastIDR = 0; @@ -356,6 +368,10 @@ memset(m_refReconPicList, 0, sizeof(m_refReconPicList)); memset(m_refPOCList, 0, sizeof(m_refPOCList)); disableWeights(); + m_iPPSQpMinus26 = 0; + numRefIdxDefault[0] = 1; + numRefIdxDefault[1] = 1; + m_rpsIdx = -1; } void disableWeights();
View file
x265_2.1.tar.gz/source/common/version.cpp -> x265_2.2.tar.gz/source/common/version.cpp
Changed
@@ -77,7 +77,7 @@ #define BITS "[32 bit]" #endif -#if defined(ENABLE_ASSEMBLY) +#if defined(ENABLE_ASSEMBLY) || HAVE_ALTIVEC #define ASM "" #else #define ASM "[noasm]"
View file
x265_2.1.tar.gz/source/common/yuv.cpp -> x265_2.2.tar.gz/source/common/yuv.cpp
Changed
@@ -47,6 +47,11 @@ m_size = size; m_part = partitionFromSizes(size, size); + for (int i = 0; i < 2; i++) + for (int j = 0; j < MAX_NUM_REF; j++) + for (int k = 0; k < INTEGRAL_PLANE_NUM; k++) + m_integral[i][j][k] = NULL; + if (csp == X265_CSP_I400) { CHECKED_MALLOC(m_buf[0], pixel, size * size + 8);
View file
x265_2.1.tar.gz/source/common/yuv.h -> x265_2.2.tar.gz/source/common/yuv.h
Changed
@@ -48,6 +48,7 @@ int m_csp; int m_hChromaShift; int m_vChromaShift; + uint32_t *m_integral[2][MAX_NUM_REF][INTEGRAL_PLANE_NUM]; Yuv();
View file
x265_2.1.tar.gz/source/encoder/analysis.cpp -> x265_2.2.tar.gz/source/encoder/analysis.cpp
Changed
@@ -203,6 +203,57 @@ return *m_modeDepth[0].bestMode; } +int32_t Analysis::loadTUDepth(CUGeom cuGeom, CUData parentCTU) +{ + float predDepth = 0; + CUData* neighbourCU; + uint8_t count = 0; + int32_t maxTUDepth = -1; + neighbourCU = m_slice->m_refFrameList[0][0]->m_encData->m_picCTU; + predDepth += neighbourCU->m_refTuDepth[cuGeom.geomRecurId]; + count++; + if (m_slice->isInterB()) + { + neighbourCU = m_slice->m_refFrameList[1][0]->m_encData->m_picCTU; + predDepth += neighbourCU->m_refTuDepth[cuGeom.geomRecurId]; + count++; + } + if (parentCTU.m_cuAbove) + { + predDepth += parentCTU.m_cuAbove->m_refTuDepth[cuGeom.geomRecurId]; + count++; + if (parentCTU.m_cuAboveLeft) + { + predDepth += parentCTU.m_cuAboveLeft->m_refTuDepth[cuGeom.geomRecurId]; + count++; + } + if (parentCTU.m_cuAboveRight) + { + predDepth += parentCTU.m_cuAboveRight->m_refTuDepth[cuGeom.geomRecurId]; + count++; + } + } + if (parentCTU.m_cuLeft) + { + predDepth += parentCTU.m_cuLeft->m_refTuDepth[cuGeom.geomRecurId]; + count++; + } + predDepth /= count; + + if (predDepth == 0) + maxTUDepth = 0; + else if (predDepth < 1) + maxTUDepth = 1; + else if (predDepth >= 1 && predDepth <= 1.5) + maxTUDepth = 2; + else if (predDepth > 1.5 && predDepth <= 2.5) + maxTUDepth = 3; + else + maxTUDepth = -1; + + return maxTUDepth; +} + void Analysis::tryLossless(const CUGeom& cuGeom) { ModeDepth& md = m_modeDepth[cuGeom.depth]; @@ -394,6 +445,16 @@ cacheCost[cuIdx] = md.bestMode->rdCost; } + /* Save Intra CUs TU depth only when analysis mode is OFF */ + if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4 && !m_param->analysisMode) + { + CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr); + int8_t maxTUDepth = -1; + for (uint32_t i = 0; i < cuGeom.numPartitions; i++) + maxTUDepth = X265_MAX(maxTUDepth, md.pred[PRED_INTRA].cu.m_tuDepth[i]); + ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth; + } + /* Copy best data to encData CTU and recon */ md.bestMode->cu.copyToPic(depth); if (md.bestMode != &md.pred[PRED_SPLIT]) @@ -883,6 +944,16 @@ ModeDepth& md = m_modeDepth[depth]; md.bestMode = NULL; + if (m_param->searchMethod == X265_SEA) + { + int numPredDir = m_slice->isInterP() ? 
1 : 2; + int offset = (int)(m_frame->m_reconPic->m_cuOffsetY[parentCTU.m_cuAddr] + m_frame->m_reconPic->m_buOffsetY[cuGeom.absPartIdx]); + for (int list = 0; list < numPredDir; list++) + for (int i = 0; i < m_frame->m_encData->m_slice->m_numRefIdx[list]; i++) + for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++) + m_modeDepth[depth].fencYuv.m_integral[list][i][planes] = m_frame->m_encData->m_slice->m_refFrameList[list][i]->m_encData->m_meIntegral[planes] + offset; + } + PicYuv& reconPic = *m_frame->m_reconPic; bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); @@ -894,6 +965,9 @@ bool skipRectAmp = false; bool chooseMerge = false; + if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4) + m_maxTUDepth = loadTUDepth(cuGeom, parentCTU); + SplitData splitData[4]; splitData[0].initSplitCUData(); splitData[1].initSplitCUData(); @@ -1400,6 +1474,18 @@ if (m_param->rdLevel) md.bestMode->reconYuv.copyToPicYuv(reconPic, cuAddr, cuGeom.absPartIdx); + if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4) + { + if (mightNotSplit) + { + CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr); + int8_t maxTUDepth = -1; + for (uint32_t i = 0; i < cuGeom.numPartitions; i++) + maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]); + ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth; + } + } + return splitCUData; } @@ -1409,6 +1495,16 @@ ModeDepth& md = m_modeDepth[depth]; md.bestMode = NULL; + if (m_param->searchMethod == X265_SEA) + { + int numPredDir = m_slice->isInterP() ? 1 : 2; + int offset = (int)(m_frame->m_reconPic->m_cuOffsetY[parentCTU.m_cuAddr] + m_frame->m_reconPic->m_buOffsetY[cuGeom.absPartIdx]); + for (int list = 0; list < numPredDir; list++) + for (int i = 0; i < m_frame->m_encData->m_slice->m_numRefIdx[list]; i++) + for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++) + m_modeDepth[depth].fencYuv.m_integral[list][i][planes] = m_frame->m_encData->m_slice->m_refFrameList[list][i]->m_encData->m_meIntegral[planes] + offset; + } + bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); bool skipRecursion = false; @@ -1424,6 +1520,9 @@ md.pred[PRED_2Nx2N].rdCost = 0; } + if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4) + m_maxTUDepth = loadTUDepth(cuGeom, parentCTU); + SplitData splitData[4]; splitData[0].initSplitCUData(); splitData[1].initSplitCUData(); @@ -1751,6 +1850,18 @@ addSplitFlagCost(*md.bestMode, cuGeom.depth); } + if ((m_limitTU & X265_TU_LIMIT_NEIGH) && cuGeom.log2CUSize >= 4) + { + if (mightNotSplit) + { + CUData* ctu = md.bestMode->cu.m_encData->getPicCTU(parentCTU.m_cuAddr); + int8_t maxTUDepth = -1; + for (uint32_t i = 0; i < cuGeom.numPartitions; i++) + maxTUDepth = X265_MAX(maxTUDepth, md.bestMode->cu.m_tuDepth[i]); + ctu->m_refTuDepth[cuGeom.geomRecurId] = maxTUDepth; + } + } + /* compare split RD cost against best cost */ if (mightSplit && !skipRecursion) checkBestMode(md.pred[PRED_SPLIT], depth); @@ -1942,12 +2053,12 @@ if (m_param->maxSlices > 1) { // NOTE: First row in slice can't negative - if ((candMvField[i][0].mv.y < m_sliceMinY) | (candMvField[i][1].mv.y < m_sliceMinY)) + if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < m_sliceMinY) continue; // Last row in slice can't reference beyond bound since it is another slice area // TODO: we may beyond bound in future since these area have a chance to finish because we use parallel slices. 
Necessary prepare research on load balance - if ((candMvField[i][0].mv.y > m_sliceMaxY) | (candMvField[i][1].mv.y > m_sliceMaxY)) + if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > m_sliceMaxY) continue; } @@ -2072,12 +2183,12 @@ if (m_param->maxSlices > 1) { // NOTE: First row in slice can't negative - if ((candMvField[i][0].mv.y < m_sliceMinY) | (candMvField[i][1].mv.y < m_sliceMinY)) + if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < m_sliceMinY) continue; // Last row in slice can't reference beyond bound since it is another slice area // TODO: we may beyond bound in future since these area have a chance to finish because we use parallel slices. Necessary prepare research on load balance - if ((candMvField[i][0].mv.y > m_sliceMaxY) | (candMvField[i][1].mv.y > m_sliceMaxY)) + if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > m_sliceMaxY) continue; }
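loadTUDepth() above averages the recorded max TU depth of the temporal and spatial neighbour CTUs and quantizes that mean into a cap on the current CTU's TU recursion, with -1 meaning no limit. A condensed sketch of the mapping, with the neighbour gathering abstracted into a plain list:

    // Averaging/threshold step of loadTUDepth(); the caller is assumed to
    // have collected m_refTuDepth from whatever neighbour CTUs exist.
    #include <vector>

    static int tuDepthCapFromNeighbours(const std::vector<int>& neighbourDepths)
    {
        if (neighbourDepths.empty())
            return -1;                         // no information: leave TU depth unlimited
        float pred = 0.0f;
        for (int d : neighbourDepths)
            pred += (float)d;
        pred /= (float)neighbourDepths.size();

        if (pred == 0.0f)  return 0;
        if (pred < 1.0f)   return 1;
        if (pred <= 1.5f)  return 2;           // 1.0 <= pred <= 1.5
        if (pred <= 2.5f)  return 3;           // 1.5 <  pred <= 2.5
        return -1;                             // deep neighbours: do not limit
    }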
View file
x265_2.1.tar.gz/source/encoder/analysis.h -> x265_2.2.tar.gz/source/encoder/analysis.h
Changed
@@ -116,6 +116,7 @@ void destroy(); Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext); + int32_t loadTUDepth(CUGeom cuGeom, CUData parentCTU); protected: /* Analysis data for save/load mode, writes/reads data based on absPartIdx */
View file
x265_2.1.tar.gz/source/encoder/api.cpp -> x265_2.2.tar.gz/source/encoder/api.cpp
Changed
@@ -141,6 +141,11 @@ Encoder *encoder = static_cast<Encoder*>(enc); Entropy sbacCoder; Bitstream bs; + if (encoder->m_param->rc.bStatRead && encoder->m_param->bMultiPassOptRPS) + { + if (!encoder->computeSPSRPSIndex()) + return -1; + } encoder->getStreamHeaders(encoder->m_nalList, sbacCoder, bs); *pp_nal = &encoder->m_nalList.m_nal[0]; if (pi_nal) *pi_nal = encoder->m_nalList.m_numNal;
View file
x265_2.1.tar.gz/source/encoder/bitcost.cpp -> x265_2.2.tar.gz/source/encoder/bitcost.cpp
Changed
@@ -54,16 +54,40 @@ s_costs[qp][i] = s_costs[qp][-i] = (uint16_t)X265_MIN(s_bitsizes[i] * lambda + 0.5f, (1 << 15) - 1); } } - + for (int j = 0; j < 4; j++) + { + if (!s_fpelMvCosts[qp][j]) + { + ScopedLock s(s_costCalcLock); + if (!s_fpelMvCosts[qp][j]) + { + s_fpelMvCosts[qp][j] = X265_MALLOC(uint16_t, BC_MAX_MV + 1) + (BC_MAX_MV >> 1); + if (!s_fpelMvCosts[qp][j]) + { + x265_log(NULL, X265_LOG_ERROR, "BitCost s_fpelMvCosts buffer allocation failure\n"); + return; + } + for (int i = -(BC_MAX_MV >> 1); i < (BC_MAX_MV >> 1); i++) + { + s_fpelMvCosts[qp][j][i] = s_costs[qp][i * 4 + j]; + } + } + } + } m_cost = s_costs[qp]; + for (int j = 0; j < 4; j++) + { + m_fpelMvCosts[j] = s_fpelMvCosts[qp][j]; + } } - /*** * Class static data and methods */ uint16_t *BitCost::s_costs[BC_MAX_QP]; +uint16_t* BitCost::s_fpelMvCosts[BC_MAX_QP][4]; + float *BitCost::s_bitsizes; Lock BitCost::s_costCalcLock; @@ -96,6 +120,17 @@ s_costs[i] = NULL; } } + for (int i = 0; i < BC_MAX_QP; i++) + { + for (int j = 0; j < 4; j++) + { + if (s_fpelMvCosts[i][j]) + { + X265_FREE(s_fpelMvCosts[i][j] - (BC_MAX_MV >> 1)); + s_fpelMvCosts[i][j] = NULL; + } + } + } if (s_bitsizes) {
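The s_fpelMvCosts tables added above are the quarter-pel MV cost table split into four sub-pel phases (s_fpelMvCosts[qp][j][i] = s_costs[qp][i*4 + j]), so a full-pel search such as SEA can price an MV component with one lookup. A simplified sketch of that subsampling using non-negative indices; the real tables are centred around zero.

    // Split a quarter-pel cost table into four phase tables indexed in
    // full-pel units (simplified, non-negative indexing).
    #include <cstdint>
    #include <vector>

    static void buildFpelPhaseTables(const std::vector<uint16_t>& qpelCost,
                                     std::vector<uint16_t> fpelCost[4])
    {
        const size_t numFpel = qpelCost.size() / 4;
        for (int phase = 0; phase < 4; phase++)
        {
            fpelCost[phase].resize(numFpel);
            for (size_t i = 0; i < numFpel; i++)
                fpelCost[phase][i] = qpelCost[4 * i + phase];  // same mapping as the patch
        }
    }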
View file
x265_2.1.tar.gz/source/encoder/bitcost.h -> x265_2.2.tar.gz/source/encoder/bitcost.h
Changed
@@ -67,6 +67,8 @@ uint16_t *m_cost; + uint16_t *m_fpelMvCosts[4]; + MV m_mvp; BitCost& operator =(const BitCost&); @@ -84,6 +86,8 @@ static uint16_t *s_costs[BC_MAX_QP]; + static uint16_t *s_fpelMvCosts[BC_MAX_QP][4]; + static Lock s_costCalcLock; static void CalculateLogs();
View file
x265_2.1.tar.gz/source/encoder/dpb.cpp -> x265_2.2.tar.gz/source/encoder/dpb.cpp
Changed
@@ -92,6 +92,19 @@ m_freeList.pushBack(*curFrame); curFrame->m_encData->m_freeListNext = m_frameDataFreeList; m_frameDataFreeList = curFrame->m_encData; + + if (curFrame->m_encData->m_meBuffer) + { + for (int i = 0; i < INTEGRAL_PLANE_NUM; i++) + { + if (curFrame->m_encData->m_meBuffer[i] != NULL) + { + X265_FREE(curFrame->m_encData->m_meBuffer[i]); + curFrame->m_encData->m_meBuffer[i] = NULL; + } + } + } + curFrame->m_encData = NULL; curFrame->m_reconPic = NULL; }
View file
x265_2.1.tar.gz/source/encoder/encoder.cpp -> x265_2.2.tar.gz/source/encoder/encoder.cpp
Changed
@@ -74,6 +74,10 @@ m_threadPool = NULL; m_analysisFile = NULL; m_offsetEmergency = NULL; + m_iFrameNum = 0; + m_iPPSQpMinus26 = 0; + m_iLastSliceQp = 0; + m_rpsInSpsCount = 0; for (int i = 0; i < X265_MAX_FRAME_THREADS; i++) m_frameEncoder[i] = NULL; @@ -145,12 +149,6 @@ p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0; } - if (!p->bEnableWavefront && p->rc.vbvBufferSize) - { - x265_log(p, X265_LOG_ERROR, "VBV requires wavefront parallelism\n"); - m_aborted = true; - } - x265_log(p, X265_LOG_INFO, "Slices : %d\n", p->maxSlices); char buf[128]; @@ -318,6 +316,8 @@ if (!m_lookahead->create()) m_aborted = true; + initRefIdx(); + if (m_param->analysisMode) { const char* name = m_param->analysisFileName; @@ -869,6 +869,58 @@ slice->m_endCUAddr = slice->realEndAddress(m_sps.numCUsInFrame * NUM_4x4_PARTITIONS); } + if (m_param->searchMethod == X265_SEA && frameEnc->m_lowres.sliceType != X265_TYPE_B) + { + int padX = g_maxCUSize + 32; + int padY = g_maxCUSize + 16; + uint32_t numCuInHeight = (frameEnc->m_encData->m_reconPic->m_picHeight + g_maxCUSize - 1) / g_maxCUSize; + int maxHeight = numCuInHeight * g_maxCUSize; + for (int i = 0; i < INTEGRAL_PLANE_NUM; i++) + { + frameEnc->m_encData->m_meBuffer[i] = X265_MALLOC(uint32_t, frameEnc->m_reconPic->m_stride * (maxHeight + (2 * padY))); + if (frameEnc->m_encData->m_meBuffer[i]) + { + memset(frameEnc->m_encData->m_meBuffer[i], 0, sizeof(uint32_t)* frameEnc->m_reconPic->m_stride * (maxHeight + (2 * padY))); + frameEnc->m_encData->m_meIntegral[i] = frameEnc->m_encData->m_meBuffer[i] + frameEnc->m_encData->m_reconPic->m_stride * padY + padX; + } + else + x265_log(m_param, X265_LOG_ERROR, "SEA motion search: POC %d Integral buffer[%d] unallocated\n", frameEnc->m_poc, i); + } + } + + if (m_param->bOptQpPPS && frameEnc->m_lowres.bKeyframe && m_param->bRepeatHeaders) + { + ScopedLock qpLock(m_sliceQpLock); + if (m_iFrameNum > 0) + { + //Search the least cost + int64_t iLeastCost = m_iBitsCostSum[0]; + int iLeastId = 0; + for (int i = 1; i < QP_MAX_MAX + 1; i++) + { + if (iLeastCost > m_iBitsCostSum[i]) + { + iLeastId = i; + iLeastCost = m_iBitsCostSum[i]; + } + } + + /* If last slice Qp is close to (26 + m_iPPSQpMinus26) or outputs is all I-frame video, + we don't need to change m_iPPSQpMinus26. 
*/ + if ((abs(m_iLastSliceQp - (26 + m_iPPSQpMinus26)) > 1) && (m_iFrameNum > 1)) + m_iPPSQpMinus26 = (iLeastId + 1) - 26; + m_iFrameNum = 0; + } + + for (int i = 0; i < QP_MAX_MAX + 1; i++) + m_iBitsCostSum[i] = 0; + } + + frameEnc->m_encData->m_slice->m_iPPSQpMinus26 = m_iPPSQpMinus26; + frameEnc->m_encData->m_slice->numRefIdxDefault[0] = m_pps.numRefIdxDefault[0]; + frameEnc->m_encData->m_slice->numRefIdxDefault[1] = m_pps.numRefIdxDefault[1]; + frameEnc->m_encData->m_slice->m_iNumRPSInSPS = m_sps.spsrpsNum; + curEncoder->m_rce.encodeOrder = frameEnc->m_encodeOrder = m_encodedFrameNum++; if (m_bframeDelay) { @@ -1031,6 +1083,13 @@ x265_log(m_param, X265_LOG_INFO, "lossless compression ratio %.2f::1\n", uncompressed / m_analyzeAll.m_accBits); } + if (m_param->bMultiPassOptRPS && m_param->rc.bStatRead) + { + x265_log(m_param, X265_LOG_INFO, "RPS in SPS: %d frames (%.2f%%), RPS not in SPS: %d frames (%.2f%%)\n", + m_rpsInSpsCount, (float)100.0 * m_rpsInSpsCount / m_rateControl->m_numEntries, + m_rateControl->m_numEntries - m_rpsInSpsCount, + (float)100.0 * (m_rateControl->m_numEntries - m_rpsInSpsCount) / m_rateControl->m_numEntries); + } if (m_analyzeAll.m_numPics) { @@ -1353,6 +1412,7 @@ frameStats->qp = curEncData.m_avgQpAq; frameStats->bits = bits; frameStats->bScenecut = curFrame->m_lowres.bScenecut; + frameStats->bufferFill = m_rateControl->m_bufferFillActual; frameStats->frameLatency = inPoc - poc; if (m_param->rc.rateControlMode == X265_RC_CRF) frameStats->rateFactor = curEncData.m_rateFactor; @@ -1413,6 +1473,66 @@ #pragma warning(disable: 4127) // conditional expression is constant #endif +void Encoder::initRefIdx() +{ + int j = 0; + + for (j = 0; j < MAX_NUM_REF_IDX; j++) + { + m_refIdxLastGOP.numRefIdxl0[j] = 0; + m_refIdxLastGOP.numRefIdxl1[j] = 0; + } + + return; +} + +void Encoder::analyseRefIdx(int *numRefIdx) +{ + int i_l0 = 0; + int i_l1 = 0; + + i_l0 = numRefIdx[0]; + i_l1 = numRefIdx[1]; + + if ((0 < i_l0) && (MAX_NUM_REF_IDX > i_l0)) + m_refIdxLastGOP.numRefIdxl0[i_l0]++; + if ((0 < i_l1) && (MAX_NUM_REF_IDX > i_l1)) + m_refIdxLastGOP.numRefIdxl1[i_l1]++; + + return; +} + +void Encoder::updateRefIdx() +{ + int i_max_l0 = 0; + int i_max_l1 = 0; + int j = 0; + + i_max_l0 = 0; + i_max_l1 = 0; + m_refIdxLastGOP.numRefIdxDefault[0] = 1; + m_refIdxLastGOP.numRefIdxDefault[1] = 1; + for (j = 0; j < MAX_NUM_REF_IDX; j++) + { + if (i_max_l0 < m_refIdxLastGOP.numRefIdxl0[j]) + { + i_max_l0 = m_refIdxLastGOP.numRefIdxl0[j]; + m_refIdxLastGOP.numRefIdxDefault[0] = j; + } + if (i_max_l1 < m_refIdxLastGOP.numRefIdxl1[j]) + { + i_max_l1 = m_refIdxLastGOP.numRefIdxl1[j]; + m_refIdxLastGOP.numRefIdxDefault[1] = j; + } + } + + m_pps.numRefIdxDefault[0] = m_refIdxLastGOP.numRefIdxDefault[0]; + m_pps.numRefIdxDefault[1] = m_refIdxLastGOP.numRefIdxDefault[1]; + initRefIdx(); + + return; +} + void Encoder::getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs) { sbacCoder.setBitstream(&bs); @@ -1429,7 +1549,7 @@ list.serialize(NAL_UNIT_SPS, bs); bs.resetBits(); - sbacCoder.codePPS(m_pps, (m_param->maxSlices <= 1)); + sbacCoder.codePPS( m_pps, (m_param->maxSlices <= 1), m_iPPSQpMinus26); bs.writeByteAlignment(); list.serialize(NAL_UNIT_PPS, bs); @@ -1458,9 +1578,9 @@ list.serialize(NAL_UNIT_PREFIX_SEI, bs); } - if (!m_param->bDiscardSEI && m_param->bEmitInfoSEI) + if (m_param->bEmitInfoSEI) { - char *opts = x265_param2string(m_param); + char *opts = x265_param2string(m_param, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset); if (opts) { char *buffer = 
X265_MALLOC(char, strlen(opts) + strlen(PFX(version_str)) + @@ -1468,7 +1588,7 @@ if (buffer) { sprintf(buffer, "x265 (build %d) - %s:%s - H.265/HEVC codec - " - "Copyright 2013-2015 (c) Multicoreware Inc - " + "Copyright 2013-2016 (c) Multicoreware Inc - " "http://x265.org - options: %s", X265_BUILD, PFX(version_str), PFX(build_info_str), opts); @@ -1488,7 +1608,7 @@ } } - if (!m_param->bDiscardSEI && (m_param->bEmitHRDSEI || !!m_param->interlaceMode)) + if ((m_param->bEmitHRDSEI || !!m_param->interlaceMode)) { /* Picture Timing and Buffering Period SEI require the SPS to be "activated" */ SEIActiveParameterSets sei; @@ -1543,7 +1663,8 @@ sps->bUseStrongIntraSmoothing = m_param->bEnableStrongIntraSmoothing; sps->bTemporalMVPEnabled = m_param->bEnableTemporalMvp; - sps->bDiscardOptionalVUI = m_param->bDiscardOptionalVUI; + sps->bEmitVUITimingInfo = m_param->bEmitVUITimingInfo; + sps->bEmitVUIHRDInfo = m_param->bEmitVUIHRDInfo; sps->log2MaxPocLsb = m_param->log2MaxPocLsb; int maxDeltaPOC = (m_param->bframes + 2) * (!!m_param->bBPyramid + 1) * 2; while ((1 << sps->log2MaxPocLsb) <= maxDeltaPOC * 2) @@ -1621,6 +1742,9 @@ pps->deblockingFilterTcOffsetDiv2 = m_param->deblockingFilterTCOffset; pps->bEntropyCodingSyncEnabled = m_param->bEnableWavefront; + + pps->numRefIdxDefault[0] = 1; + pps->numRefIdxDefault[1] = 1; } void Encoder::configure(x265_param *p) @@ -1819,6 +1943,7 @@ m_bframeDelay = p->bframes ? (p->bBPyramid ? 2 : 1) : 0; p->bFrameBias = X265_MIN(X265_MAX(-90, p->bFrameBias), 100); + p->scenecutBias = (double)(p->scenecutBias / 100); if (p->logLevel < X265_LOG_INFO) { @@ -1849,6 +1974,12 @@ if (s) x265_log(p, X265_LOG_WARNING, "--tune %s should be used if attempting to benchmark %s!\n", s, s); } + if (p->searchMethod == X265_SEA && (p->bDistributeMotionEstimation || p->bDistributeModeAnalysis)) + { + x265_log(p, X265_LOG_WARNING, "Disabling pme and pmode: --pme and --pmode cannot be used with SEA motion search!\n"); + p->bDistributeMotionEstimation = 0; + p->bDistributeModeAnalysis = 0; + } /* some options make no sense if others are disabled */ p->bSaoNonDeblocked &= p->bEnableSAO; @@ -1878,6 +2009,11 @@ x265_log(p, X265_LOG_WARNING, "--rd-refine disabled, requires RD level > 4 and adaptive quant\n"); } + if (p->limitTU && p->tuQTMaxInterDepth < 2) + { + p->limitTU = 0; + x265_log(p, X265_LOG_WARNING, "limit-tu disabled, requires tu-inter-depth > 1\n"); + } bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv)) { @@ -2013,6 +2149,19 @@ p->log2MaxPocLsb = 4; } + if (p->maxSlices < 1) + { + x265_log(p, X265_LOG_WARNING, "maxSlices can not be less than 1, force set to 1\n"); + p->maxSlices = 1; + } + + const uint32_t numRows = (p->sourceHeight + p->maxCUSize - 1) / p->maxCUSize; + const uint32_t slicesLimit = X265_MIN(numRows, NALList::MAX_NAL_UNITS - 1); + if (p->maxSlices > numRows) + { + x265_log(p, X265_LOG_WARNING, "maxSlices can not be more than min(rows, MAX_NAL_UNITS-1), force set to %d\n", slicesLimit); + p->maxSlices = slicesLimit; + } } void Encoder::allocAnalysis(x265_analysis_data* analysis) @@ -2309,10 +2458,10 @@ x265_param* oldParam = m_param; x265_param* newParam = m_latestParam; - x265_log(newParam, X265_LOG_INFO, "Reconfigured param options, input Frame: %d\n", m_pocLast + 1); + x265_log(newParam, X265_LOG_DEBUG, "Reconfigured param options, input Frame: %d\n", m_pocLast + 1); char tmp[40]; -#define TOOLCMP(COND1, COND2, STR) if (COND1 != COND2) { sprintf(tmp, STR, COND1, COND2); 
x265_log(newParam, X265_LOG_INFO, tmp); } +#define TOOLCMP(COND1, COND2, STR) if (COND1 != COND2) { sprintf(tmp, STR, COND1, COND2); x265_log(newParam, X265_LOG_DEBUG, tmp); } TOOLCMP(oldParam->maxNumReferences, newParam->maxNumReferences, "ref=%d to %d\n"); TOOLCMP(oldParam->bEnableFastIntra, newParam->bEnableFastIntra, "fast-intra=%d to %d\n"); TOOLCMP(oldParam->bEnableEarlySkip, newParam->bEnableEarlySkip, "early-skip=%d to %d\n"); @@ -2326,3 +2475,208 @@ TOOLCMP(oldParam->maxNumMergeCand, newParam->maxNumMergeCand, "max-merge=%d to %d\n"); TOOLCMP(oldParam->bIntraInBFrames, newParam->bIntraInBFrames, "b-intra=%d to %d\n"); } + +bool Encoder::computeSPSRPSIndex() +{ + RPS* rpsInSPS = m_sps.spsrps; + int* rpsNumInPSP = &m_sps.spsrpsNum; + int beginNum = m_sps.numGOPBegin; + int endNum; + RPS* rpsInRec; + RPS* rpsInIdxList; + RPS* thisRpsInSPS; + RPS* thisRpsInList; + RPSListNode* headRpsIdxList = NULL; + RPSListNode* tailRpsIdxList = NULL; + RPSListNode* rpsIdxListIter = NULL; + RateControlEntry *rce2Pass = m_rateControl->m_rce2Pass; + int numEntries = m_rateControl->m_numEntries; + RateControlEntry *rce; + int idx = 0; + int pos = 0; + int resultIdx[64]; + memset(rpsInSPS, 0, sizeof(RPS) * MAX_NUM_SHORT_TERM_RPS); + + // find out all RPS date in current GOP + beginNum++; + endNum = beginNum; + if (!m_param->bRepeatHeaders) + { + endNum = numEntries; + } + else + { + while (endNum < numEntries) + { + rce = &rce2Pass[endNum]; + if (rce->sliceType == I_SLICE) + { + if (m_param->keyframeMin && (endNum - beginNum + 1 < m_param->keyframeMin)) + { + endNum++; + continue; + } + break; + } + endNum++; + } + } + m_sps.numGOPBegin = endNum; + + // find out all kinds of RPS + for (int i = beginNum; i < endNum; i++) + { + rce = &rce2Pass[i]; + rpsInRec = &rce->rpsData; + rpsIdxListIter = headRpsIdxList; + // i frame don't recode RPS info + if (rce->sliceType != I_SLICE) + { + while (rpsIdxListIter) + { + rpsInIdxList = rpsIdxListIter->rps; + if (rpsInRec->numberOfPictures == rpsInIdxList->numberOfPictures + && rpsInRec->numberOfNegativePictures == rpsInIdxList->numberOfNegativePictures + && rpsInRec->numberOfPositivePictures == rpsInIdxList->numberOfPositivePictures) + { + for (pos = 0; pos < rpsInRec->numberOfPictures; pos++) + { + if (rpsInRec->deltaPOC[pos] != rpsInIdxList->deltaPOC[pos] + || rpsInRec->bUsed[pos] != rpsInIdxList->bUsed[pos]) + break; + } + if (pos == rpsInRec->numberOfPictures) // if this type of RPS has exist + { + rce->rpsIdx = rpsIdxListIter->idx; + rpsIdxListIter->count++; + // sort RPS type link after reset RPS type count. 
+ RPSListNode* next = rpsIdxListIter->next; + RPSListNode* prior = rpsIdxListIter->prior; + RPSListNode* iter = prior; + if (iter) + { + while (iter) + { + if (iter->count > rpsIdxListIter->count) + break; + iter = iter->prior; + } + if (iter) + { + prior->next = next; + if (next) + next->prior = prior; + else + tailRpsIdxList = prior; + rpsIdxListIter->next = iter->next; + rpsIdxListIter->prior = iter; + iter->next->prior = rpsIdxListIter; + iter->next = rpsIdxListIter; + } + else + { + prior->next = next; + if (next) + next->prior = prior; + else + tailRpsIdxList = prior; + headRpsIdxList->prior = rpsIdxListIter; + rpsIdxListIter->next = headRpsIdxList; + rpsIdxListIter->prior = NULL; + headRpsIdxList = rpsIdxListIter; + } + } + break; + } + } + rpsIdxListIter = rpsIdxListIter->next; + } + if (!rpsIdxListIter) // add new type of RPS + { + RPSListNode* newIdxNode = new RPSListNode(); + if (newIdxNode == NULL) + goto fail; + newIdxNode->rps = rpsInRec; + newIdxNode->idx = idx++; + newIdxNode->count = 1; + newIdxNode->next = NULL; + newIdxNode->prior = NULL; + if (!tailRpsIdxList) + tailRpsIdxList = headRpsIdxList = newIdxNode; + else + { + tailRpsIdxList->next = newIdxNode; + newIdxNode->prior = tailRpsIdxList; + tailRpsIdxList = newIdxNode; + } + rce->rpsIdx = newIdxNode->idx; + } + } + else + { + rce->rpsIdx = -1; + } + } + + // get commonly RPS set + memset(resultIdx, 0, sizeof(resultIdx)); + if (idx > MAX_NUM_SHORT_TERM_RPS) + idx = MAX_NUM_SHORT_TERM_RPS; + + *rpsNumInPSP = idx; + rpsIdxListIter = headRpsIdxList; + for (int i = 0; i < idx; i++) + { + resultIdx[i] = rpsIdxListIter->idx; + m_rpsInSpsCount += rpsIdxListIter->count; + thisRpsInSPS = rpsInSPS + i; + thisRpsInList = rpsIdxListIter->rps; + thisRpsInSPS->numberOfPictures = thisRpsInList->numberOfPictures; + thisRpsInSPS->numberOfNegativePictures = thisRpsInList->numberOfNegativePictures; + thisRpsInSPS->numberOfPositivePictures = thisRpsInList->numberOfPositivePictures; + for (pos = 0; pos < thisRpsInList->numberOfPictures; pos++) + { + thisRpsInSPS->deltaPOC[pos] = thisRpsInList->deltaPOC[pos]; + thisRpsInSPS->bUsed[pos] = thisRpsInList->bUsed[pos]; + } + rpsIdxListIter = rpsIdxListIter->next; + } + + //reset every frame's RPS index + for (int i = beginNum; i < endNum; i++) + { + int j; + rce = &rce2Pass[i]; + for (j = 0; j < idx; j++) + { + if (rce->rpsIdx == resultIdx[j]) + { + rce->rpsIdx = j; + break; + } + } + + if (j == idx) + rce->rpsIdx = -1; + } + + rpsIdxListIter = headRpsIdxList; + while (rpsIdxListIter) + { + RPSListNode* freeIndex = rpsIdxListIter; + rpsIdxListIter = rpsIdxListIter->next; + delete freeIndex; + } + return true; + +fail: + rpsIdxListIter = headRpsIdxList; + while (rpsIdxListIter) + { + RPSListNode* freeIndex = rpsIdxListIter; + rpsIdxListIter = rpsIdxListIter->next; + delete freeIndex; + } + return false; +} +
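The --opt-qp-pps path above keeps, for every candidate init QP, the total bits that slice_qp_delta would have cost over the previous GOP (accumulated in m_iBitsCostSum by the frame encoder) and re-centres init_qp_minus26 on the cheapest candidate at the next keyframe. A condensed sketch of just the selection step:

    // Pick init_qp_minus26 from accumulated slice_qp_delta bit costs.
    // Index i of the table corresponds to candidate QP (i + 1), as above.
    #include <cstdint>
    #include <vector>

    static int pickInitQpMinus26(const std::vector<int64_t>& bitsCostSum)
    {
        size_t best = 0;
        for (size_t i = 1; i < bitsCostSum.size(); i++)
            if (bitsCostSum[i] < bitsCostSum[best])
                best = i;                     // keep first/least-cost candidate
        return ((int)best + 1) - 26;          // same re-centring as m_iPPSQpMinus26
    }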
View file
x265_2.1.tar.gz/source/encoder/encoder.h -> x265_2.2.tar.gz/source/encoder/encoder.h
Changed
@@ -26,6 +26,7 @@ #include "common.h" #include "slice.h" +#include "threading.h" #include "scalinglist.h" #include "x265.h" #include "nal.h" @@ -69,6 +70,24 @@ void addSsim(double ssim); }; +#define MAX_NUM_REF_IDX 64 + +struct RefIdxLastGOP +{ + int numRefIdxDefault[2]; + int numRefIdxl0[MAX_NUM_REF_IDX]; + int numRefIdxl1[MAX_NUM_REF_IDX]; +}; + +struct RPSListNode +{ + int idx; + int count; + RPS* rps; + RPSListNode* next; + RPSListNode* prior; +}; + class FrameEncoder; class DPB; class Lookahead; @@ -136,6 +155,19 @@ * one is done. Requires bIntraRefresh to be set.*/ int m_bQueuedIntraRefresh; + /* For optimising slice QP */ + Lock m_sliceQpLock; + int m_iFrameNum; + int m_iPPSQpMinus26; + int m_iLastSliceQp; + int64_t m_iBitsCostSum[QP_MAX_MAX + 1]; + + Lock m_sliceRefIdxLock; + RefIdxLastGOP m_refIdxLastGOP; + + Lock m_rpsInSpsLock; + int m_rpsInSpsCount; + Encoder(); ~Encoder() {} @@ -173,6 +205,11 @@ void calcRefreshInterval(Frame* frameEnc); + void initRefIdx(); + void analyseRefIdx(int *numRefIdx); + void updateRefIdx(); + bool computeSPSRPSIndex(); + protected: void initVPS(VPS *vps);
View file
x265_2.1.tar.gz/source/encoder/entropy.cpp -> x265_2.2.tar.gz/source/encoder/entropy.cpp
Changed
@@ -312,19 +312,21 @@ WRITE_FLAG(sps.bUseSAO, "sample_adaptive_offset_enabled_flag"); WRITE_FLAG(0, "pcm_enabled_flag"); - WRITE_UVLC(0, "num_short_term_ref_pic_sets"); + WRITE_UVLC(sps.spsrpsNum, "num_short_term_ref_pic_sets"); + for (int i = 0; i < sps.spsrpsNum; i++) + codeShortTermRefPicSet(sps.spsrps[i], i); WRITE_FLAG(0, "long_term_ref_pics_present_flag"); WRITE_FLAG(sps.bTemporalMVPEnabled, "sps_temporal_mvp_enable_flag"); WRITE_FLAG(sps.bUseStrongIntraSmoothing, "sps_strong_intra_smoothing_enable_flag"); WRITE_FLAG(1, "vui_parameters_present_flag"); - codeVUI(sps.vuiParameters, sps.maxTempSubLayers, sps.bDiscardOptionalVUI); + codeVUI(sps.vuiParameters, sps.maxTempSubLayers, sps.bEmitVUITimingInfo, sps.bEmitVUIHRDInfo); WRITE_FLAG(0, "sps_extension_flag"); } -void Entropy::codePPS(const PPS& pps, bool filerAcross) +void Entropy::codePPS( const PPS& pps, bool filerAcross, int iPPSInitQpMinus26 ) { WRITE_UVLC(0, "pps_pic_parameter_set_id"); WRITE_UVLC(0, "pps_seq_parameter_set_id"); @@ -333,10 +335,10 @@ WRITE_CODE(0, 3, "num_extra_slice_header_bits"); WRITE_FLAG(pps.bSignHideEnabled, "sign_data_hiding_flag"); WRITE_FLAG(0, "cabac_init_present_flag"); - WRITE_UVLC(0, "num_ref_idx_l0_default_active_minus1"); - WRITE_UVLC(0, "num_ref_idx_l1_default_active_minus1"); + WRITE_UVLC(pps.numRefIdxDefault[0] - 1, "num_ref_idx_l0_default_active_minus1"); + WRITE_UVLC(pps.numRefIdxDefault[1] - 1, "num_ref_idx_l1_default_active_minus1"); - WRITE_SVLC(0, "init_qp_minus26"); + WRITE_SVLC(iPPSInitQpMinus26, "init_qp_minus26"); WRITE_FLAG(pps.bConstrainedIntraPred, "constrained_intra_pred_flag"); WRITE_FLAG(pps.bTransformSkipEnabled, "transform_skip_enabled_flag"); @@ -422,7 +424,7 @@ } } -void Entropy::codeVUI(const VUI& vui, int maxSubTLayers, bool bDiscardOptionalVUI) +void Entropy::codeVUI(const VUI& vui, int maxSubTLayers, bool bEmitVUITimingInfo, bool bEmitVUIHRDInfo) { WRITE_FLAG(vui.aspectRatioInfoPresentFlag, "aspect_ratio_info_present_flag"); if (vui.aspectRatioInfoPresentFlag) @@ -473,7 +475,7 @@ WRITE_UVLC(vui.defaultDisplayWindow.bottomOffset, "def_disp_win_bottom_offset"); } - if (bDiscardOptionalVUI) + if (!bEmitVUITimingInfo) WRITE_FLAG(0, "vui_timing_info_present_flag"); else { @@ -483,7 +485,7 @@ WRITE_FLAG(0, "vui_poc_proportional_to_timing_flag"); } - if (bDiscardOptionalVUI) + if (!bEmitVUIHRDInfo) WRITE_FLAG(0, "vui_hrd_parameters_present_flag"); else { @@ -614,8 +616,21 @@ } #endif - WRITE_FLAG(0, "short_term_ref_pic_set_sps_flag"); - codeShortTermRefPicSet(slice.m_rps); + if (slice.m_rpsIdx < 0) + { + WRITE_FLAG(0, "short_term_ref_pic_set_sps_flag"); + codeShortTermRefPicSet(slice.m_rps, slice.m_sps->spsrpsNum); + } + else + { + WRITE_FLAG(1, "short_term_ref_pic_set_sps_flag"); + int numBits = 0; + while ((1 << numBits) < slice.m_iNumRPSInSPS) + numBits++; + + if (numBits > 0) + WRITE_CODE(slice.m_rpsIdx, numBits, "short_term_ref_pic_set_idx"); + } if (slice.m_sps->bTemporalMVPEnabled) WRITE_FLAG(1, "slice_temporal_mvp_enable_flag"); @@ -633,7 +648,7 @@ if (!slice.isIntra()) { - bool overrideFlag = (slice.m_numRefIdx[0] != 1 || (slice.isInterB() && slice.m_numRefIdx[1] != 1)); + bool overrideFlag = (slice.m_numRefIdx[0] != slice.numRefIdxDefault[0] || (slice.isInterB() && slice.m_numRefIdx[1] != slice.numRefIdxDefault[1])); WRITE_FLAG(overrideFlag, "num_ref_idx_active_override_flag"); if (overrideFlag) { @@ -673,7 +688,7 @@ if (!slice.isIntra()) WRITE_UVLC(MRG_MAX_NUM_CANDS - slice.m_maxNumMergeCand, "five_minus_max_num_merge_cand"); - int code = sliceQp - 26; + int code = 
sliceQp - (slice.m_iPPSQpMinus26 + 26); WRITE_SVLC(code, "slice_qp_delta"); // TODO: Enable when pps_loop_filter_across_slices_enabled_flag==1 @@ -707,8 +722,11 @@ WRITE_CODE(substreamSizes[i] - 1, offsetLen, "entry_point_offset_minus1"); } -void Entropy::codeShortTermRefPicSet(const RPS& rps) +void Entropy::codeShortTermRefPicSet(const RPS& rps, int idx) { + if (idx > 0) + WRITE_FLAG(0, "inter_ref_pic_set_prediction_flag"); + WRITE_UVLC(rps.numberOfNegativePictures, "num_negative_pics"); WRITE_UVLC(rps.numberOfPositivePictures, "num_positive_pics"); int prev = 0;
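When a slice reuses one of the RPS entries signalled in the SPS, the code above writes short_term_ref_pic_set_idx with just enough bits to distinguish the available sets. The width computation, extracted here as a stand-alone helper for clarity:

    // Bits needed to signal an index among numRpsInSps short-term RPS sets,
    // matching the while-loop in codeSliceHeader() above.
    static int rpsIdxBits(int numRpsInSps)
    {
        int numBits = 0;
        while ((1 << numBits) < numRpsInSps)
            numBits++;
        return numBits;   // 1 set -> 0 bits, 2 -> 1, 5 -> 3, 16 -> 4
    }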
View file
x265_2.1.tar.gz/source/encoder/entropy.h -> x265_2.2.tar.gz/source/encoder/entropy.h
Changed
@@ -142,14 +142,14 @@ void codeVPS(const VPS& vps); void codeSPS(const SPS& sps, const ScalingList& scalingList, const ProfileTierLevel& ptl); - void codePPS(const PPS& pps, bool filerAcross); - void codeVUI(const VUI& vui, int maxSubTLayers, bool discardOptionalVUI); + void codePPS( const PPS& pps, bool filerAcross, int iPPSInitQpMinus26 ); + void codeVUI(const VUI& vui, int maxSubTLayers, bool bEmitVUITimingInfo, bool bEmitVUIHRDInfo); void codeAUD(const Slice& slice); void codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers); void codeSliceHeader(const Slice& slice, FrameData& encData, uint32_t slice_addr, uint32_t slice_addr_bits, int sliceQp); void codeSliceHeaderWPPEntryPoints(const uint32_t *substreamSizes, uint32_t numSubStreams, uint32_t maxOffset); - void codeShortTermRefPicSet(const RPS& rps); + void codeShortTermRefPicSet(const RPS& rps, int idx); void finishSlice() { encodeBinTrm(1); finish(); dynamic_cast<Bitstream*>(m_bitIf)->writeByteAlignment(); } void encodeCTU(const CUData& cu, const CUGeom& cuGeom);
View file
x265_2.1.tar.gz/source/encoder/frameencoder.cpp -> x265_2.2.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -50,6 +50,7 @@ m_bAllRowsStop = false; m_vbvResetTriggerRow = -1; m_outStreams = NULL; + m_backupStreams = NULL; m_substreamSizes = NULL; m_nr = NULL; m_tld = NULL; @@ -85,6 +86,7 @@ delete[] m_rows; delete[] m_outStreams; + delete[] m_backupStreams; X265_FREE(m_sliceBaseRow); X265_FREE(m_cuGeoms); X265_FREE(m_ctuGeomMap); @@ -121,7 +123,7 @@ int range = m_param->searchRange; /* fpel search */ range += !!(m_param->searchMethod < 2); /* diamond/hex range check lag */ range += NTAPS_LUMA / 2; /* subpel filter half-length */ - range += 2 + MotionEstimate::hpelIterationCount(m_param->subpelRefine) / 2; /* subpel refine steps */ + range += 2 + (MotionEstimate::hpelIterationCount(m_param->subpelRefine) + 1) / 2; /* subpel refine steps */ m_refLagRows = /*(m_param->maxSlices > 1 ? 1 : 0) +*/ 1 + ((range + g_maxCUSize - 1) / g_maxCUSize); // NOTE: 2 times of numRows because both Encoder and Filter in same queue @@ -152,7 +154,7 @@ // 7.4.7.1 - Ceil( Log2( PicSizeInCtbsY ) ) bits { unsigned long tmp; - CLZ(tmp, (numRows * numCols)); + CLZ(tmp, (numRows * numCols - 1)); m_sliceAddrBits = (uint16_t)(tmp + 1); } @@ -305,6 +307,19 @@ weightAnalyse(*frame->m_encData->m_slice, *frame, *master.m_param); } + +uint32_t getBsLength( int32_t code ) +{ + uint32_t ucode = (code <= 0) ? -code << 1 : (code << 1) - 1; + + ++ucode; + unsigned long idx; + CLZ( idx, ucode ); + uint32_t length = (uint32_t)idx * 2 + 1; + + return length; +} + void FrameEncoder::compressFrame() { ProfileScopeEvent(frameThread); @@ -340,7 +355,28 @@ m_nalList.serialize(NAL_UNIT_ACCESS_UNIT_DELIMITER, m_bs); } if (m_frame->m_lowres.bKeyframe && m_param->bRepeatHeaders) - m_top->getStreamHeaders(m_nalList, m_entropyCoder, m_bs); + { + if (m_param->bOptRefListLengthPPS) + { + ScopedLock refIdxLock(m_top->m_sliceRefIdxLock); + m_top->updateRefIdx(); + } + if (m_top->m_param->rc.bStatRead && m_top->m_param->bMultiPassOptRPS) + { + ScopedLock refIdxLock(m_top->m_rpsInSpsLock); + if (!m_top->computeSPSRPSIndex()) + { + x265_log(m_param, X265_LOG_ERROR, "compute commonly RPS failed!\n"); + m_top->m_aborted = true; + } + m_top->getStreamHeaders(m_nalList, m_entropyCoder, m_bs); + } + else + m_top->getStreamHeaders(m_nalList, m_entropyCoder, m_bs); + } + + if (m_top->m_param->rc.bStatRead && m_top->m_param->bMultiPassOptRPS) + m_frame->m_encData->m_slice->m_rpsIdx = (m_top->m_rateControl->m_rce2Pass + m_frame->m_encodeOrder)->rpsIdx; // Weighted Prediction parameters estimation. 
bool bUseWeightP = slice->m_sliceType == P_SLICE && slice->m_pps->bUseWeightPred; @@ -448,6 +484,19 @@ /* Clip slice QP to 0-51 spec range before encoding */ slice->m_sliceQp = x265_clip3(-QP_BD_OFFSET, QP_MAX_SPEC, qp); + if (m_param->bOptQpPPS && m_param->bRepeatHeaders) + { + ScopedLock qpLock(m_top->m_sliceQpLock); + for (int i = 0; i < (QP_MAX_MAX + 1); i++) + { + int delta = slice->m_sliceQp - (i + 1); + int codeLength = getBsLength( delta ); + m_top->m_iBitsCostSum[i] += codeLength; + } + m_top->m_iFrameNum++; + m_top->m_iLastSliceQp = slice->m_sliceQp; + } + m_initSliceContext.resetEntropy(*slice); m_frameFilter.start(m_frame, m_initSliceContext); @@ -485,6 +534,8 @@ if (!m_outStreams) { m_outStreams = new Bitstream[numSubstreams]; + if (!m_param->bEnableWavefront) + m_backupStreams = new Bitstream[numSubstreams]; m_substreamSizes = X265_MALLOC(uint32_t, numSubstreams); if (!m_param->bEnableSAO) for (uint32_t i = 0; i < numSubstreams; i++) @@ -498,7 +549,7 @@ if (m_frame->m_lowres.bKeyframe) { - if (!m_param->bDiscardSEI && m_param->bEmitHRDSEI) + if (m_param->bEmitHRDSEI) { SEIBufferingPeriod* bpSei = &m_top->m_rateControl->m_bufPeriodSEI; @@ -520,7 +571,7 @@ } } - if (!m_param->bDiscardSEI && (m_param->bEmitHRDSEI || !!m_param->interlaceMode)) + if ((m_param->bEmitHRDSEI || !!m_param->interlaceMode)) { SEIPictureTiming *sei = m_rce.picTimingSEI; const VUI *vui = &slice->m_sps->vuiParameters; @@ -556,22 +607,19 @@ } /* Write user SEI */ - if (!m_param->bDiscardSEI) + for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++) { - for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++) - { - x265_sei_payload *payload = &m_frame->m_userSEI.payloads[i]; - SEIuserDataUnregistered sei; + x265_sei_payload *payload = &m_frame->m_userSEI.payloads[i]; + SEIuserDataUnregistered sei; - sei.m_payloadType = payload->payloadType; - sei.m_userDataLength = payload->payloadSize; - sei.m_userData = payload->payload; + sei.m_payloadType = payload->payloadType; + sei.m_userDataLength = payload->payloadSize; + sei.m_userData = payload->payload; - m_bs.resetBits(); - sei.write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); - } + m_bs.resetBits(); + sei.write(m_bs, *slice->m_sps); + m_bs.writeByteAlignment(); + m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); } /* CQP and CRF (without capped VBV) doesn't use mid-frame statistics to @@ -606,8 +654,7 @@ const uint32_t sliceEndRow = m_sliceBaseRow[sliceId + 1] - 1; const uint32_t row = sliceStartRow + rowInSlice; - if (row >= m_numRows) - break; + X265_CHECK(row < m_numRows, "slices row fault was detected"); if (row > sliceEndRow) continue; @@ -626,7 +673,7 @@ refpic->m_reconRowFlag[rowIdx].waitForChange(0); if ((bUseWeightP || bUseWeightB) && m_mref[l][ref].isWeighted) - m_mref[l][ref].applyWeight(row + m_refLagRows, m_numRows, sliceEndRow + 1, sliceId); + m_mref[l][ref].applyWeight(rowIdx, m_numRows, sliceEndRow, sliceId); } } @@ -666,7 +713,7 @@ refpic->m_reconRowFlag[rowIdx].waitForChange(0); if ((bUseWeightP || bUseWeightB) && m_mref[l][ref].isWeighted) - m_mref[list][ref].applyWeight(i + m_refLagRows, m_numRows, m_numRows, 0); + m_mref[list][ref].applyWeight(rowIdx, m_numRows, m_numRows, 0); } } @@ -830,6 +877,11 @@ const uint32_t sliceAddr = nextSliceRow * m_numCols; //CUData* ctu = m_frame->m_encData->getPicCTU(sliceAddr); //const int sliceQp = ctu->m_qp[0]; + if (m_param->bOptRefListLengthPPS) + { + ScopedLock refIdxLock(m_top->m_sliceRefIdxLock); + m_top->analyseRefIdx(slice->m_numRefIdx); + } 
m_entropyCoder.codeSliceHeader(*slice, *m_frame->m_encData, sliceAddr, m_sliceAddrBits, slice->m_sliceQp); // Find rows of current slice @@ -853,6 +905,11 @@ } else { + if (m_param->bOptRefListLengthPPS) + { + ScopedLock refIdxLock(m_top->m_sliceRefIdxLock); + m_top->analyseRefIdx(slice->m_numRefIdx); + } m_entropyCoder.codeSliceHeader(*slice, *m_frame->m_encData, 0, 0, slice->m_sliceQp); // serialize each row, record final lengths in slice header @@ -868,7 +925,7 @@ } - if (!m_param->bDiscardSEI && m_param->decodedPictureHashSEI) + if (m_param->decodedPictureHashSEI) { int planes = (m_frame->m_param->internalCsp != X265_CSP_I400) ? 3 : 1; if (m_param->decodedPictureHashSEI == 1) @@ -1129,8 +1186,8 @@ // TODO: specially case handle on first and last row // Initialize restrict on MV range in slices - tld.analysis.m_sliceMinY = -(int16_t)(rowInSlice * g_maxCUSize * 4) + 2 * 4; - tld.analysis.m_sliceMaxY = (int16_t)((endRowInSlicePlus1 - 1 - row) * (g_maxCUSize * 4) - 3 * 4); + tld.analysis.m_sliceMinY = -(int16_t)(rowInSlice * g_maxCUSize * 4) + 3 * 4; + tld.analysis.m_sliceMaxY = (int16_t)((endRowInSlicePlus1 - 1 - row) * (g_maxCUSize * 4) - 4 * 4); // Handle single row slice if (tld.analysis.m_sliceMaxY < tld.analysis.m_sliceMinY) @@ -1149,17 +1206,25 @@ if (bIsVbv) { - if (!row) + if (col == 0 && !m_param->bEnableWavefront) { - curEncData.m_rowStat[row].diagQp = curEncData.m_avgQpRc; - curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(curEncData.m_avgQpRc); + m_backupStreams[0].copyBits(&m_outStreams[0]); + curRow.bufferedEntropy.copyState(rowCoder); + curRow.bufferedEntropy.loadContexts(rowCoder); + } + if (!row && m_vbvResetTriggerRow != intRow) + { + curEncData.m_rowStat[row].rowQp = curEncData.m_avgQpRc; + curEncData.m_rowStat[row].rowQpScale = x265_qp2qScale(curEncData.m_avgQpRc); } FrameData::RCStatCU& cuStat = curEncData.m_cuStat[cuAddr]; - if (row >= col && row && m_vbvResetTriggerRow != intRow) + if (m_param->bEnableWavefront && row >= col && row && m_vbvResetTriggerRow != intRow) cuStat.baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp; + else if (!m_param->bEnableWavefront && row && m_vbvResetTriggerRow != intRow) + cuStat.baseQp = curEncData.m_rowStat[row - 1].rowQp; else - cuStat.baseQp = curEncData.m_rowStat[row].diagQp; + cuStat.baseQp = curEncData.m_rowStat[row].rowQp; /* TODO: use defines from slicetype.h for lowres block size */ uint32_t block_y = (ctu->m_cuPelY >> g_maxLog2CUSize) * noOfBlocks; @@ -1310,21 +1375,52 @@ if (bIsVbv) { // Update encoded bits, satdCost, baseQP for each CU - curEncData.m_rowStat[row].diagSatd += curEncData.m_cuStat[cuAddr].vbvCost; - curEncData.m_rowStat[row].diagIntraSatd += curEncData.m_cuStat[cuAddr].intraVbvCost; + curEncData.m_rowStat[row].rowSatd += curEncData.m_cuStat[cuAddr].vbvCost; + curEncData.m_rowStat[row].rowIntraSatd += curEncData.m_cuStat[cuAddr].intraVbvCost; curEncData.m_rowStat[row].encodedBits += curEncData.m_cuStat[cuAddr].totalBits; curEncData.m_rowStat[row].sumQpRc += curEncData.m_cuStat[cuAddr].baseQp; curEncData.m_rowStat[row].numEncodedCUs = cuAddr; + // If current block is at row end checkpoint, call vbv ratecontrol. 
+ + if (!m_param->bEnableWavefront && col == numCols - 1) + { + double qpBase = curEncData.m_cuStat[cuAddr].baseQp; + int reEncode = m_top->m_rateControl->rowVbvRateControl(m_frame, row, &m_rce, qpBase); + qpBase = x265_clip3((double)m_param->rc.qpMin, (double)m_param->rc.qpMax, qpBase); + curEncData.m_rowStat[row].rowQp = qpBase; + curEncData.m_rowStat[row].rowQpScale = x265_qp2qScale(qpBase); + if (reEncode < 0) + { + x265_log(m_param, X265_LOG_DEBUG, "POC %d row %d - encode restart required for VBV, to %.2f from %.2f\n", + m_frame->m_poc, row, qpBase, curEncData.m_cuStat[cuAddr].baseQp); + + m_vbvResetTriggerRow = row; + m_outStreams[0].copyBits(&m_backupStreams[0]); + + rowCoder.copyState(curRow.bufferedEntropy); + rowCoder.loadContexts(curRow.bufferedEntropy); + + curRow.completed = 0; + memset(&curRow.rowStats, 0, sizeof(curRow.rowStats)); + curEncData.m_rowStat[row].numEncodedCUs = 0; + curEncData.m_rowStat[row].encodedBits = 0; + curEncData.m_rowStat[row].rowSatd = 0; + curEncData.m_rowStat[row].rowIntraSatd = 0; + curEncData.m_rowStat[row].sumQpRc = 0; + curEncData.m_rowStat[row].sumQpAq = 0; + } + } + // If current block is at row diagonal checkpoint, call vbv ratecontrol. - if (row == col && row) + else if (m_param->bEnableWavefront && row == col && row) { double qpBase = curEncData.m_cuStat[cuAddr].baseQp; - int reEncode = m_top->m_rateControl->rowDiagonalVbvRateControl(m_frame, row, &m_rce, qpBase); + int reEncode = m_top->m_rateControl->rowVbvRateControl(m_frame, row, &m_rce, qpBase); qpBase = x265_clip3((double)m_param->rc.qpMin, (double)m_param->rc.qpMax, qpBase); - curEncData.m_rowStat[row].diagQp = qpBase; - curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(qpBase); + curEncData.m_rowStat[row].rowQp = qpBase; + curEncData.m_rowStat[row].rowQpScale = x265_qp2qScale(qpBase); if (reEncode < 0) { @@ -1377,8 +1473,8 @@ memset(&stopRow.rowStats, 0, sizeof(stopRow.rowStats)); curEncData.m_rowStat[r].numEncodedCUs = 0; curEncData.m_rowStat[r].encodedBits = 0; - curEncData.m_rowStat[r].diagSatd = 0; - curEncData.m_rowStat[r].diagIntraSatd = 0; + curEncData.m_rowStat[r].rowSatd = 0; + curEncData.m_rowStat[r].rowIntraSatd = 0; curEncData.m_rowStat[r].sumQpRc = 0; curEncData.m_rowStat[r].sumQpAq = 0; } @@ -1405,7 +1501,7 @@ ScopedLock self(curRow.lock); if ((m_bAllRowsStop && intRow > m_vbvResetTriggerRow) || - (!bFirstRowInSlice && ((curRow.completed < numCols - 1) || (m_rows[row - 1].completed < numCols)) && m_rows[row - 1].completed < m_rows[row].completed + 2)) + (!bFirstRowInSlice && ((curRow.completed < numCols - 1) || (m_rows[row - 1].completed < numCols)) && m_rows[row - 1].completed < curRow.completed + 2)) { curRow.active = false; curRow.busy = false;
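getBsLength() above returns the length in bits of a signed Exp-Golomb (se(v)) codeword; compressFrame() uses it to price slice_qp_delta against every candidate PPS init QP. A portable sketch of the same computation without the CLZ macro:

    // Bit length of an se(v) codeword: map the signed value to its code
    // number, add one, and the length is 2*floor(log2(codeNum+1)) + 1.
    #include <cstdint>

    static uint32_t seBitLength(int32_t code)
    {
        uint32_t ucode = (code <= 0) ? (uint32_t)(-code) << 1
                                     : ((uint32_t)code << 1) - 1;
        ++ucode;
        uint32_t msb = 0;
        while (ucode >> (msb + 1))
            msb++;                    // msb = floor(log2(ucode))
        return 2 * msb + 1;           // prefix zeros + marker bit + suffix bits
    }
    // seBitLength(0) == 1, seBitLength(1) == 3, seBitLength(-1) == 3, seBitLength(2) == 5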
View file
x265_2.1.tar.gz/source/encoder/frameencoder.h -> x265_2.2.tar.gz/source/encoder/frameencoder.h
Changed
@@ -184,6 +184,7 @@ NoiseReduction* m_nr; ThreadLocalData* m_tld; /* for --no-wpp */ Bitstream* m_outStreams; + Bitstream* m_backupStreams; uint32_t* m_substreamSizes; CUGeom* m_cuGeoms;
View file
x265_2.1.tar.gz/source/encoder/framefilter.cpp -> x265_2.2.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -35,6 +35,109 @@ static uint64_t computeSSD(pixel *fenc, pixel *rec, intptr_t stride, uint32_t width, uint32_t height); static float calculateSSIM(pixel *pix1, intptr_t stride1, pixel *pix2, intptr_t stride2, uint32_t width, uint32_t height, void *buf, uint32_t& cnt); +static void integral_init4h(uint32_t *sum, pixel *pix, intptr_t stride) +{ + int32_t v = pix[0] + pix[1] + pix[2] + pix[3]; + for (int16_t x = 0; x < stride - 4; x++) + { + sum[x] = v + sum[x - stride]; + v += pix[x + 4] - pix[x]; + } +} + +static void integral_init8h(uint32_t *sum, pixel *pix, intptr_t stride) +{ + int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7]; + for (int16_t x = 0; x < stride - 8; x++) + { + sum[x] = v + sum[x - stride]; + v += pix[x + 8] - pix[x]; + } +} + +static void integral_init12h(uint32_t *sum, pixel *pix, intptr_t stride) +{ + int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] + + pix[8] + pix[9] + pix[10] + pix[11]; + for (int16_t x = 0; x < stride - 12; x++) + { + sum[x] = v + sum[x - stride]; + v += pix[x + 12] - pix[x]; + } +} + +static void integral_init16h(uint32_t *sum, pixel *pix, intptr_t stride) +{ + int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] + + pix[8] + pix[9] + pix[10] + pix[11] + pix[12] + pix[13] + pix[14] + pix[15]; + for (int16_t x = 0; x < stride - 16; x++) + { + sum[x] = v + sum[x - stride]; + v += pix[x + 16] - pix[x]; + } +} + +static void integral_init24h(uint32_t *sum, pixel *pix, intptr_t stride) +{ + int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] + + pix[8] + pix[9] + pix[10] + pix[11] + pix[12] + pix[13] + pix[14] + pix[15] + + pix[16] + pix[17] + pix[18] + pix[19] + pix[20] + pix[21] + pix[22] + pix[23]; + for (int16_t x = 0; x < stride - 24; x++) + { + sum[x] = v + sum[x - stride]; + v += pix[x + 24] - pix[x]; + } +} + +static void integral_init32h(uint32_t *sum, pixel *pix, intptr_t stride) +{ + int32_t v = pix[0] + pix[1] + pix[2] + pix[3] + pix[4] + pix[5] + pix[6] + pix[7] + + pix[8] + pix[9] + pix[10] + pix[11] + pix[12] + pix[13] + pix[14] + pix[15] + + pix[16] + pix[17] + pix[18] + pix[19] + pix[20] + pix[21] + pix[22] + pix[23] + + pix[24] + pix[25] + pix[26] + pix[27] + pix[28] + pix[29] + pix[30] + pix[31]; + for (int16_t x = 0; x < stride - 32; x++) + { + sum[x] = v + sum[x - stride]; + v += pix[x + 32] - pix[x]; + } +} + +static void integral_init4v(uint32_t *sum4, intptr_t stride) +{ + for (int x = 0; x < stride; x++) + sum4[x] = sum4[x + 4 * stride] - sum4[x]; +} + +static void integral_init8v(uint32_t *sum8, intptr_t stride) +{ + for (int x = 0; x < stride; x++) + sum8[x] = sum8[x + 8 * stride] - sum8[x]; +} + +static void integral_init12v(uint32_t *sum12, intptr_t stride) +{ + for (int x = 0; x < stride; x++) + sum12[x] = sum12[x + 12 * stride] - sum12[x]; +} + +static void integral_init16v(uint32_t *sum16, intptr_t stride) +{ + for (int x = 0; x < stride; x++) + sum16[x] = sum16[x + 16 * stride] - sum16[x]; +} + +static void integral_init24v(uint32_t *sum24, intptr_t stride) +{ + for (int x = 0; x < stride; x++) + sum24[x] = sum24[x + 24 * stride] - sum24[x]; +} + +static void integral_init32v(uint32_t *sum32, intptr_t stride) +{ + for (int x = 0; x < stride; x++) + sum32[x] = sum32[x + 32 * stride] - sum32[x]; +} + void FrameFilter::destroy() { X265_FREE(m_ssimBuf); @@ -65,6 +168,7 @@ m_saoRowDelay = m_param->bEnableLoopFilter ? 1 : 0; m_lastHeight = (m_param->sourceHeight % g_maxCUSize) ? 
(m_param->sourceHeight % g_maxCUSize) : g_maxCUSize; m_lastWidth = (m_param->sourceWidth % g_maxCUSize) ? (m_param->sourceWidth % g_maxCUSize) : g_maxCUSize; + integralCompleted.set(0); if (m_param->bEnableSsim) m_ssimBuf = X265_MALLOC(int, 8 * (m_param->sourceWidth / 4 + 3)); @@ -499,14 +603,19 @@ if (!ctu->m_bFirstRowInSlice) processPostRow(row - 1); - if (ctu->m_bLastRowInSlice) - processPostRow(row); - // NOTE: slices parallelism will be execute out-of-order - int numRowFinished; - for(numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++) - if (!m_frame->m_reconRowFlag[numRowFinished].get()) - break; + int numRowFinished = 0; + if (m_frame->m_reconRowFlag) + { + for (numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++) + { + if (!m_frame->m_reconRowFlag[numRowFinished].get()) + break; + + if (numRowFinished == row) + continue; + } + } if (numRowFinished == m_numRows) { @@ -522,6 +631,9 @@ m_parallelFilter[0].m_sao.rdoSaoUnitRowEnd(saoParam, encData.m_slice->m_sps->numCUsInFrame); } } + + if (ctu->m_bLastRowInSlice) + processPostRow(row); } void FrameFilter::processPostRow(int row) @@ -656,6 +768,107 @@ } } // end of (m_param->maxSlices == 1) + int lastRow = row == (int)m_frame->m_encData->m_slice->m_sps->numCuInHeight - 1; + + /* generate integral planes for SEA motion search */ + if (m_param->searchMethod == X265_SEA && m_frame->m_encData->m_meIntegral && m_frame->m_lowres.sliceType != X265_TYPE_B) + { + /* If WPP, other than first row, integral calculation for current row needs to wait till the + * integral for the previous row is computed */ + if (m_param->bEnableWavefront && row) + { + while (m_parallelFilter[row - 1].m_frameFilter->integralCompleted.get() == 0) + { + m_parallelFilter[row - 1].m_frameFilter->integralCompleted.waitForChange(0); + } + } + + int stride = (int)m_frame->m_reconPic->m_stride; + int padX = g_maxCUSize + 32; + int padY = g_maxCUSize + 16; + int numCuInHeight = m_frame->m_encData->m_slice->m_sps->numCuInHeight; + int maxHeight = numCuInHeight * g_maxCUSize; + int startRow = 0; + + if (m_param->interlaceMode) + startRow = (row * g_maxCUSize >> 1); + else + startRow = row * g_maxCUSize; + + int height = lastRow ? 
(maxHeight + g_maxCUSize * m_param->interlaceMode) : (((row + m_param->interlaceMode) * g_maxCUSize) + g_maxCUSize); + + if (!row) + { + for (int i = 0; i < INTEGRAL_PLANE_NUM; i++) + memset(m_frame->m_encData->m_meIntegral[i] - padY * stride - padX, 0, stride * sizeof(uint32_t)); + startRow = -padY; + } + + if (lastRow) + height += padY - 1; + + for (int y = startRow; y < height; y++) + { + pixel *pix = m_frame->m_reconPic->m_picOrg[0] + y * stride - padX; + uint32_t *sum32x32 = m_frame->m_encData->m_meIntegral[0] + (y + 1) * stride - padX; + uint32_t *sum32x24 = m_frame->m_encData->m_meIntegral[1] + (y + 1) * stride - padX; + uint32_t *sum32x8 = m_frame->m_encData->m_meIntegral[2] + (y + 1) * stride - padX; + uint32_t *sum24x32 = m_frame->m_encData->m_meIntegral[3] + (y + 1) * stride - padX; + uint32_t *sum16x16 = m_frame->m_encData->m_meIntegral[4] + (y + 1) * stride - padX; + uint32_t *sum16x12 = m_frame->m_encData->m_meIntegral[5] + (y + 1) * stride - padX; + uint32_t *sum16x4 = m_frame->m_encData->m_meIntegral[6] + (y + 1) * stride - padX; + uint32_t *sum12x16 = m_frame->m_encData->m_meIntegral[7] + (y + 1) * stride - padX; + uint32_t *sum8x32 = m_frame->m_encData->m_meIntegral[8] + (y + 1) * stride - padX; + uint32_t *sum8x8 = m_frame->m_encData->m_meIntegral[9] + (y + 1) * stride - padX; + uint32_t *sum4x16 = m_frame->m_encData->m_meIntegral[10] + (y + 1) * stride - padX; + uint32_t *sum4x4 = m_frame->m_encData->m_meIntegral[11] + (y + 1) * stride - padX; + + /*For width = 32 */ + integral_init32h(sum32x32, pix, stride); + if (y >= 32 - padY) + integral_init32v(sum32x32 - 32 * stride, stride); + integral_init32h(sum32x24, pix, stride); + if (y >= 24 - padY) + integral_init24v(sum32x24 - 24 * stride, stride); + integral_init32h(sum32x8, pix, stride); + if (y >= 8 - padY) + integral_init8v(sum32x8 - 8 * stride, stride); + /*For width = 24 */ + integral_init24h(sum24x32, pix, stride); + if (y >= 32 - padY) + integral_init32v(sum24x32 - 32 * stride, stride); + /*For width = 16 */ + integral_init16h(sum16x16, pix, stride); + if (y >= 16 - padY) + integral_init16v(sum16x16 - 16 * stride, stride); + integral_init16h(sum16x12, pix, stride); + if (y >= 12 - padY) + integral_init12v(sum16x12 - 12 * stride, stride); + integral_init16h(sum16x4, pix, stride); + if (y >= 4 - padY) + integral_init4v(sum16x4 - 4 * stride, stride); + /*For width = 12 */ + integral_init12h(sum12x16, pix, stride); + if (y >= 16 - padY) + integral_init16v(sum12x16 - 16 * stride, stride); + /*For width = 8 */ + integral_init8h(sum8x32, pix, stride); + if (y >= 32 - padY) + integral_init32v(sum8x32 - 32 * stride, stride); + integral_init8h(sum8x8, pix, stride); + if (y >= 8 - padY) + integral_init8v(sum8x8 - 8 * stride, stride); + /*For width = 4 */ + integral_init4h(sum4x16, pix, stride); + if (y >= 16 - padY) + integral_init16v(sum4x16 - 16 * stride, stride); + integral_init4h(sum4x4, pix, stride); + if (y >= 4 - padY) + integral_init4v(sum4x4 - 4 * stride, stride); + } + m_parallelFilter[row].m_frameFilter->integralCompleted.set(1); + } + if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows) { m_frameEncoder->m_completionEvent.trigger();
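The integral_init*h/*v helpers added above build per-block-width summed-area rows so the SEA search can read any block's pixel sum in O(1) from m_meIntegral. A self-contained sketch of the same idea with a plain two-dimensional summed-area table follows; buildSAT and blockSum are illustrative names, not the x265 primitives.

#include <cstdint>
#include <cstdio>
#include <vector>

// Build a (h+1) x (w+1) summed-area table: sat[y][x] = sum of pix[0..y-1][0..x-1].
static std::vector<std::vector<uint32_t>> buildSAT(const std::vector<std::vector<uint8_t>>& pix) {
    size_t h = pix.size(), w = pix[0].size();
    std::vector<std::vector<uint32_t>> sat(h + 1, std::vector<uint32_t>(w + 1, 0));
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            sat[y + 1][x + 1] = pix[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x];
    return sat;
}

// Sum of the w x h block whose top-left corner is (x, y), read in O(1).
static uint32_t blockSum(const std::vector<std::vector<uint32_t>>& sat,
                         size_t x, size_t y, size_t w, size_t h) {
    return sat[y + h][x + w] - sat[y][x + w] - sat[y + h][x] + sat[y][x];
}

int main() {
    std::vector<std::vector<uint8_t>> pix(16, std::vector<uint8_t>(16, 1)); // flat 16x16 frame
    auto sat = buildSAT(pix);
    std::printf("8x8 block sum at (4,4) = %u\n", blockSum(sat, 4, 4, 8, 8)); // prints 64
    return 0;
}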
View file
x265_2.1.tar.gz/source/encoder/framefilter.h -> x265_2.2.tar.gz/source/encoder/framefilter.h
Changed
@@ -57,6 +57,8 @@ int m_lastHeight; int m_lastWidth; + ThreadSafeInteger integralCompleted; /* check if integral calculation is completed in this row */ + void* m_ssimBuf; /* Temp storage for ssim computation */ #define MAX_PFILTER_CUS (4) /* maximum CUs for every thread */
View file
x265_2.1.tar.gz/source/encoder/motion.cpp -> x265_2.2.tar.gz/source/encoder/motion.cpp
Changed
@@ -109,6 +109,8 @@ blockOffset = 0; bChromaSATD = false; chromaSatd = NULL; + for (int i = 0; i < INTEGRAL_PLANE_NUM; i++) + integral[i] = NULL; } void MotionEstimate::init(int csp) @@ -165,10 +167,12 @@ partEnum = partitionFromSizes(pwidth, pheight); X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n"); sad = primitives.pu[partEnum].sad; + ads = primitives.pu[partEnum].ads; satd = primitives.pu[partEnum].satd; sad_x3 = primitives.pu[partEnum].sad_x3; sad_x4 = primitives.pu[partEnum].sad_x4; + blockwidth = pwidth; blockOffset = offset; absPartIdx = ctuAddr = -1; @@ -188,6 +192,7 @@ partEnum = partitionFromSizes(pwidth, pheight); X265_CHECK(LUMA_4x4 != partEnum, "4x4 inter partition detected!\n"); sad = primitives.pu[partEnum].sad; + ads = primitives.pu[partEnum].ads; satd = primitives.pu[partEnum].satd; sad_x3 = primitives.pu[partEnum].sad_x3; sad_x4 = primitives.pu[partEnum].sad_x4; @@ -278,12 +283,31 @@ costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \ costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \ costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \ - COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \ - COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \ - COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \ - COPY2_IF_LT(bcost, costs[3], bmv, omv + MV(m3x, m3y)); \ + if ((omv.y + m0y >= mvmin.y) & (omv.y + m0y <= mvmax.y)) \ + COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \ + if ((omv.y + m1y >= mvmin.y) & (omv.y + m1y <= mvmax.y)) \ + COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \ + if ((omv.y + m2y >= mvmin.y) & (omv.y + m2y <= mvmax.y)) \ + COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \ + if ((omv.y + m3y >= mvmin.y) & (omv.y + m3y <= mvmax.y)) \ + COPY2_IF_LT(bcost, costs[3], bmv, omv + MV(m3x, m3y)); \ } +#define COST_MV_X3_ABS( m0x, m0y, m1x, m1y, m2x, m2y )\ +{\ + sad_x3(fenc, \ + fref + (m0x) + (m0y) * stride, \ + fref + (m1x) + (m1y) * stride, \ + fref + (m2x) + (m2y) * stride, \ + stride, costs); \ + costs[0] += p_cost_mvx[(m0x) << 2]; /* no cost_mvy */\ + costs[1] += p_cost_mvx[(m1x) << 2]; \ + costs[2] += p_cost_mvx[(m2x) << 2]; \ + COPY3_IF_LT(bcost, costs[0], bmv.x, m0x, bmv.y, m0y); \ + COPY3_IF_LT(bcost, costs[1], bmv.x, m1x, bmv.y, m1y); \ + COPY3_IF_LT(bcost, costs[2], bmv.x, m2x, bmv.y, m2y); \ +} + #define COST_MV_X4_DIR(m0x, m0y, m1x, m1y, m2x, m2y, m3x, m3y, costs) \ { \ pixel *pix_base = fref + bmv.x + bmv.y * stride; \ @@ -627,6 +651,7 @@ { bcost = cost; bmv = 0; + bmv.y = X265_MAX(X265_MIN(0, mvmax.y), mvmin.y); } } @@ -659,8 +684,10 @@ do { COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs); - COPY1_IF_LT(bcost, (costs[0] << 4) + 1); - COPY1_IF_LT(bcost, (costs[1] << 4) + 3); + if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[0] << 4) + 1); + if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[1] << 4) + 3); COPY1_IF_LT(bcost, (costs[2] << 4) + 4); COPY1_IF_LT(bcost, (costs[3] << 4) + 12); if (!(bcost & 15)) @@ -698,36 +725,57 @@ /* equivalent to the above, but eliminates duplicate candidates */ COST_MV_X3_DIR(-2, 0, -1, 2, 1, 2, costs); bcost <<= 3; - COPY1_IF_LT(bcost, (costs[0] << 3) + 2); - COPY1_IF_LT(bcost, (costs[1] << 3) + 3); - COPY1_IF_LT(bcost, (costs[2] << 3) + 4); + if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[0] << 3) + 2); + if ((bmv.y + 2 >= mvmin.y) & (bmv.y + 2 <= mvmax.y)) + { + COPY1_IF_LT(bcost, (costs[1] << 3) + 3); + COPY1_IF_LT(bcost, (costs[2] << 3) + 4); + } + COST_MV_X3_DIR(2, 0, 1, -2, -1, 
-2, costs); - COPY1_IF_LT(bcost, (costs[0] << 3) + 5); - COPY1_IF_LT(bcost, (costs[1] << 3) + 6); - COPY1_IF_LT(bcost, (costs[2] << 3) + 7); + if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[0] << 3) + 5); + if ((bmv.y - 2 >= mvmin.y) & (bmv.y - 2 <= mvmax.y)) + { + COPY1_IF_LT(bcost, (costs[1] << 3) + 6); + COPY1_IF_LT(bcost, (costs[2] << 3) + 7); + } if (bcost & 7) { int dir = (bcost & 7) - 2; - bmv += hex2[dir + 1]; - /* half hexagon, not overlapping the previous iteration */ - for (int i = (merange >> 1) - 1; i > 0 && bmv.checkRange(mvmin, mvmax); i--) + if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y)) { - COST_MV_X3_DIR(hex2[dir + 0].x, hex2[dir + 0].y, - hex2[dir + 1].x, hex2[dir + 1].y, - hex2[dir + 2].x, hex2[dir + 2].y, - costs); - bcost &= ~7; - COPY1_IF_LT(bcost, (costs[0] << 3) + 1); - COPY1_IF_LT(bcost, (costs[1] << 3) + 2); - COPY1_IF_LT(bcost, (costs[2] << 3) + 3); - if (!(bcost & 7)) - break; - dir += (bcost & 7) - 2; - dir = mod6m1[dir + 1]; bmv += hex2[dir + 1]; - } + + /* half hexagon, not overlapping the previous iteration */ + for (int i = (merange >> 1) - 1; i > 0 && bmv.checkRange(mvmin, mvmax); i--) + { + COST_MV_X3_DIR(hex2[dir + 0].x, hex2[dir + 0].y, + hex2[dir + 1].x, hex2[dir + 1].y, + hex2[dir + 2].x, hex2[dir + 2].y, + costs); + bcost &= ~7; + + if ((bmv.y + hex2[dir + 0].y >= mvmin.y) & (bmv.y + hex2[dir + 0].y <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[0] << 3) + 1); + + if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[1] << 3) + 2); + + if ((bmv.y + hex2[dir + 2].y >= mvmin.y) & (bmv.y + hex2[dir + 2].y <= mvmax.y)) + COPY1_IF_LT(bcost, (costs[2] << 3) + 3); + + if (!(bcost & 7)) + break; + + dir += (bcost & 7) - 2; + dir = mod6m1[dir + 1]; + bmv += hex2[dir + 1]; + } + } // if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y)) } bcost >>= 3; #endif // if 0 @@ -735,15 +783,21 @@ /* square refine */ int dir = 0; COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs); - COPY2_IF_LT(bcost, costs[0], dir, 1); - COPY2_IF_LT(bcost, costs[1], dir, 2); + if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y)) + COPY2_IF_LT(bcost, costs[0], dir, 1); + if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y)) + COPY2_IF_LT(bcost, costs[1], dir, 2); COPY2_IF_LT(bcost, costs[2], dir, 3); COPY2_IF_LT(bcost, costs[3], dir, 4); COST_MV_X4_DIR(-1, -1, -1, 1, 1, -1, 1, 1, costs); - COPY2_IF_LT(bcost, costs[0], dir, 5); - COPY2_IF_LT(bcost, costs[1], dir, 6); - COPY2_IF_LT(bcost, costs[2], dir, 7); - COPY2_IF_LT(bcost, costs[3], dir, 8); + if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y)) + COPY2_IF_LT(bcost, costs[0], dir, 5); + if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y)) + COPY2_IF_LT(bcost, costs[1], dir, 6); + if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y)) + COPY2_IF_LT(bcost, costs[2], dir, 7); + if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y)) + COPY2_IF_LT(bcost, costs[3], dir, 8); bmv += square1[dir]; break; } @@ -756,6 +810,7 @@ /* refine predictors */ omv = bmv; ucost1 = bcost; + X265_CHECK(((pmv.y >= mvmin.y) & (pmv.y <= mvmax.y)), "pmv outside of search range!"); DIA1_ITER(pmv.x, pmv.y); if (pmv.notZero()) DIA1_ITER(0, 0); @@ -879,7 +934,7 @@ stride, costs + 4 * k); \ fref_base += 2 * dy; #define ADD_MVCOST(k, x, y) costs[k] += p_cost_omvx[x * 4 * i] + p_cost_omvy[y * 4 * i] -#define MIN_MV(k, x, y) COPY2_IF_LT(bcost, costs[k], dir, x * 16 + (y & 15)) +#define MIN_MV(k, dx, dy) if ((omv.y + (dy) >= mvmin.y) & (omv.y 
+ (dy) <= mvmax.y)) { COPY2_IF_LT(bcost, costs[k], dir, dx * 16 + (dy & 15)) } SADS(0, +0, -4, +0, +4, -2, -3, +2, -3); SADS(1, -4, -2, +4, -2, -4, -1, +4, -1); @@ -1043,6 +1098,161 @@ break; } + case X265_SEA: + { + // Successive Elimination Algorithm + const int16_t minX = X265_MAX(omv.x - (int16_t)merange, mvmin.x); + const int16_t minY = X265_MAX(omv.y - (int16_t)merange, mvmin.y); + const int16_t maxX = X265_MIN(omv.x + (int16_t)merange, mvmax.x); + const int16_t maxY = X265_MIN(omv.y + (int16_t)merange, mvmax.y); + const uint16_t *p_cost_mvx = m_cost_mvx - qmvp.x; + const uint16_t *p_cost_mvy = m_cost_mvy - qmvp.y; + int16_t* meScratchBuffer = NULL; + int scratchSize = merange * 2 + 4; + if (scratchSize) + { + meScratchBuffer = X265_MALLOC(int16_t, scratchSize); + memset(meScratchBuffer, 0, sizeof(int16_t)* scratchSize); + } + + /* SEA is fastest in multiples of 4 */ + int meRangeWidth = (maxX - minX + 3) & ~3; + int w = 0, h = 0; // Width and height of the PU + ALIGN_VAR_32(pixel, zero[64 * FENC_STRIDE]) = { 0 }; + ALIGN_VAR_32(int, encDC[4]); + uint16_t *fpelCostMvX = m_fpelMvCosts[-qmvp.x & 3] + (-qmvp.x >> 2); + sizesFromPartition(partEnum, &w, &h); + int deltaX = (w <= 8) ? (w) : (w >> 1); + int deltaY = (h <= 8) ? (h) : (h >> 1); + + /* Check if very small rectangular blocks which cannot be sub-divided anymore */ + bool smallRectPartition = partEnum == LUMA_4x4 || partEnum == LUMA_16x12 || + partEnum == LUMA_12x16 || partEnum == LUMA_16x4 || partEnum == LUMA_4x16; + /* Check if vertical partition */ + bool verticalRect = partEnum == LUMA_32x64 || partEnum == LUMA_16x32 || partEnum == LUMA_8x16 || + partEnum == LUMA_4x8; + /* Check if horizontal partition */ + bool horizontalRect = partEnum == LUMA_64x32 || partEnum == LUMA_32x16 || partEnum == LUMA_16x8 || + partEnum == LUMA_8x4; + /* Check if assymetric vertical partition */ + bool assymetricVertical = partEnum == LUMA_12x16 || partEnum == LUMA_4x16 || partEnum == LUMA_24x32 || + partEnum == LUMA_8x32 || partEnum == LUMA_48x64 || partEnum == LUMA_16x64; + /* Check if assymetric horizontal partition */ + bool assymetricHorizontal = partEnum == LUMA_16x12 || partEnum == LUMA_16x4 || partEnum == LUMA_32x24 || + partEnum == LUMA_32x8 || partEnum == LUMA_64x48 || partEnum == LUMA_64x16; + + int tempPartEnum = 0; + + /* If a vertical rectangular partition, it is horizontally split into two, for ads_x2() */ + if (verticalRect) + tempPartEnum = partitionFromSizes(w, h >> 1); + /* If a horizontal rectangular partition, it is vertically split into two, for ads_x2() */ + else if (horizontalRect) + tempPartEnum = partitionFromSizes(w >> 1, h); + /* We have integral planes introduced to account for assymetric partitions. + * Hence all assymetric partitions except those which cannot be split into legal sizes, + * are split into four for ads_x4() */ + else if (assymetricVertical || assymetricHorizontal) + tempPartEnum = smallRectPartition ? partEnum : partitionFromSizes(w >> 1, h >> 1); + /* General case: Square partitions. All partitions with width > 8 are split into four + * for ads_x4(), for 4x4 and 8x8 we do ads_x1() */ + else + tempPartEnum = (w <= 8) ? partEnum : partitionFromSizes(w >> 1, h >> 1); + + /* Successive elimination by comparing DC before a full SAD, + * because sum(abs(diff)) >= abs(diff(sum)). 
*/ + primitives.pu[tempPartEnum].sad_x4(zero, + fenc, + fenc + deltaX, + fenc + deltaY * FENC_STRIDE, + fenc + deltaX + deltaY * FENC_STRIDE, + FENC_STRIDE, + encDC); + + /* Assigning appropriate integral plane */ + uint32_t *sumsBase = NULL; + switch (deltaX) + { + case 32: if (deltaY % 24 == 0) + sumsBase = integral[1]; + else if (deltaY == 8) + sumsBase = integral[2]; + else + sumsBase = integral[0]; + break; + case 24: sumsBase = integral[3]; + break; + case 16: if (deltaY % 12 == 0) + sumsBase = integral[5]; + else if (deltaY == 4) + sumsBase = integral[6]; + else + sumsBase = integral[4]; + break; + case 12: sumsBase = integral[7]; + break; + case 8: if (deltaY == 32) + sumsBase = integral[8]; + else + sumsBase = integral[9]; + break; + case 4: if (deltaY == 16) + sumsBase = integral[10]; + else + sumsBase = integral[11]; + break; + default: sumsBase = integral[11]; + break; + } + + if (partEnum == LUMA_64x64 || partEnum == LUMA_32x32 || partEnum == LUMA_16x16 || + partEnum == LUMA_32x64 || partEnum == LUMA_16x32 || partEnum == LUMA_8x16 || + partEnum == LUMA_4x8 || partEnum == LUMA_12x16 || partEnum == LUMA_4x16 || + partEnum == LUMA_24x32 || partEnum == LUMA_8x32 || partEnum == LUMA_48x64 || + partEnum == LUMA_16x64) + deltaY *= (int)stride; + + if (verticalRect) + encDC[1] = encDC[2]; + + if (horizontalRect) + deltaY = deltaX; + + /* ADS and SAD */ + MV tmv; + for (tmv.y = minY; tmv.y <= maxY; tmv.y++) + { + int i, xn; + int ycost = p_cost_mvy[tmv.y] << 2; + if (bcost <= ycost) + continue; + bcost -= ycost; + + /* ADS_4 for 16x16, 32x32, 64x64, 24x32, 32x24, 48x64, 64x48, 32x8, 8x32, 64x16, 16x64 partitions + * ADS_1 for 4x4, 8x8, 16x4, 4x16, 16x12, 12x16 partitions + * ADS_2 for all other rectangular partitions */ + xn = ads(encDC, + sumsBase + minX + tmv.y * stride, + deltaY, + fpelCostMvX + minX, + meScratchBuffer, + meRangeWidth, + bcost); + + for (i = 0; i < xn - 2; i += 3) + COST_MV_X3_ABS(minX + meScratchBuffer[i], tmv.y, + minX + meScratchBuffer[i + 1], tmv.y, + minX + meScratchBuffer[i + 2], tmv.y); + + bcost += ycost; + for (; i < xn; i++) + COST_MV(minX + meScratchBuffer[i], tmv.y); + } + if (meScratchBuffer) + x265_free(meScratchBuffer); + break; + } + case X265_FULL_SEARCH: { // dead slow exhaustive search, but at least it uses sad_x4() @@ -1099,6 +1309,7 @@ if ((g_maxSlices > 1) & ((bmv.y < qmvmin.y) | (bmv.y > qmvmax.y))) { bmv.y = x265_min(x265_max(bmv.y, qmvmin.y), qmvmax.y); + bcost = subpelCompare(ref, bmv, satd) + mvcost(bmv); } if (!bcost) @@ -1113,6 +1324,11 @@ for (int i = 1; i <= wl.hpel_dirs; i++) { MV qmv = bmv + square1[i] * 2; + + /* skip invalid range */ + if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y)) + continue; + int cost = ref->lowresQPelCost(fenc, blockOffset, qmv, sad) + mvcost(qmv); COPY2_IF_LT(bcost, cost, bdir, i); } @@ -1124,6 +1340,11 @@ for (int i = 1; i <= wl.qpel_dirs; i++) { MV qmv = bmv + square1[i]; + + /* skip invalid range */ + if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y)) + continue; + int cost = ref->lowresQPelCost(fenc, blockOffset, qmv, satd) + mvcost(qmv); COPY2_IF_LT(bcost, cost, bdir, i); } @@ -1150,7 +1371,7 @@ MV qmv = bmv + square1[i] * 2; // check mv range for slice bound - if ((g_maxSlices > 1) & ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))) + if ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y)) continue; int cost = subpelCompare(ref, qmv, hpelcomp) + mvcost(qmv); @@ -1175,7 +1396,7 @@ MV qmv = bmv + square1[i]; // check mv range for slice bound - if ((g_maxSlices > 1) & ((qmv.y < qmvmin.y) | (qmv.y > qmvmax.y))) + if ((qmv.y < 
qmvmin.y) | (qmv.y > qmvmax.y)) continue; int cost = subpelCompare(ref, qmv, satd) + mvcost(qmv); @@ -1189,6 +1410,9 @@ } } + // check mv range for slice bound + X265_CHECK(((bmv.y >= qmvmin.y) & (bmv.y <= qmvmax.y)), "mv beyond range!"); + x265_emms(); outQMv = bmv; return bcost;
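The new X265_SEA case above rests on the bound sum(abs(diff)) >= abs(diff(sum)): if the difference between the source block's pixel sum and a candidate's pixel sum already exceeds the best cost so far, the full SAD cannot win and is skipped (this is the filtering the ads() call performs against the integral planes). A hedged one-dimensional illustration of that elimination step:

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <vector>

// Full SAD between two equally sized blocks.
static int sad(const std::vector<int>& a, const std::vector<int>& b) {
    int s = 0;
    for (size_t i = 0; i < a.size(); i++) s += std::abs(a[i] - b[i]);
    return s;
}

int main() {
    std::vector<int> fenc = {10, 12, 11, 13, 10, 12, 11, 13};
    std::vector<std::vector<int>> candidates = {
        {10, 12, 11, 13, 10, 12, 11, 14},   // close match
        {60, 61, 62, 63, 60, 61, 62, 63},   // far off: removed by the sum test alone
    };
    int encSum = std::accumulate(fenc.begin(), fenc.end(), 0);
    int bestCost = INT32_MAX;
    for (const auto& cand : candidates) {
        int candSum = std::accumulate(cand.begin(), cand.end(), 0);
        // |sum(a) - sum(b)| <= sum(|a - b|), so this is a cheap lower bound on SAD.
        if (std::abs(encSum - candSum) >= bestCost)
            continue;                        // successive elimination: skip the full SAD
        int cost = sad(fenc, cand);
        if (cost < bestCost) bestCost = cost;
    }
    std::printf("best SAD = %d\n", bestCost);
    return 0;
}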
View file
x265_2.1.tar.gz/source/encoder/motion.h -> x265_2.2.tar.gz/source/encoder/motion.h
Changed
@@ -52,6 +52,7 @@ pixelcmp_t sad; pixelcmp_x3_t sad_x3; pixelcmp_x4_t sad_x4; + pixelcmp_ads_t ads; pixelcmp_t satd; pixelcmp_t chromaSatd; @@ -61,6 +62,7 @@ static const int COST_MAX = 1 << 28; + uint32_t* integral[INTEGRAL_PLANE_NUM]; Yuv fencPUYuv; int partEnum; bool bChromaSATD;
View file
x265_2.1.tar.gz/source/encoder/nal.h -> x265_2.2.tar.gz/source/encoder/nal.h
Changed
@@ -34,6 +34,7 @@ class NALList { +public: static const int MAX_NAL_UNITS = 16; public:
View file
x265_2.1.tar.gz/source/encoder/ratecontrol.cpp -> x265_2.2.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -341,6 +341,8 @@ m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, m_param->rc.vbvBufferInit / m_param->rc.vbvBufferSize); m_param->rc.vbvBufferInit = x265_clip3(0.0, 1.0, X265_MAX(m_param->rc.vbvBufferInit, m_bufferRate / m_bufferSize)); m_bufferFillFinal = m_bufferSize * m_param->rc.vbvBufferInit; + m_bufferFillActual = m_bufferFillFinal; + m_bufferExcess = 0; } m_totalBits = 0; @@ -431,7 +433,7 @@ } *statsIn = '\0'; statsIn++; - if (sscanf(opts, "#options: %dx%d", &i, &j) != 2) + if ((p = strstr(opts, " input-res=")) == 0 || sscanf(p, " input-res=%dx%d", &i, &j) != 2) { x265_log(m_param, X265_LOG_ERROR, "Resolution specified in stats file not valid\n"); return false; @@ -457,9 +459,15 @@ CMP_OPT_FIRST_PASS("bframes", m_param->bframes); CMP_OPT_FIRST_PASS("b-pyramid", m_param->bBPyramid); CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP); - CMP_OPT_FIRST_PASS("keyint", m_param->keyframeMax); + CMP_OPT_FIRST_PASS(" keyint", m_param->keyframeMax); CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold); CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh); + if (m_param->bMultiPassOptRPS) + { + CMP_OPT_FIRST_PASS("multi-pass-opt-rps", m_param->bMultiPassOptRPS); + CMP_OPT_FIRST_PASS("repeat-headers", m_param->bRepeatHeaders); + CMP_OPT_FIRST_PASS("min-keyint", m_param->keyframeMin); + } if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS) { @@ -542,10 +550,27 @@ } rce = &m_rce2Pass[encodeOrder]; m_encOrder[frameNumber] = encodeOrder; - e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf", - &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, - &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, - &rce->skipCuCount); + if (!m_param->bMultiPassOptRPS) + { + e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf", + &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, + &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, + &rce->skipCuCount); + } + else + { + char deltaPOC[128]; + char bUsed[40]; + memset(deltaPOC, 0, sizeof(deltaPOC)); + memset(bUsed, 0, sizeof(bUsed)); + e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf nump:%d numnegp:%d numposp:%d deltapoc:%s bused:%s", + &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, + &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, + &rce->skipCuCount, &rce->rpsData.numberOfPictures, &rce->rpsData.numberOfNegativePictures, &rce->rpsData.numberOfPositivePictures, deltaPOC, bUsed); + splitdeltaPOC(deltaPOC, rce); + splitbUsed(bUsed, rce); + rce->rpsIdx = -1; + } rce->keptAsRef = true; rce->isIdr = false; if (picType == 'b' || picType == 'p') @@ -598,7 +623,7 @@ x265_log_file(m_param, X265_LOG_ERROR, "can't open stats file %s.temp\n", fileName); return false; } - p = x265_param2string(m_param); + p = x265_param2string(m_param, sps.conformanceWindow.rightOffset, sps.conformanceWindow.bottomOffset); if (p) fprintf(m_statFileOut, "#options: %s\n", p); X265_FREE(p); @@ -1649,15 +1674,18 @@ if (m_pred[m_predType].count == 1) qScale = x265_clip3(lmin, lmax, qScale); m_lastQScaleFor[m_sliceType] = qScale; - rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd); } - else - rce->frameSizePlanned = qScale2bits(rce, qScale); + } - /* Limit planned size by MinCR */ + if (m_2pass) + 
rce->frameSizePlanned = qScale2bits(rce, qScale); + else + rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd); + + /* Limit planned size by MinCR */ + if (m_isVbv) rce->frameSizePlanned = X265_MIN(rce->frameSizePlanned, rce->frameSizeMaximum); - rce->frameSizeEstimated = rce->frameSizePlanned; - } + rce->frameSizeEstimated = rce->frameSizePlanned; rce->newQScale = qScale; if(rce->bLastMiniGopBFrame) @@ -1875,7 +1903,7 @@ if ((m_curSlice->m_poc == 0 || m_lastQScaleFor[P_SLICE] < q) && !(m_2pass && !m_isVbv)) m_lastQScaleFor[P_SLICE] = q * fabs(m_param->rc.ipFactor); - if (m_2pass && m_isVbv) + if (m_2pass) rce->frameSizePlanned = qScale2bits(rce, q); else rce->frameSizePlanned = predictSize(&m_pred[m_predType], q, (double)m_currentSatd); @@ -2161,7 +2189,7 @@ for (uint32_t row = 0; row < maxRows; row++) { encodedBitsSoFar += curEncData.m_rowStat[row].encodedBits; - rowSatdCostSoFar = curEncData.m_rowStat[row].diagSatd; + rowSatdCostSoFar = curEncData.m_rowStat[row].rowSatd; uint32_t satdCostForPendingCus = curEncData.m_rowStat[row].satdForVbv - rowSatdCostSoFar; satdCostForPendingCus >>= X265_DEPTH - 8; if (satdCostForPendingCus > 0) @@ -2190,7 +2218,7 @@ } refRowSatdCost >>= X265_DEPTH - 8; - refQScale = refEncData.m_rowStat[row].diagQpScale; + refQScale = refEncData.m_rowStat[row].rowQpScale; } if (picType == I_SLICE || qScale >= refQScale) @@ -2212,7 +2240,7 @@ } else if (picType == P_SLICE) { - intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].diagIntraSatd; + intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].rowIntraSatd; intraCostForPendingCus >>= X265_DEPTH - 8; /* Our QP is lower than the reference! */ double pred_intra = predictSize(rce->rowPred[1], qScale, intraCostForPendingCus); @@ -2227,16 +2255,16 @@ return totalSatdBits + encodedBitsSoFar; } -int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv) +int RateControl::rowVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv) { FrameData& curEncData = *curFrame->m_encData; double qScaleVbv = x265_qp2qScale(qpVbv); - uint64_t rowSatdCost = curEncData.m_rowStat[row].diagSatd; + uint64_t rowSatdCost = curEncData.m_rowStat[row].rowSatd; double encodedBits = curEncData.m_rowStat[row].encodedBits; - if (row == 1) + if (m_param->bEnableWavefront && row == 1) { - rowSatdCost += curEncData.m_rowStat[0].diagSatd; + rowSatdCost += curEncData.m_rowStat[0].rowSatd; encodedBits += curEncData.m_rowStat[0].encodedBits; } rowSatdCost >>= X265_DEPTH - 8; @@ -2244,11 +2272,11 @@ if (curEncData.m_slice->m_sliceType != I_SLICE) { Frame* refFrame = curEncData.m_slice->m_refFrameList[0][0]; - if (qpVbv < refFrame->m_encData->m_rowStat[row].diagQp) + if (qpVbv < refFrame->m_encData->m_rowStat[row].rowQp) { - uint64_t intraRowSatdCost = curEncData.m_rowStat[row].diagIntraSatd; - if (row == 1) - intraRowSatdCost += curEncData.m_rowStat[0].diagIntraSatd; + uint64_t intraRowSatdCost = curEncData.m_rowStat[row].rowIntraSatd; + if (m_param->bEnableWavefront && row == 1) + intraRowSatdCost += curEncData.m_rowStat[0].rowIntraSatd; intraRowSatdCost >>= X265_DEPTH - 8; updatePredictor(rce->rowPred[1], qScaleVbv, (double)intraRowSatdCost, encodedBits); } @@ -2309,7 +2337,7 @@ } while (qpVbv > qpMin - && (qpVbv > curEncData.m_rowStat[0].diagQp || m_singleFrameVbv) + && (qpVbv > curEncData.m_rowStat[0].rowQp || m_singleFrameVbv) && (((accFrameBits < 
rce->frameSizePlanned * 0.8f && qpVbv <= prevRowQp) || accFrameBits < (rce->bufferFill - m_bufferSize + m_bufferRate) * 1.1) && (!m_param->rc.bStrictCbr ? 1 : abrOvershoot < 0))) @@ -2329,7 +2357,7 @@ accFrameBits = predictRowsSizeSum(curFrame, rce, qpVbv, encodedBitsSoFar); abrOvershoot = (accFrameBits + m_totalBits - m_wantedBitsWindow) / totalBitsNeeded; } - if (qpVbv > curEncData.m_rowStat[0].diagQp && + if (qpVbv > curEncData.m_rowStat[0].rowQp && abrOvershoot < -0.1 && timeDone > 0.5 && accFrameBits < rce->frameSizePlanned - rcTol) { qpVbv -= stepSize; @@ -2446,6 +2474,10 @@ m_bufferFillFinal = X265_MAX(m_bufferFillFinal, 0); m_bufferFillFinal += m_bufferRate; m_bufferFillFinal = X265_MIN(m_bufferFillFinal, m_bufferSize); + double bufferBits = X265_MIN(bits + m_bufferExcess, m_bufferRate); + m_bufferExcess = X265_MAX(m_bufferExcess - bufferBits + bits, 0); + m_bufferFillActual += bufferBits - bits; + m_bufferFillActual = X265_MIN(m_bufferFillActual, m_bufferSize); } /* After encoding one frame, update rate control state */ @@ -2626,18 +2658,55 @@ char cType = rce->sliceType == I_SLICE ? (curFrame->m_lowres.sliceType == X265_TYPE_IDR ? 'I' : 'i') : rce->sliceType == P_SLICE ? 'P' : IS_REFERENCED(curFrame) ? 'B' : 'b'; - if (fprintf(m_statFileOut, - "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n", - rce->poc, rce->encodeOrder, - cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq, - rce->qpNoVbv, rce->qRceq, - curFrame->m_encData->m_frameStats.coeffBits, - curFrame->m_encData->m_frameStats.mvBits, - curFrame->m_encData->m_frameStats.miscBits, - curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu, - curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu, - curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu) < 0) - goto writeFailure; + + if (!curEncData.m_param->bMultiPassOptRPS) + { + if (fprintf(m_statFileOut, + "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n", + rce->poc, rce->encodeOrder, + cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq, + rce->qpNoVbv, rce->qRceq, + curFrame->m_encData->m_frameStats.coeffBits, + curFrame->m_encData->m_frameStats.mvBits, + curFrame->m_encData->m_frameStats.miscBits, + curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu, + curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu, + curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu) < 0) + goto writeFailure; + } + else{ + RPS* rpsWriter = &curFrame->m_encData->m_slice->m_rps; + int i, num = rpsWriter->numberOfPictures; + char deltaPOC[128]; + char bUsed[40]; + memset(deltaPOC, 0, sizeof(deltaPOC)); + memset(bUsed, 0, sizeof(bUsed)); + sprintf(deltaPOC, "deltapoc:~"); + sprintf(bUsed, "bused:~"); + + for (i = 0; i < num; i++) + { + sprintf(deltaPOC, "%s%d~", deltaPOC, rpsWriter->deltaPOC[i]); + sprintf(bUsed, "%s%d~", bUsed, rpsWriter->bUsed[i]); + } + + if (fprintf(m_statFileOut, + "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f nump:%d numnegp:%d numposp:%d %s %s ;\n", + rce->poc, rce->encodeOrder, + cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq, + rce->qpNoVbv, rce->qRceq, + curFrame->m_encData->m_frameStats.coeffBits, + curFrame->m_encData->m_frameStats.mvBits, + curFrame->m_encData->m_frameStats.miscBits, + curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu, + curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu, + 
curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu, + rpsWriter->numberOfPictures, + rpsWriter->numberOfNegativePictures, + rpsWriter->numberOfPositivePictures, + deltaPOC, bUsed) < 0) + goto writeFailure; + } /* Don't re-write the data in multi-pass mode. */ if (m_param->rc.cuTree && IS_REFERENCED(curFrame) && !m_param->rc.bStatRead) { @@ -2730,3 +2799,48 @@ X265_FREE(m_param->rc.zones); } +void RateControl::splitdeltaPOC(char deltapoc[], RateControlEntry *rce) +{ + int idx = 0, length = 0; + char tmpStr[128]; + char* src = deltapoc; + char* buf = strstr(src, "~"); + while (buf) + { + memset(tmpStr, 0, sizeof(tmpStr)); + length = (int)(buf - src); + if (length != 0) + { + strncpy(tmpStr, src, length); + rce->rpsData.deltaPOC[idx] = atoi(tmpStr); + idx++; + if (idx == rce->rpsData.numberOfPictures) + break; + } + src += (length + 1); + buf = strstr(src, "~"); + } +} + +void RateControl::splitbUsed(char bused[], RateControlEntry *rce) +{ + int idx = 0, length = 0; + char tmpStr[128]; + char* src = bused; + char* buf = strstr(src, "~"); + while (buf) + { + memset(tmpStr, 0, sizeof(tmpStr)); + length = (int)(buf - src); + if (length != 0) + { + strncpy(tmpStr, src, length); + rce->rpsData.bUsed[idx] = atoi(tmpStr) > 0; + idx++; + if (idx == rce->rpsData.numberOfPictures) + break; + } + src += (length + 1); + buf = strstr(src, "~"); + } +}
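With --multi-pass-opt-rps the first pass now appends the reference picture set to each stats line as '~'-separated deltapoc/bused fields, and the second pass recovers it through splitdeltaPOC/splitbUsed above. The round-trip below shows the same delimiter scheme using std::string instead of the patch's strstr loop; joinDeltaPoc and splitDeltaPoc are illustrative helpers, not x265 functions.

#include <cstdio>
#include <string>
#include <vector>

// Serialize a list of delta-POC values the way the stats line does: "~-1~-2~-4~".
static std::string joinDeltaPoc(const std::vector<int>& deltaPoc) {
    std::string out = "~";
    for (int d : deltaPoc) out += std::to_string(d) + "~";
    return out;
}

// Split the '~'-delimited field back into integers (second-pass side).
static std::vector<int> splitDeltaPoc(const std::string& field) {
    std::vector<int> out;
    size_t pos = 0;
    while (true) {
        size_t start = field.find_first_not_of('~', pos);
        if (start == std::string::npos) break;
        size_t end = field.find('~', start);
        out.push_back(std::stoi(field.substr(start, end - start)));
        if (end == std::string::npos) break;
        pos = end + 1;
    }
    return out;
}

int main() {
    std::vector<int> rps = {-1, -2, -4};
    std::string field = joinDeltaPoc(rps);                    // "~-1~-2~-4~"
    for (int d : splitDeltaPoc(field)) std::printf("%d ", d); // prints: -1 -2 -4
    std::printf("\n");
    return 0;
}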
View file
x265_2.1.tar.gz/source/encoder/ratecontrol.h -> x265_2.2.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -111,6 +111,8 @@ bool isIdr; SEIPictureTiming *picTimingSEI; HRDTiming *hrdTiming; + int rpsIdx; + RPS rpsData; }; class RateControl @@ -144,6 +146,8 @@ double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */ double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */ double m_avgPFrameQp; + double m_bufferFillActual; + double m_bufferExcess; bool m_isFirstMiniGop; Predictor m_pred[4]; /* Slice predictors to preidct bits for each Slice type - I,P,Bref and B */ int64_t m_leadingNoBSatd; @@ -239,7 +243,7 @@ int rateControlStart(Frame* curFrame, RateControlEntry* rce, Encoder* enc); void rateControlUpdateStats(RateControlEntry* rce); int rateControlEnd(Frame* curFrame, int64_t bits, RateControlEntry* rce); - int rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv); + int rowVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv); int rateControlSliceType(int frameNum); bool cuTreeReadFor2Pass(Frame* curFrame); void hrdFullness(SEIBufferingPeriod* sei); @@ -280,6 +284,8 @@ bool findUnderflow(double *fills, int *t0, int *t1, int over, int framesCount); bool fixUnderflow(int t0, int t1, double adjustment, double qscaleMin, double qscaleMax); double tuneQScaleForGrain(double rcOverflow); + void splitdeltaPOC(char deltapoc[], RateControlEntry *rce); + void splitbUsed(char deltapoc[], RateControlEntry *rce); }; } #endif // ifndef X265_RATECONTROL_H
View file
x265_2.1.tar.gz/source/encoder/reference.cpp -> x265_2.2.tar.gz/source/encoder/reference.cpp
Changed
@@ -128,11 +128,12 @@ intptr_t stride = reconPic->m_stride; int width = reconPic->m_picWidth; int height = (finishedRows - numWeightedRows) * g_maxCUSize; - if ((finishedRows == maxNumRows) && (reconPic->m_picHeight % g_maxCUSize)) + /* the last row may be partial height */ + if (finishedRows == maxNumRows - 1) { - /* the last row may be partial height */ - height -= g_maxCUSize; - height += reconPic->m_picHeight % g_maxCUSize; + const int leftRows = (reconPic->m_picHeight & (g_maxCUSize - 1)); + + height += leftRows ? leftRows : g_maxCUSize; } int cuHeight = g_maxCUSize; @@ -172,7 +173,7 @@ } // Extending Bottom - if (finishedRows == maxNumRows) + if (finishedRows == maxNumRows - 1) { int picHeight = reconPic->m_picHeight; if (c) picHeight >>= reconPic->m_vChromaShift;
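The reference.cpp change above computes the height of the (possibly partial) last CTU row once, as the remainder of the picture height, falling back to a full CTU when the picture height divides evenly. A tiny sketch of that calculation, assuming ctuSize is a power of two as g_maxCUSize is:

#include <cstdio>

// Height of the last CTU row; ctuSize must be a power of two.
static int lastRowHeight(int picHeight, int ctuSize) {
    int left = picHeight & (ctuSize - 1);   // same as picHeight % ctuSize
    return left ? left : ctuSize;
}

int main() {
    std::printf("%d\n", lastRowHeight(1080, 64)); // 1080 = 16*64 + 56 -> prints 56
    std::printf("%d\n", lastRowHeight(1088, 64)); // exact multiple -> full 64
    return 0;
}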
View file
x265_2.1.tar.gz/source/encoder/sao.cpp -> x265_2.2.tar.gz/source/encoder/sao.cpp
Changed
@@ -1208,10 +1208,15 @@ if (!saoParam->bSaoFlag[0]) m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0; else + { + X265_CHECK(m_numNoSao[0] <= numctus, "m_numNoSao check failure!"); m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[0] / ((double)numctus); + } if (!saoParam->bSaoFlag[1]) + { m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0; + } else m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[1] / ((double)numctus); }
View file
x265_2.1.tar.gz/source/encoder/search.cpp -> x265_2.2.tar.gz/source/encoder/search.cpp
Changed
@@ -67,6 +67,7 @@ m_param = NULL; m_slice = NULL; m_frame = NULL; + m_maxTUDepth = -1; } bool Search::initSearch(const x265_param& param, ScalingList& scalingList) @@ -93,6 +94,19 @@ uint32_t sizeC = sizeL >> (m_hChromaShift + m_vChromaShift); uint32_t numPartitions = 1 << (maxLog2CUSize - LOG2_UNIT_SIZE) * 2; + m_limitTU = 0; + if (m_param->limitTU) + { + if (m_param->limitTU == 1) + m_limitTU = X265_TU_LIMIT_BFS; + else if (m_param->limitTU == 2) + m_limitTU = X265_TU_LIMIT_DFS; + else if (m_param->limitTU == 3) + m_limitTU = X265_TU_LIMIT_NEIGH; + else if (m_param->limitTU == 4) + m_limitTU = X265_TU_LIMIT_DFS + X265_TU_LIMIT_NEIGH; + } + /* these are indexed by qtLayer (log2size - 2) so nominally 0=4x4, 1=8x8, 2=16x16, 3=32x32 * the coeffRQT and reconQtYuv are allocated to the max CU size at every depth. The parts * which are reconstructed at each depth are valid. At the end, the transform depth table @@ -2131,6 +2145,13 @@ int mvpIdx = selectMVP(cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; + if (m_param->searchMethod == X265_SEA) + { + int puX = puIdx & 1; + int puY = puIdx >> 1; + for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++) + m_me.integral[planes] = interMode.fencYuv->m_integral[list][ref][planes] + puX * pu.width + puY * pu.height * m_slice->m_refFrameList[list][ref]->m_reconPic->m_stride; + } setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv, m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0); @@ -2229,7 +2250,13 @@ if (lmv.notZero()) mvc[numMvc++] = lmv; } - + if (m_param->searchMethod == X265_SEA) + { + int puX = puIdx & 1; + int puY = puIdx >> 1; + for (int planes = 0; planes < INTEGRAL_PLANE_NUM; planes++) + m_me.integral[planes] = interMode.fencYuv->m_integral[list][ref][planes] + puX * pu.width + puY * pu.height * m_slice->m_refFrameList[list][ref]->m_reconPic->m_stride; + } setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv, m_param->bSourceReferenceEstimation ? 
m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0); @@ -2544,6 +2571,9 @@ /* conditional clipping for frame parallelism */ mvmin.y = X265_MIN(mvmin.y, (int16_t)m_refLagPixels); mvmax.y = X265_MIN(mvmax.y, (int16_t)m_refLagPixels); + + /* conditional clipping for negative mv range */ + mvmax.y = X265_MAX(mvmax.y, mvmin.y); } /* Note: this function overwrites the RD cost variables of interMode, but leaves the sa8d cost unharmed */ @@ -2617,8 +2647,29 @@ m_entropyCoder.load(m_rqt[depth].cur); + if ((m_limitTU & X265_TU_LIMIT_DFS) && !(m_limitTU & X265_TU_LIMIT_NEIGH)) + m_maxTUDepth = -1; + else if (m_limitTU & X265_TU_LIMIT_BFS) + memset(&m_cacheTU, 0, sizeof(TUInfoCache)); + Cost costs; - estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange); + if (m_limitTU & X265_TU_LIMIT_NEIGH) + { + /* Save and reload maxTUDepth to avoid changing of maxTUDepth between modes */ + int32_t tempDepth = m_maxTUDepth; + if (m_maxTUDepth != -1) + { + uint32_t splitFlag = interMode.cu.m_partSize[0] != SIZE_2Nx2N; + uint32_t minSize = tuDepthRange[0]; + uint32_t maxSize = tuDepthRange[1]; + maxSize = X265_MIN(maxSize, cuGeom.log2CUSize - splitFlag); + m_maxTUDepth = x265_clip3(cuGeom.log2CUSize - maxSize, cuGeom.log2CUSize - minSize, (uint32_t)m_maxTUDepth); + } + estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange); + m_maxTUDepth = tempDepth; + } + else + estimateResidualQT(interMode, cuGeom, 0, 0, *resiYuv, costs, tuDepthRange); uint32_t tqBypass = cu.m_tqBypass[0]; if (!tqBypass) @@ -2867,7 +2918,57 @@ return m_rdCost.calcRdCost(dist, nullBits); } -void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& outCosts, const uint32_t depthRange[2]) +bool Search::splitTU(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& splitCost, const uint32_t depthRange[2], int32_t splitMore) +{ + CUData& cu = mode.cu; + uint32_t depth = cuGeom.depth + tuDepth; + uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth; + + uint32_t qNumParts = 1 << (log2TrSize - 1 - LOG2_UNIT_SIZE) * 2; + uint32_t ycbf = 0, ucbf = 0, vcbf = 0; + for (uint32_t qIdx = 0, qPartIdx = absPartIdx; qIdx < 4; ++qIdx, qPartIdx += qNumParts) + { + if ((m_limitTU & X265_TU_LIMIT_DFS) && tuDepth == 0 && qIdx == 1) + { + m_maxTUDepth = cu.m_tuDepth[0]; + // Fetch maximum TU depth of first sub partition to limit recursion of others + for (uint32_t i = 1; i < cuGeom.numPartitions / 4; i++) + m_maxTUDepth = X265_MAX(m_maxTUDepth, cu.m_tuDepth[i]); + } + estimateResidualQT(mode, cuGeom, qPartIdx, tuDepth + 1, resiYuv, splitCost, depthRange, splitMore); + ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) + { + ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); + vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); + } + } + cu.m_cbf[0][absPartIdx] |= ycbf << tuDepth; + if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) + { + cu.m_cbf[1][absPartIdx] |= ucbf << tuDepth; + cu.m_cbf[2][absPartIdx] |= vcbf << tuDepth; + } + + // Here we were encoding cbfs and coefficients for splitted blocks. Since I have collected coefficient bits + // for each individual blocks, only encoding cbf values. As I mentioned encoding chroma cbfs is different then luma. 
+ // But have one doubt that if coefficients are encoded in context at depth 2 (for example) and cbfs are encoded in context + // at depth 0 (for example). + m_entropyCoder.load(m_rqt[depth].rqtRoot); + m_entropyCoder.resetBits(); + codeInterSubdivCbfQT(cu, absPartIdx, tuDepth, depthRange); + uint32_t splitCbfBits = m_entropyCoder.getNumberOfWrittenBits(); + splitCost.bits += splitCbfBits; + + if (m_rdCost.m_psyRd) + splitCost.rdcost = m_rdCost.calcPsyRdCost(splitCost.distortion, splitCost.bits, splitCost.energy); + else + splitCost.rdcost = m_rdCost.calcRdCost(splitCost.distortion, splitCost.bits); + + return ycbf || ucbf || vcbf; +} + +void Search::estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& outCosts, const uint32_t depthRange[2], int32_t splitMore) { CUData& cu = mode.cu; uint32_t depth = cuGeom.depth + tuDepth; @@ -2876,6 +2977,37 @@ bool bCheckSplit = log2TrSize > depthRange[0]; bool bCheckFull = log2TrSize <= depthRange[1]; + bool bSaveTUData = false, bLoadTUData = false; + uint32_t idx = 0; + + if ((m_limitTU & X265_TU_LIMIT_BFS) && splitMore >= 0) + { + if (bCheckSplit && bCheckFull && tuDepth) + { + uint32_t qNumParts = 1 << (log2TrSize - LOG2_UNIT_SIZE) * 2; + uint32_t qIdx = (absPartIdx / qNumParts) % 4; + idx = (depth - 1) * 4 + qIdx; + if (splitMore) + { + bLoadTUData = true; + bCheckFull = false; + } + else + { + bSaveTUData = true; + bCheckSplit = false; + } + } + } + else if (m_limitTU & X265_TU_LIMIT_DFS || m_limitTU & X265_TU_LIMIT_NEIGH) + { + if (bCheckSplit && m_maxTUDepth >= 0) + { + uint32_t log2MaxTrSize = cuGeom.log2CUSize - m_maxTUDepth; + bCheckSplit = log2TrSize > log2MaxTrSize; + } + } + bool bSplitPresentFlag = bCheckSplit && bCheckFull; if (cu.m_partSize[0] != SIZE_2Nx2N && !tuDepth && bCheckSplit) @@ -3194,6 +3326,8 @@ singlePsyEnergy[TEXT_LUMA][0] = nonZeroPsyEnergyY; cbfFlag[TEXT_LUMA][0] = !!numSigTSkipY; bestTransformMode[TEXT_LUMA][0] = 1; + if (m_param->limitTU) + numSig[TEXT_LUMA][0] = numSigTSkipY; uint32_t numCoeffY = 1 << (log2TrSize << 1); memcpy(coeffCurY, m_tsCoeff, sizeof(coeff_t) * numCoeffY); primitives.cu[partSize].copy_ss(curResiY, strideResiY, m_tsResidual, trSize); @@ -3331,6 +3465,50 @@ fullCost.rdcost = m_rdCost.calcPsyRdCost(fullCost.distortion, fullCost.bits, fullCost.energy); else fullCost.rdcost = m_rdCost.calcRdCost(fullCost.distortion, fullCost.bits); + + if (m_param->limitTU && bCheckSplit) + { + // Stop recursion if the TU's energy level is minimal + uint32_t numCoeff = trSize * trSize; + if (cbfFlag[TEXT_LUMA][0] == 0) + bCheckSplit = false; + else if (numSig[TEXT_LUMA][0] < (numCoeff / 64)) + { + uint32_t energy = 0; + for (uint32_t i = 0; i < numCoeff; i++) + energy += abs(coeffCurY[i]); + if (energy == numSig[TEXT_LUMA][0]) + bCheckSplit = false; + } + } + + if (bSaveTUData) + { + for (int plane = 0; plane < MAX_NUM_COMPONENT; plane++) + { + for(int part = 0; part < (m_csp == X265_CSP_I422) + 1; part++) + { + m_cacheTU.bestTransformMode[idx][plane][part] = bestTransformMode[plane][part]; + m_cacheTU.cbfFlag[idx][plane][part] = cbfFlag[plane][part]; + } + } + m_cacheTU.cost[idx] = fullCost; + m_entropyCoder.store(m_cacheTU.rqtStore[idx]); + } + } + if (bLoadTUData) + { + for (int plane = 0; plane < MAX_NUM_COMPONENT; plane++) + { + for(int part = 0; part < (m_csp == X265_CSP_I422) + 1; part++) + { + bestTransformMode[plane][part] = m_cacheTU.bestTransformMode[idx][plane][part]; + cbfFlag[plane][part] = m_cacheTU.cbfFlag[idx][plane][part]; + } + } + 
fullCost = m_cacheTU.cost[idx]; + m_entropyCoder.load(m_cacheTU.rqtStore[idx]); + bCheckFull = true; } // code sub-blocks @@ -3351,45 +3529,29 @@ splitCost.bits = m_entropyCoder.getNumberOfWrittenBits(); } - uint32_t qNumParts = 1 << (log2TrSize - 1 - LOG2_UNIT_SIZE) * 2; - uint32_t ycbf = 0, ucbf = 0, vcbf = 0; - for (uint32_t qIdx = 0, qPartIdx = absPartIdx; qIdx < 4; ++qIdx, qPartIdx += qNumParts) - { - estimateResidualQT(mode, cuGeom, qPartIdx, tuDepth + 1, resiYuv, splitCost, depthRange); - ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); - if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) - { - ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); - vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); - } - } - cu.m_cbf[0][absPartIdx] |= ycbf << tuDepth; - if (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400) - { - cu.m_cbf[1][absPartIdx] |= ucbf << tuDepth; - cu.m_cbf[2][absPartIdx] |= vcbf << tuDepth; - } - - // Here we were encoding cbfs and coefficients for splitted blocks. Since I have collected coefficient bits - // for each individual blocks, only encoding cbf values. As I mentioned encoding chroma cbfs is different then luma. - // But have one doubt that if coefficients are encoded in context at depth 2 (for example) and cbfs are encoded in context - // at depth 0 (for example). - m_entropyCoder.load(m_rqt[depth].rqtRoot); - m_entropyCoder.resetBits(); - - codeInterSubdivCbfQT(cu, absPartIdx, tuDepth, depthRange); - uint32_t splitCbfBits = m_entropyCoder.getNumberOfWrittenBits(); - splitCost.bits += splitCbfBits; - - if (m_rdCost.m_psyRd) - splitCost.rdcost = m_rdCost.calcPsyRdCost(splitCost.distortion, splitCost.bits, splitCost.energy); - else - splitCost.rdcost = m_rdCost.calcRdCost(splitCost.distortion, splitCost.bits); - - if (ycbf || ucbf || vcbf || !bCheckFull) + bool yCbCrCbf = splitTU(mode, cuGeom, absPartIdx, tuDepth, resiYuv, splitCost, depthRange, 0); + if (yCbCrCbf || !bCheckFull) { if (splitCost.rdcost < fullCost.rdcost) { + if (m_limitTU & X265_TU_LIMIT_BFS) + { + uint32_t nextlog2TrSize = cuGeom.log2CUSize - (tuDepth + 1); + bool nextSplit = nextlog2TrSize > depthRange[0]; + if (nextSplit) + { + m_entropyCoder.load(m_rqt[depth].rqtRoot); + splitCost.bits = splitCost.distortion = splitCost.rdcost = splitCost.energy = 0; + if (bSplitPresentFlag && (log2TrSize <= depthRange[1] && log2TrSize > depthRange[0])) + { + // Subdiv flag can be encoded at the start of analysis of split blocks. + m_entropyCoder.resetBits(); + m_entropyCoder.codeTransformSubdivFlag(1, 5 - log2TrSize); + splitCost.bits = m_entropyCoder.getNumberOfWrittenBits(); + } + splitTU(mode, cuGeom, absPartIdx, tuDepth, resiYuv, splitCost, depthRange, 1); + } + } outCosts.distortion += splitCost.distortion; outCosts.rdcost += splitCost.rdcost; outCosts.bits += splitCost.bits;
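The search.cpp hunks above gate residual quad-tree recursion through m_limitTU, mapping the --limit-tu levels 1-4 onto combinable BFS/DFS/NEIGH flags. The sketch below shows only that mapping; the enum values and comments are illustrative, and the real X265_TU_LIMIT_* constants are defined elsewhere in the x265 headers.

#include <cstdio>

// Illustrative flag values; not the real X265_TU_LIMIT_* definitions.
enum TuLimit { TU_LIMIT_NONE = 0, TU_LIMIT_BFS = 1, TU_LIMIT_DFS = 2, TU_LIMIT_NEIGH = 4 };

static unsigned limitTuFlags(int limitTu) {
    switch (limitTu) {
    case 1:  return TU_LIMIT_BFS;                  // reuse cached TU decisions (m_cacheTU)
    case 2:  return TU_LIMIT_DFS;                  // cap depth from the first sub-TU (m_maxTUDepth)
    case 3:  return TU_LIMIT_NEIGH;                // cap depth carried across modes
    case 4:  return TU_LIMIT_DFS | TU_LIMIT_NEIGH; // combine both caps
    default: return TU_LIMIT_NONE;
    }
}

int main() {
    for (int level = 0; level <= 4; level++)
        std::printf("--limit-tu %d -> flags 0x%x\n", level, limitTuFlags(level));
    return 0;
}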
View file
x265_2.1.tar.gz/source/encoder/search.h -> x265_2.2.tar.gz/source/encoder/search.h
Changed
@@ -49,6 +49,8 @@ #define ProfileCounter(cu, count) #endif +#define NUM_SUBPART MAX_TS_SIZE * 4 // 4 sub partitions * 4 depth + namespace X265_NS { // private namespace @@ -275,6 +277,9 @@ uint32_t m_numLayers; uint32_t m_refLagPixels; + int32_t m_maxTUDepth; + uint16_t m_limitTU; + int16_t m_sliceMaxY; int16_t m_sliceMinY; @@ -377,8 +382,17 @@ Cost() { rdcost = 0; bits = 0; distortion = 0; energy = 0; } }; + struct TUInfoCache + { + Cost cost[NUM_SUBPART]; + uint32_t bestTransformMode[NUM_SUBPART][MAX_NUM_COMPONENT][2]; + uint8_t cbfFlag[NUM_SUBPART][MAX_NUM_COMPONENT][2]; + Entropy rqtStore[NUM_SUBPART]; + } m_cacheTU; + uint64_t estimateNullCbfCost(sse_t dist, uint32_t psyEnergy, uint32_t tuDepth, TextType compId); - void estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t depth, ShortYuv& resiYuv, Cost& costs, const uint32_t depthRange[2]); + bool splitTU(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t tuDepth, ShortYuv& resiYuv, Cost& splitCost, const uint32_t depthRange[2], int32_t splitMore); + void estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t depth, ShortYuv& resiYuv, Cost& costs, const uint32_t depthRange[2], int32_t splitMore = -1); // generate prediction, generate residual and recon. if bAllowSplit, find optimal RQT splits void codeIntraLumaQT(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, bool bAllowSplit, Cost& costs, const uint32_t depthRange[2]);
View file
x265_2.1.tar.gz/source/encoder/slicetype.cpp -> x265_2.2.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -1617,7 +1617,7 @@ /* magic numbers pulled out of thin air */ float threshMin = (float)(threshMax * 0.25); - double bias = 0.05; + double bias = m_param->scenecutBias; if (bRealScenecut) { if (m_param->keyframeMin == m_param->keyframeMax)
View file
x265_2.1.tar.gz/source/input/y4m.cpp -> x265_2.2.tar.gz/source/input/y4m.cpp
Changed
@@ -280,7 +280,7 @@ { c = ifs->get(); - if (c <= '9' && c >= '0') + if (c <= 'o' && c >= '0') csp = csp * 10 + (c - '0'); else if (c == 'p') { @@ -300,9 +300,23 @@ break; } - if (d >= 8 && d <= 16) - depth = d; - colorSpace = (csp == 444) ? X265_CSP_I444 : (csp == 422) ? X265_CSP_I422 : X265_CSP_I420; + switch (csp) + { + case ('m'-'0')*100000 + ('o'-'0')*10000 + ('n'-'0')*1000 + ('o'-'0')*100 + 16: + colorSpace = X265_CSP_I400; + depth = 16; + break; + + case ('m'-'0')*1000 + ('o'-'0')*100 + ('n'-'0')*10 + ('o'-'0'): + colorSpace = X265_CSP_I400; + depth = 8; + break; + + default: + if (d >= 8 && d <= 16) + depth = d; + colorSpace = (csp == 444) ? X265_CSP_I444 : (csp == 422) ? X265_CSP_I422 : X265_CSP_I420; + } break; default: @@ -324,7 +338,7 @@ if (width < MIN_FRAME_WIDTH || width > MAX_FRAME_WIDTH || height < MIN_FRAME_HEIGHT || height > MAX_FRAME_HEIGHT || (rateNum / rateDenom) < 1 || (rateNum / rateDenom) > MAX_FRAME_RATE || - colorSpace <= X265_CSP_I400 || colorSpace >= X265_CSP_COUNT) + colorSpace < X265_CSP_I400 || colorSpace >= X265_CSP_COUNT) return false; return true;
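The y4m.cpp parser above now accepts letters up to 'o' in the colorspace token and folds them into the same base-10 accumulator as digits, which is why the strings mono and mono16 land exactly on the two fixed integer codes matched by the switch. The sketch below reproduces that folding and checks it against the same constants; foldCspToken is an illustrative helper, not the x265 parser.

#include <cstdio>
#include <string>

// Fold a y4m colorspace token the way the parser does:
// every char c in '0'..'o' contributes (c - '0') to a base-10 accumulator.
static long long foldCspToken(const std::string& token) {
    long long csp = 0;
    for (char c : token)
        if (c >= '0' && c <= 'o')
            csp = csp * 10 + (c - '0');
    return csp;
}

int main() {
    long long mono   = ('m'-'0')*1000 + ('o'-'0')*100 + ('n'-'0')*10 + ('o'-'0');
    long long mono16 = ('m'-'0')*100000LL + ('o'-'0')*10000 + ('n'-'0')*1000 + ('o'-'0')*100 + 16;
    std::printf("mono   matches: %d\n", foldCspToken("mono") == mono);     // prints 1
    std::printf("mono16 matches: %d\n", foldCspToken("mono16") == mono16); // prints 1
    return 0;
}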
View file
x265_2.1.tar.gz/source/test/rate-control-tests.txt -> x265_2.2.tar.gz/source/test/rate-control-tests.txt
Changed
@@ -21,6 +21,9 @@ big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd --crf-max 30 sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr +BasketballDrive_1920x1080_50.y4m,--preset ultrafast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --no-wpp +big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --no-wpp --aud --hrd --tune fast-decode +sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr --no-wpp @@ -38,4 +41,5 @@ RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4 RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500 --vbv-bufsize 700 --pass 2 -F4 - +sita_1920x1080_30.yuv, --preset ultrafast --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 1 --vbv-bufsize 7000 --vbv-maxrate 5000, --preset ultrafast --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 2 --vbv-bufsize 7000 --vbv-maxrate 5000 --repeat-headers +sita_1920x1080_30.yuv, --preset medium --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 1 --vbv-bufsize 7000 --vbv-maxrate 5000 --repeat-headers --multi-pass-opt-rps, --preset medium --crf 20 --no-cutree --keyint 50 --min-keyint 50 --no-open-gop --pass 2 --vbv-bufsize 7000 --vbv-maxrate 5000 --repeat-headers --multi-pass-opt-rps
View file
x265_2.1.tar.gz/source/test/regression-tests.txt -> x265_2.2.tar.gz/source/test/regression-tests.txt
Changed
@@ -14,20 +14,21 @@
 BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
 BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp --limit-modes
 BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp
-BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless
+BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190 --slices 3
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless --tu-inter-depth 3 --limit-tu 1
 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
 BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 7000 --limit-modes,--preset medium --no-cutree --analysis-mode=load --bitrate 7000 --limit-modes
 BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1
-BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
+BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4
 BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-mode=save --bitrate 7000,--preset slower --no-cutree --analysis-mode=load --bitrate 7000
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 7000 --tskip-fast,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 7000 --tskip-fast --limit-tu 4,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast --limit-tu 4
 BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
 Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
 Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
+Coastguard-4k.y4m,--preset superfast --tune grain --pme --aq-strength 2 --merange 190
 Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-mode=save --bitrate 15000,--preset veryfast --no-cutree --analysis-mode=load --bitrate 15000
-Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh
+Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh --slices 2
 Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
@@ -41,13 +42,14 @@
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers --limit-refs 2
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1 --limit-modes
-CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut --limit-tu 1
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16 --limit-modes
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
-DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3 --tu-inter-depth 4 --limit-tu 3
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset fast --no-cutree --analysis-mode=save --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1,--preset fast --no-cutree --analysis-mode=load --bitrate 3000 --early-skip --tu-inter-depth 3 --limit-tu 1
 FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2
 FourPeople_1280x720_60.y4m,--preset veryfast --aq-mode 2 --aq-strength 1.5 --qg-size 8
 FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd
@@ -61,24 +63,27 @@
 KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
 KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16 --limit-refs 1
 KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
-KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes
+KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes --limit-tu 1
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-mode=save --bitrate 9000,--preset slow --no-cutree --analysis-mode=load --bitrate 9000
 News-4k.y4m,--preset ultrafast --no-cutree --analysis-mode=save --bitrate 15000,--preset ultrafast --no-cutree --analysis-mode=load --bitrate 15000
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
+News-4k.y4m,--preset superfast --slices 4 --aq-mode 0
 News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16
 News-4k.y4m,--preset veryslow --no-rskip
+News-4k.y4m,--preset veryslow --pme --crf 40
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
 ParkScene_1920x1080_24_10bit_444.yuv,--preset superfast --weightp --lookahead-slices 4
 ParkScene_1920x1080_24.y4m,--preset medium --qp 40 --rdpenalty 2 --tu-intra-depth 3
+ParkScene_1920x1080_24.y4m,--preset medium --pme --tskip-fast --tskip --min-keyint 48 --weightb --limit-refs 3
 ParkScene_1920x1080_24.y4m,--preset slower --no-weightp
 RaceHorses_416x240_30.y4m,--preset superfast --no-cutree
 RaceHorses_416x240_30.y4m,--preset medium --tskip-fast --tskip
-RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0
-RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip --limit-refs 3
+RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0 --limit-tu 2
+RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip --limit-refs 3 --limit-tu 3
 RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr --limit-refs 1
 RaceHorses_416x240_30_10bit.yuv,--preset veryfast --weightb
 RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither
@@ -108,7 +113,7 @@
 ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2
 mobile_calendar_422_ntsc.y4m,--preset superfast --weightp
 mobile_calendar_422_ntsc.y4m,--preset medium --bitrate 500 -F4
-mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast
+mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast --limit-tu 4
 mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip --limit-refs 2
 old_town_cross_444_720p50.y4m,--preset ultrafast --weightp --min-cu 32
 old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16 --limit-modes
@@ -118,6 +123,7 @@
 old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6
 old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid
 old_town_cross_444_720p50.y4m,--preset slower --crf 4 --cu-lossless
+old_town_cross_444_720p50.y4m,--preset veryslow --max-tu-size 4 --min-cu-size 32 --limit-tu 4
 parkrun_ter_720p50.y4m,--preset medium --no-open-gop --sao-non-deblock --crf 4 --cu-lossless
 parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
 silent_cif_420.y4m,--preset superfast --weightp --rect
@@ -133,6 +139,11 @@
 vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
 washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32 --limit-refs 1
 washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 --limit-modes
+washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 --limit-modes --slices 2
+Kimono1_1920x1080_24_400.yuv,--preset ultrafast --slices 1 --weightp --tu-intra-depth 4
+Kimono1_1920x1080_24_400.yuv,--preset medium --rdoq-level 0 --limit-refs 3 --slices 2
+Kimono1_1920x1080_24_400.yuv,--preset veryslow --crf 4 --cu-lossless --slices 2 --limit-refs 3 --limit-modes
+Kimono1_1920x1080_24_400.yuv,--preset placebo --ctu 32 --max-tu-size 8 --limit-tu 2

 # Main12 intraCost overflow bug test
 720p50_parkrun_ter.y4m,--preset medium
@@ -141,4 +152,7 @@
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --interlace tff
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff

+#SEA Implementation Test
+silent_cif_420.y4m,--preset veryslow --me 4
+big_buck_bunny_360p24.y4m,--preset superfast --me 4
 # vim: tw=200
View file
x265_2.1.tar.gz/source/test/smoke-tests.txt -> x265_2.2.tar.gz/source/test/smoke-tests.txt
Changed
@@ -3,10 +3,9 @@
 # consider VBV tests a failure if new bitrate is more than 5% different
 # from the old bitrate
 # vbv-tolerance = 0.05
-
 big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
 big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --pme --qg-size 16
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --qg-size 16
 washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
 washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
 washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
@@ -16,9 +15,10 @@
 RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --max-tu-size 8
 RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
-CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16 --tu-inter-depth 2 --limit-tu 3
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=fast --weightb --interlace bff
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryslow --limit-ref 1 --limit-mode --tskip --limit-tu 1

 # Main12 intraCost overflow bug test
 720p50_parkrun_ter.y4m,--preset medium
View file
x265_2.1.tar.gz/source/x265-extras.cpp -> x265_2.2.tar.gz/source/x265-extras.cpp
Changed
@@ -64,6 +64,8 @@
     fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, Scenecut, ");
     if (param.rc.rateControlMode == X265_RC_CRF)
         fprintf(csvfp, "RateFactor, ");
+    if (param.rc.vbvBufferSize)
+        fprintf(csvfp, "BufferFill, ");
     if (param.bEnablePsnr)
         fprintf(csvfp, "Y PSNR, U PSNR, V PSNR, YUV PSNR, ");
     if (param.bEnableSsim)
@@ -132,6 +134,8 @@
     fprintf(csvfp, "%d, %c-SLICE, %4d, %2.2lf, %10d, %d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc, frameStats->qp, (int)frameStats->bits, frameStats->bScenecut);
     if (param.rc.rateControlMode == X265_RC_CRF)
         fprintf(csvfp, "%.3lf,", frameStats->rateFactor);
+    if (param.rc.vbvBufferSize)
+        fprintf(csvfp, "%.3lf,", frameStats->bufferFill);
     if (param.bEnablePsnr)
         fprintf(csvfp, "%.3lf, %.3lf, %.3lf, %.3lf,", frameStats->psnrY, frameStats->psnrU, frameStats->psnrV, frameStats->psnr);
     if (param.bEnableSsim)
@@ -187,7 +191,7 @@
     fflush(stderr);
 }

-void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv)
+void x265_csvlog_encode(FILE* csvfp, const char* version, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv)
 {
     if (!csvfp)
         return;
@@ -277,7 +281,7 @@
     else
         fprintf(csvfp, " -, -, -, -, -, -, -,");

-    fprintf(csvfp, " %-6u, %-6u, %s\n", stats.maxCLL, stats.maxFALL, api.version_str);
+    fprintf(csvfp, " %-6u, %-6u, %s\n", stats.maxCLL, stats.maxFALL, version);
 }

 /* The dithering algorithm is based on Sierra-2-4A error diffusion.
View file
x265_2.1.tar.gz/source/x265-extras.h -> x265_2.2.tar.gz/source/x265-extras.h
Changed
@@ -53,7 +53,7 @@
 /* Log final encode statistics to the CSV file handle. 'argc' and 'argv' are
  * intended to be command line arguments passed to the encoder. Encode
  * statistics should be queried from the encoder just prior to closing it. */
-LIBAPI void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv);
+LIBAPI void x265_csvlog_encode(FILE* csvfp, const char* version, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv);

 /* In-place downshift from a bit-depth greater than 8 to a bit-depth of 8, using
  * the residual bits to dither each row. */
View file
x265_2.1.tar.gz/source/x265.cpp -> x265_2.2.tar.gz/source/x265.cpp
Changed
@@ -746,7 +746,7 @@
         api->encoder_get_stats(encoder, &stats, sizeof(stats));
         if (cliopt.csvfpt && !b_ctrl_c)
-            x265_csvlog_encode(cliopt.csvfpt, *api, *param, stats, cliopt.csvLogLevel, argc, argv);
+            x265_csvlog_encode(cliopt.csvfpt, api->version_str, *param, stats, cliopt.csvLogLevel, argc, argv);
         api->encoder_close(encoder);

         int64_t second_largest_pts = 0;
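With this change x265_csvlog_encode() no longer needs the whole x265_api struct, only its version string. A minimal caller under the new signature might look like the sketch below (it assumes the x265.h and x265-extras.h headers from this tree and omits error handling).

// Sketch of a caller using the 2.2 signature shown above.
#include <cstdio>
#include "x265.h"
#include "x265-extras.h"

void logEncodeSummary(const x265_api* api, x265_encoder* encoder,
                      const x265_param* param, FILE* csvfp,
                      int csvLogLevel, int argc, char** argv)
{
    x265_stats stats;
    api->encoder_get_stats(encoder, &stats, sizeof(stats));

    // 2.2: pass the version string instead of the x265_api reference.
    x265_csvlog_encode(csvfp, api->version_str, *param, stats,
                       csvLogLevel, argc, argv);
}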
View file
x265_2.1.tar.gz/source/x265.h -> x265_2.2.tar.gz/source/x265.h
Changed
@@ -137,6 +137,7 @@
     double avgPsyEnergy;
     double avgResEnergy;
     double avgLumaLevel;
+    double bufferFill;
     uint64_t bits;
     int encoderOrder;
     int poc;
@@ -289,6 +290,7 @@
     X265_HEX_SEARCH,
     X265_UMH_SEARCH,
     X265_STAR_SEARCH,
+    X265_SEA,
     X265_FULL_SEARCH
 } X265_ME_METHODS;

@@ -334,6 +336,9 @@
 #define X265_CPU_NEON            0x0000002  /* ARM NEON */
 #define X265_CPU_FAST_NEON_MRC   0x0000004  /* Transfer from NEON to ARM register is fast (Cortex-A9) */

+/* IBM Power8 */
+#define X265_CPU_ALTIVEC         0x0000001
+
 #define X265_MAX_SUBPEL_LEVEL   7

 /* Log level */
@@ -351,6 +356,10 @@
 #define X265_REF_LIMIT_DEPTH    1
 #define X265_REF_LIMIT_CU       2

+#define X265_TU_LIMIT_BFS       1
+#define X265_TU_LIMIT_DFS       2
+#define X265_TU_LIMIT_NEIGH     4
+
 #define X265_BFRAME_MAX         16
 #define X265_MAX_FRAME_THREADS  16

@@ -456,7 +465,7 @@
 } x265_stats;

 /* String values accepted by x265_param_parse() (and CLI) for various parameters */
-static const char * const x265_motion_est_names[] = { "dia", "hex", "umh", "star", "full", 0 };
+static const char * const x265_motion_est_names[] = { "dia", "hex", "umh", "star", "sea", "full", 0 };
 static const char * const x265_source_csp_names[] = { "i400", "i420", "i422", "i444", "nv12", "nv16", 0 };
 static const char * const x265_video_format_names[] = { "component", "pal", "ntsc", "secam", "mac", "undef", 0 };
 static const char * const x265_fullrange_names[] = { "limited", "full", 0 };
@@ -823,6 +832,10 @@
      * compressed by the DCT transforms, at the expense of much more compute */
     uint32_t tuQTMaxIntraDepth;

+    /* Enable early exit decisions for inter coded blocks to avoid recursing to
+     * higher TU depths. Default: 0 */
+    uint32_t limitTU;
+
     /* Set the amount of rate-distortion analysis to use within quant. 0 implies
      * no rate-distortion optimization. At level 1 rate-distortion cost is used to
      * find optimal rounding values for each level (and allows psy-rdoq to be
@@ -898,9 +911,9 @@
     /* Limit modes analyzed for each CU using cost metrics from the 4 sub-CUs */
     uint32_t limitModes;

-    /* ME search method (DIA, HEX, UMH, STAR, FULL). The search patterns
+    /* ME search method (DIA, HEX, UMH, STAR, SEA, FULL). The search patterns
      * (methods) are sorted in increasing complexity, with diamond being the
-     * simplest and fastest and full being the slowest. DIA, HEX, and UMH were
+     * simplest and fastest and full being the slowest. DIA, HEX, UMH and SEA were
      * adapted from x264 directly. STAR is an adaption of the HEVC reference
      * encoder's three step search, while full is a naive exhaustive search. The
      * default is the star search, it has a good balance of performance and
@@ -1300,15 +1313,28 @@
     /* Maximum of the picture order count */
     int log2MaxPocLsb;

-    /* Dicard SEI messages when printing */
-    int bDiscardSEI;
-
-    /* Control removing optional vui information (timing, HRD info) to get low bitrate */
-    int bDiscardOptionalVUI;
+    /* Emit VUI Timing info, an optional VUI field */
+    int bEmitVUITimingInfo;
+
+    /* Emit HRD Timing info */
+    int bEmitVUIHRDInfo;

     /* Maximum count of Slices of picture, the value range is [1, maximum rows] */
     unsigned int maxSlices;

+    /* Optimize QP in PPS based on statistics from previous GOP */
+    int bOptQpPPS;
+
+    /* Optimize ref list length in PPS based on stats from previous GOP */
+    int bOptRefListLengthPPS;
+
+    /* Enable storing commonly used RPS in SPS in multi pass mode */
+    int bMultiPassOptRPS;
+
+    /* This value represents the percentage difference between the inter cost and
+     * intra cost of a frame used in scenecut detection. Default 5. */
+    double scenecutBias;
+
 } x265_param;

 /* x265_param_alloc:
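For applications driving libx265 through the public API, the new fields can also be reached through x265_param_parse() using the option spellings introduced in this changeset. The sketch below assumes those spellings are accepted by the parser verbatim; it is not taken from x265 itself.

// Sketch: enabling some of the 2.2 additions via x265_param_parse().
// The option names mirror the CLI switches in this changeset; whether the
// parser accepts exactly these strings is an assumption here.
#include "x265.h"

x265_param* makeParamWith22Features()
{
    x265_param* p = x265_param_alloc();
    x265_param_default_preset(p, "medium", NULL);

    x265_param_parse(p, "limit-tu", "2");                // early-exit TU recursion
    x265_param_parse(p, "me", "sea");                    // new SEA motion search
    x265_param_parse(p, "scenecut-bias", "10");          // percentage bias for scenecuts
    x265_param_parse(p, "opt-qp-pps", "1");              // PPS QP optimization
    x265_param_parse(p, "opt-ref-list-length-pps", "1"); // PPS ref list length optimization

    return p;
}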
View file
x265_2.1.tar.gz/source/x265cli.h -> x265_2.2.tar.gz/source/x265cli.h
Changed
@@ -85,6 +85,7 @@
     { "max-tu-size", required_argument, NULL, 0 },
     { "tu-intra-depth", required_argument, NULL, 0 },
     { "tu-inter-depth", required_argument, NULL, 0 },
+    { "limit-tu", required_argument, NULL, 0 },
     { "me", required_argument, NULL, 0 },
     { "subme", required_argument, NULL, 'm' },
     { "merange", required_argument, NULL, 0 },
@@ -120,6 +121,7 @@
     { "min-keyint", required_argument, NULL, 'i' },
     { "scenecut", required_argument, NULL, 0 },
     { "no-scenecut", no_argument, NULL, 0 },
+    { "scenecut-bias", required_argument, NULL, 0 },
     { "intra-refresh", no_argument, NULL, 0 },
     { "rc-lookahead", required_argument, NULL, 0 },
     { "lookahead-slices", required_argument, NULL, 0 },
@@ -208,8 +210,14 @@
     { "min-luma", required_argument, NULL, 0 },
     { "max-luma", required_argument, NULL, 0 },
     { "log2-max-poc-lsb", required_argument, NULL, 8 },
-    { "discard-sei", no_argument, NULL, 0 },
-    { "discard-vui", no_argument, NULL, 0 },
+    { "vui-timing-info", no_argument, NULL, 0 },
+    { "no-vui-timing-info", no_argument, NULL, 0 },
+    { "vui-hrd-info", no_argument, NULL, 0 },
+    { "no-vui-hrd-info", no_argument, NULL, 0 },
+    { "opt-qp-pps", no_argument, NULL, 0 },
+    { "no-opt-qp-pps", no_argument, NULL, 0 },
+    { "opt-ref-list-length-pps", no_argument, NULL, 0 },
+    { "no-opt-ref-list-length-pps", no_argument, NULL, 0 },
     { "no-dither", no_argument, NULL, 0 },
     { "dither", no_argument, NULL, 0 },
     { "no-repeat-headers", no_argument, NULL, 0 },
@@ -229,6 +237,8 @@
     { "pass", required_argument, NULL, 0 },
     { "slow-firstpass", no_argument, NULL, 0 },
     { "no-slow-firstpass", no_argument, NULL, 0 },
+    { "multi-pass-opt-rps", no_argument, NULL, 0 },
+    { "no-multi-pass-opt-rps", no_argument, NULL, 0 },
     { "analysis-mode", required_argument, NULL, 0 },
     { "analysis-file", required_argument, NULL, 0 },
     { "strict-cbr", no_argument, NULL, 0 },
@@ -317,6 +327,7 @@
     H0(" --max-tu-size <32|16|8|4> Maximum TU size (WxH). Default %d\n", param->maxTUSize);
     H0(" --tu-intra-depth <integer> Max TU recursive depth for intra CUs. Default %d\n", param->tuQTMaxIntraDepth);
     H0(" --tu-inter-depth <integer> Max TU recursive depth for inter CUs. Default %d\n", param->tuQTMaxInterDepth);
+    H0(" --limit-tu <0..4> Enable early exit from TU recursion for inter coded blocks. Default %d\n", param->limitTU);
     H0("\nAnalysis:\n");
     H0(" --rd <1..6> Level of RDO in mode decision 1:least....6:full RDO. Default %d\n", param->rdLevel);
     H0(" --[no-]psy-rd <0..5.0> Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd);
@@ -357,6 +368,7 @@
     H0("-i/--min-keyint <integer> Scenecuts closer together than this are coded as I, not IDR. Default: auto\n");
     H0(" --no-scenecut Disable adaptive I-frame decision\n");
     H0(" --scenecut <integer> How aggressively to insert extra I-frames. Default %d\n", param->scenecutThreshold);
+    H1(" --scenecut-bias <0..100.0> Bias for scenecut detection. Default %.2f\n", param->scenecutBias);
     H0(" --intra-refresh Use Periodic Intra Refresh instead of IDR frames\n");
     H0(" --rc-lookahead <integer> Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth);
     H1(" --lookahead-slices <0..16> Number of slices to use per lookahead cost estimate. Default %d\n", param->lookaheadSlices);
@@ -448,8 +460,11 @@
     H0(" --[no-]aud Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
     H1(" --hash <integer> Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI);
     H0(" --log2-max-poc-lsb <integer> Maximum of the picture order count\n");
-    H0(" --discard-sei Discard SEI packets in bitstream. Default %s\n", OPT(param->bDiscardSEI));
-    H0(" --discard-vui Discard optional VUI information from the bistream. Default %s\n", OPT(param->bDiscardOptionalVUI));
+    H0(" --[no-]vui-timing-info Emit VUI timing information in the bitstream. Default %s\n", OPT(param->bEmitVUITimingInfo));
+    H0(" --[no-]vui-hrd-info Emit VUI HRD information in the bitstream. Default %s\n", OPT(param->bEmitVUIHRDInfo));
+    H0(" --[no-]opt-qp-pps Dynamically optimize QP in PPS (instead of default 26) based on QPs in previous GOP. Default %s\n", OPT(param->bOptQpPPS));
+    H0(" --[no-]opt-ref-list-length-pps Dynamically set L0 and L1 ref list length in PPS (instead of default 0) based on values in last GOP. Default %s\n", OPT(param->bOptRefListLengthPPS));
+    H0(" --[no-]multi-pass-opt-rps Enable storing commonly used RPS in SPS in multi pass mode. Default %s\n", OPT(param->bMultiPassOptRPS));
     H1("\nReconstructed video options (debugging):\n");
     H1("-r/--recon <filename> Reconstructed raw image YUV or Y4M output file name\n");
     H1(" --recon-depth <integer> Bit-depth of reconstructed raw image file. Defaults to input bit depth, or 8 if Y4M\n");