Include Offsets & Fringe Case Fix for outerSize > size && lda = {1, 1, ...} #33
Open: njh80 wants to merge 3 commits into springer13:master from njh80:master
    …luding offset inputs for tensors. This feature includes backwards compatibility for calls to hptt::create_plan without offsets and is therefore not a breaking change.
**Detail:**
*Makefiles* (Makefile, benchmark/Makefile, testframework/Makefile)
FIX: If libomp is not discovered via LD_LIBRARY_PATH (an issue seen on macOS M2), the user can specify its path for the build.
*benchmark/benchmark.cpp*
FEAT: `transpose_ref` is for internal use only, so it was changed without preserving backwards compatibility. The call to it is amended to pass nullptr for the new arguments.
*benchmark/maxFromFiles.py*
FIX: The print statement for the error message now uses parentheses, as Python 3 requires.
*benchmark/reference.cpp*
FEAT: Firstly, the function receives new outerSize (A/B) and offset (A/B) arrays, which are initialised to mimic the size array when nullptrs are supplied. Next, the stepping through B is amended so that the full outerSize is traversed whenever a row of size is exceeded. Offsets are also inserted into the traversal. Behaviour can be verified via DEBUG.
Pseudo-Code is:
for each dimension not the innermost loop of B:
    divide the current position by the size of the next innermost loop of B that we want to traverse
    move across the offset distance as many times as we have exceeded it plus the initial offset
    further move over any space that remains after the end of the block required by size as many times as we exceed it
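The pseudo-code above can be sketched in C++ as follows. This is a minimal illustration and not the code in benchmark/reference.cpp: the helper name `subTensorIndex`, the use of `std::vector`, and the column-major layout (dimension 0 fastest) are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of HPTT): maps the i-th element of a
// sub-tensor of extent `size` to its linear position inside a parent tensor
// of extent `outerSize`, where the sub-tensor starts at `offset` in each
// dimension. Column-major layout (dimension 0 is the fastest) is assumed.
std::size_t subTensorIndex(std::size_t i,
                           const std::vector<std::size_t>& size,
                           const std::vector<std::size_t>& outerSize,
                           const std::vector<std::size_t>& offset)
{
    assert(size.size() == outerSize.size() && size.size() == offset.size());
    std::size_t linear = 0;
    std::size_t stride = 1;  // stride of the current dimension in the parent
    for (std::size_t d = 0; d < size.size(); ++d) {
        const std::size_t coord = i % size[d];   // position within the block
        i /= size[d];                            // carry into the next dimension
        linear += (offset[d] + coord) * stride;  // skip the offset, then step
        stride *= outerSize[d];                  // a full row is outerSize[d] long
    }
    return linear;
}
```

For a 2x2 block at offset (1, 1) inside a 4x3 parent, the four elements land at parent positions 5, 6, 9 and 10. With zero offsets and outerSize equal to size, the mapping reduces to the identity.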
*benchmark/reference.h*
FEAT: Amended template to reflect new inputs of transpose_ref(), namely offsetA, offsetB, outerSizeA and outerSizeB.
*include/compute_node.h*
FEAT: Added three new members to ComputeNode without exceeding the 64-byte cache-line size (programmers be warned: growing the node beyond this leads to unaligned memory in caches!).
First, the offset difference (A - B), which reduces the number of calculations needed to adjust for the offsets when hptt executes. The plan is created with start and end positions that include the offset of B, and the difference is added to obtain the start and end values for A.
FIX: Secondly, the booleans indexA and indexB are true when the leading dimension of A/B is 1 and the index is 0. The original code faltered when the innermost dimensions of A or B had size 1, which caused the transpose_int functions to identify the wrong innermost indices. This was especially problematic with non-zero outerSizes.
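The size constraint can be checked at compile time. The sketch below is illustrative only: `ComputeNodeSketch`, the member types, and the member order are assumptions and may differ from include/compute_node.h. Only the names offDiffAB, indexA and indexB come from this description.

```cpp
#include <cstddef>

// Hypothetical layout for illustration; types and ordering are assumptions.
struct ComputeNodeSketch {
    std::size_t start;        // loop start (includes the offset of B)
    std::size_t end;          // loop end (includes the offset of B)
    std::size_t inc;          // loop increment
    std::size_t lda;          // stride of this index in A
    std::size_t ldb;          // stride of this index in B
    ComputeNodeSketch* next;  // next loop in the plan
    int offDiffAB;            // offset of A minus offset of B for this index
    bool indexA;              // true if lda == 1 and the index is 0
    bool indexB;              // true if ldb == 1 and the index is 0
};

// Fails to compile if the node no longer fits in one 64-byte cache line.
static_assert(sizeof(ComputeNodeSketch) <= 64,
              "ComputeNodeSketch must fit in a 64-byte cache line");

// Position in A that corresponds to loop position i, which already
// includes the offset of B.
inline std::ptrdiff_t positionInA(const ComputeNodeSketch& node, std::size_t i)
{
    return static_cast<std::ptrdiff_t>(i) + node.offDiffAB;
}
```

Storing only the difference means a single addition per access recovers the position in A, rather than tracking two separate offsets in every node.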
*include/hptt.h*
FEAT: Added new template functions that accept offsets for the various floatType variants.
*include/transpose.h*
FEAT: Amended skipIndices and verifyParameter to take offset inputs, as these functions are affected by the inclusion of offsets. Also added the offsets as properties of the Transpose class.
*include/utils.h*
FEAT: Amended the template of accountForRowMajor, as it needs to reorder the offsets in the same way as the other parameters.
*src/hptt.cpp*
FEAT: Implemented new offset templates and amended original templates to point to plan() with nullptrs or offsets where appropriate.
*src/transpose.cpp*
FEAT: Amended plan assignment section to include assignments for the new computeNode members.
FEAT: Included offsets in the fuseIndices, skipIndices and verifyParameters functions where the amendments affect offsets too. Verification checks that offset + size <= outerSize for all dimensions.
FEAT: The axpy functions require the offset differences as well, so these are calculated and the integer/array is passed to the respective functions. The axpy functions themselves are amended accordingly.
FEAT: In the transpose_ functions, offDiffAB is always added to i to get the correct start/end. Where lda/ldb == 1 is checked, plan->indexA/B is also checked to ensure the correct blocking is passed. As a result of the increased robustness, blockingA/B can always be passed with confidence, and loops can be included for cases where the scalar path is reached and lda/ldb is not 1.
FEAT: Included a plethora of DEBUG statements (coding this was very fun).
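The parameter check described for verifyParameters (offset + size <= outerSize in every dimension) can be sketched as below. This is not the code in src/transpose.cpp: the function name `subTensorFits` and the use of `std::vector` are assumptions, and an empty vector stands in for a nullptr argument. Treating a missing offset as zero is also an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical check (not the HPTT implementation): returns true if a block
// of extent `size`, starting at `offset`, fits inside a parent of extent
// `outerSize`. An empty `outerSize` is treated as equal to `size`, and an
// empty `offset` as all zeros, mirroring calls made without these arguments.
bool subTensorFits(const std::vector<int>& size,
                   const std::vector<int>& outerSize,
                   const std::vector<int>& offset)
{
    if (!outerSize.empty() && outerSize.size() != size.size()) return false;
    if (!offset.empty() && offset.size() != size.size()) return false;

    for (std::size_t d = 0; d < size.size(); ++d) {
        const int outer = outerSize.empty() ? size[d] : outerSize[d];
        const int off = offset.empty() ? 0 : offset[d];
        if (size[d] <= 0 || off < 0) return false;
        if (off + size[d] > outer) return false;  // block would overrun the parent
    }
    return true;
}
```

A 2x3 block at offset (2, 2) fits in a 4x5 parent, whereas the same block at offset (3, 2) does not, because 3 + 2 > 4 in the first dimension.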
*src/utils.cpp*
FEAT: Implemented accountForRowMajor changes for offsets mirroring the behaviour for outerSizes.
*testframework/testframework.cpp*
FEAT: Improved testing to include triggerable outerSize != size and offsets with strings printed for DEBUG cases.
FEAT: Error messages modified for clarity.
Sub-tensors often omit their innermost dimension, meaning that they access their source data without an inner stride of one. This commit adds a basic level of support for this, in a similar way to the support for offsets. Inner strides are optional arguments and are supplied as integers.
*benchmark/benchmark.cpp*
Amends the call to `transpose_ref` to pass nullptrs for the inner strides.
*benchmark/reference.cpp*
`transpose_ref` can now receive inner strides, which is used for evaluating tests.
*benchmark/reference.h*
Amends the template function.
*include/hptt.h*
Creates overloads of the create_plan calls that include innerStrides (size_t).
*include/transpose.h*
Amends functions to receive innerStrides as inputs.
*src/hptt.cpp*
Includes the new overloads and amends the existing ones to pass nullptr where inner strides are not supplied.
*src/transpose.cpp*
Amends the execution to account for innerStrides. As the `transpose_int` functions are not part of the `Transpose` class, the innerStrides must be passed as new arguments, unchanged throughout, to `macro_kernel_scalar` and `micro_kernel`, which then use the new strides. Support for the ARM and AVX architectures has been attempted, but its execution is unchecked as the author does not have access to these architectures. No support has been included for the B buffer case in the macro-kernel, which in theory could be added. Further, and this applies to the offset version as well, no changes have been made to the plan generation stages, but their effectiveness will likely be altered by these commits.
*testframework/testframework.cpp*
Tests have been added for innerStrides of 1 or 2 (in theory larger strides are fine but exceed the memory capabilities of my device).
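To make the effect of an inner stride concrete, here is a scalar 2D sketch. It is not the HPTT kernel: the function name `transpose2dStrided`, its signature, and the column-major layout are assumptions for illustration. Element (i, j) of A is read from `i * innerStrideA + j * lda` and written to element (j, i) of B at `j * innerStrideB + i * ldb`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar reference (not part of HPTT): computes
// B(j, i) = alpha * A(i, j) + beta * B(j, i) for a rows x cols matrix A,
// where consecutive elements along the fastest dimension are innerStrideA
// (for A) or innerStrideB (for B) apart. Column-major layout is assumed.
std::vector<float> transpose2dStrided(const std::vector<float>& A,
                                      std::vector<float> B,
                                      std::size_t rows, std::size_t cols,
                                      std::size_t lda, std::size_t ldb,
                                      std::size_t innerStrideA,
                                      std::size_t innerStrideB,
                                      float alpha, float beta)
{
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            const std::size_t a = i * innerStrideA + j * lda;  // A(i, j)
            const std::size_t b = j * innerStrideB + i * ldb;  // B(j, i)
            B[b] = alpha * A[a] + beta * B[b];
        }
    }
    return B;  // the updated copy of B
}
```

With an inner stride of 2 for A, every second stored value is skipped, so a 2x3 matrix occupies a buffer with lda = 4. Setting both inner strides to 1 gives the usual dense transpose.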
…en one and the number of dimensions (a small random number), and the number of dimensions any value between 1 and MAX_DIM again.
  
Headline
FEAT: Introduced increased flexibility for handling subtensors by including offset inputs for tensors. This feature includes backwards compatibility for calls to hptt::create_plan without offsets and is therefore not a breaking change.
Performance
Passes all testFramework.cpp tests.
Benchmark Output: hptt_benchmark.txt