Would it be good to create a library object and use LD_PRELOAD to support self-profiling? #1

Open
yskelg opened this issue Aug 7, 2024 · 2 comments

yskelg commented Aug 7, 2024

Wow, this is a really great idea. Thank you for the inspiration, @ThinkOpenly.

Using LD_PRELOAD to start profiling from a constructor at program start and stop it at exit would be very convenient for profiling other programs!
If we design the library's interface carefully, we could also measure specific functions.
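
A minimal sketch of what such a preloaded library might look like, assuming Linux and calling `perf_event_open(2)` directly rather than going through this project's own API. The file name, the `sketch_*` names, and the output format are all hypothetical, and the counter is hard-coded to CPU cycles for brevity.

```c
/* preload_sketch.c (hypothetical name)
 * Build: gcc -shared -fPIC -o preload_sketch.so preload_sketch.c
 * Run:   LD_PRELOAD=./preload_sketch.so ./some_program
 */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static int sketch_fd = -1;   /* -1 means the counter could not be opened */

/* glibc has no wrapper for perf_event_open, so go through syscall(2). */
static int sketch_open_cycles(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;          /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;    /* count user-space cycles only */
    attr.exclude_hv = 1;
    /* pid = 0, cpu = -1: this process, on whichever CPU it runs */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Lets other code see whether the counter is open (hypothetical helper). */
int sketch_counter_fd(void)
{
    return sketch_fd;
}

/* Runs when the dynamic loader loads the library, before main(). */
__attribute__((constructor))
static void sketch_begin(void)
{
    sketch_fd = sketch_open_cycles();
    if (sketch_fd < 0)
        return;                 /* e.g. not permitted; stay silent */
    ioctl(sketch_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(sketch_fd, PERF_EVENT_IOC_ENABLE, 0);
}

/* Runs at normal process exit (return from main() or exit()). */
__attribute__((destructor))
static void sketch_end(void)
{
    uint64_t cycles = 0;

    if (sketch_fd < 0)
        return;
    ioctl(sketch_fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(sketch_fd, &cycles, sizeof cycles) == (ssize_t)sizeof cycles)
        fprintf(stderr, "sketch: cpu cycles = %llu\n",
                (unsigned long long)cycles);
    close(sketch_fd);
}
```

Because the counter is enabled in the constructor and read in the destructor, the reported number covers everything between library load and process exit, including startup and teardown work outside `main()`. A destructor does not run if the program terminates through `_exit()` or a fatal signal.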

ThinkOpenly (Owner) commented

> Wow, this is a really great idea. Thank you for the inspiration, @ThinkOpenly.

I'm pleased that you like it!

> Using LD_PRELOAD to start profiling from a constructor at program start and stop it at exit would be very convenient for profiling other programs!

Are you suggesting creating a library to be pre-loaded with a constructor that implements PROFILE_BEGIN and a destructor that implements PROFILE_END? That could work, but the constructor should set a flag that the other API methods would need to check every time, just in case LD_PRELOAD was not specified. This adds a bit of additional overhead.
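
The per-call check described here could look roughly like the following. This is only a sketch: every `sketch_*` name is hypothetical, and everything sits in one file for brevity. In a real split, the constructor would live in the preloaded object and the flag would have to be visible to both sides.

```c
/* 0 until the preloaded library announces itself. */
static int sketch_active;
static unsigned long sketch_regions_started;

/* Hypothetical hook: the preloaded library's constructor would call this. */
void sketch_on_preload(void)
{
    sketch_active = 1;
}

/* Hypothetical API entry point.
 * Returns 1 if it did profiling work, 0 if it was a no-op. */
int sketch_profile_start(void)
{
    if (!sketch_active)     /* the extra check paid on every call */
        return 0;
    sketch_regions_started++;
    return 1;
}

/* Number of regions actually started so far. */
unsigned long sketch_started_count(void)
{
    return sketch_regions_started;
}
```

The cost in question is the single load and branch at the top of `sketch_profile_start()`. It is small, but it is paid on every call whether or not the library was preloaded.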

Do you see significant advantage to using LD_PRELOAD in place of PROFILE_BEGIN/PROFILE_END?

> If we design the library's interface carefully, we could also measure specific functions.

Tell me more about what you are suggesting here.

The current implementation requires that the code to be profiled be instrumented with PROFILE_START/PROFILE_STOP. Are you suggesting there is a way to avoid having to instrument/compile/link by using LD_PRELOAD?

yskelg pushed a commit to yskelg/self-profiling that referenced this issue Aug 8, 2024
This code is a simple implementation of my idea, with a focus on making
self-profile portable.
It seems useful, even with the extra call to the main function, because
this method appears to have less overhead than using "perf record" directly.
The code can still be inserted directly, in keeping with its original purpose.

Here are the test results from my Raspberry Pi 5, gcc version 12.2.0 (Debian 12.2.0-14)

$ uname -a
Linux paran 6.10.1-v8-16k+ #1 SMP PREEMPT Sat Jul 27 17:52:03 KST 2024 aarch64 GNU/Linux

$ make run
export PERF_COUNT_HW_CPU_CYCLES=1; ./test_profile
Sorting...
00: { "H", 107, 0.900000 }
01: { "I", 111, 0.900000 }
02: { "G", 117, 0.900000 }
03: { "E", 127, 0.900000 }
04: { "F", 147, 0.900000 }
05: { "A", 157, 0.900000 }
06: { "K", 157, 0.900000 }
07: { "L", 157, 0.900000 }
08: { "M", 157, 0.900000 }
09: { "N", 157, 0.900000 }
10: { "O", 157, 0.900000 }
11: { "P", 157, 0.900000 }
12: { "Z", 157, 0.900000 }
13: { "C", 175, 0.900000 }
14: { "J", 227, 0.900000 }
15: { "B", 517, 0.900000 }
16: { "D", 571, 0.900000 }
PERF_COUNT_HW_CPU_CYCLES(0): 7970
export PERF_COUNT_HW_CPU_CYCLES=1; LD_PRELOAD=self-profile.so ./preload_test_profile
Sorting...
PERF_COUNT_HW_CPU_CYCLES(0): 7444
export PERF_COUNT_HW_CPU_CYCLES=1; LD_PRELOAD=self-profile.so ./bsearch
Sorting...
PERF_COUNT_HW_CPU_CYCLES(0): 6779

Signed-off-by: Yunseong Kim <[email protected]>

yskelg (Author) commented Aug 9, 2024

Thank you @ThinkOpenly for your comments, which have helped me to articulate the self-profiling project more clearly.

One of the key strengths of this project, in my opinion, is the ability to focus profiling specifically on the code where it's needed most.

If this is used alongside production code, I believe activation could be split between macros and build options, similar to how static tracepoints are activated with ftrace in the Linux kernel.
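
As a rough illustration of the build-option half of that idea, the sketch below compiles the instrumentation in only when a macro is defined. The `SKETCH_PROFILING` option and all `sketch_*` names are hypothetical and not part of this project.

```c
#include <stddef.h>

unsigned long sketch_regions;   /* how many instrumented regions ran */

void sketch_region_enter(void) { sketch_regions++; }
void sketch_region_leave(void) { /* a real version would read counters here */ }

/* Build with -DSKETCH_PROFILING to compile the instrumentation in.
 * Without it, the macros expand to nothing and cost nothing at run time. */
#ifdef SKETCH_PROFILING
enum { SKETCH_ENABLED = 1 };
#define SKETCH_START() sketch_region_enter()
#define SKETCH_STOP()  sketch_region_leave()
#else
enum { SKETCH_ENABLED = 0 };
#define SKETCH_START() ((void)0)
#define SKETCH_STOP()  ((void)0)
#endif

/* Example of production code carrying the instrumentation points. */
long sketch_sum(const int *v, size_t n)
{
    long total = 0;

    SKETCH_START();
    for (size_t i = 0; i < n; i++)
        total += v[i];
    SKETCH_STOP();
    return total;
}
```

This differs from kernel static tracepoints in one respect: tracepoints stay compiled in and are switched on at run time, while this sketch decides at build time only. A run-time switch would need a flag check inside the enabled macros.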

This project has reminded me of the importance of understanding the underlying principles to explore new approaches, rather than always relying on existing tools passively.

> Do you see significant advantage to using LD_PRELOAD in place of PROFILE_BEGIN/PROFILE_END?

My focus is on portability to other executable programs. My PR is a proof of concept based on what I've implemented so far, and I'm happy to update it with any additional ideas you might have. In #2, I implemented the ability to measure the original main function.

> Are you suggesting there is a way to avoid having to instrument/compile/link by using LD_PRELOAD?

If there is a specific function the user wants to profile, this implementation allows for that, in the same way as the main function in self-profile.c. For now, I have focused on reusing the main function.

P.S.
If there are any additional features you'd like to see implemented, please feel free to open an issue or leave a comment.

Once again, thank you for the inspiration.
