Feature implementation from commits 0e7dadb..6197e9f #1
yashuatla wants to merge 13 commits into feature-base-1 from
Conversation
add two FAQs for Windows build questions.
add third-party demo
refine readme
Fix model architecture name
…crosoft#204)
* Update CMakeLists.txt: add a CMake option to compile the llama.cpp server, so the server can be built and deployed with BitNet.
* Create run_inference_server.py: same as run_inference, but for llama.cpp's built-in server. In particular:
  - The build directory is chosen based on whether the system is running on Windows.
  - A list of arguments (`--model`, `-m`, etc.) is assembled.
  - The argument list is parsed and passed to `subprocess.run()` to execute the command.
…d/aarch64 (microsoft#242) GCC does not recognize Clang-specific warning flags like -Wunreachable-code-break and -Wunreachable-code-return, which are passed by upstream submodules (e.g., ggml). This patch forces CMake to use Clang via command-line arguments, avoiding the need to patch nested submodules. This resolves compiler errors without modifying submodule source code.
def from_name(cls, name: str):
    if name in transformer_configs:
        return cls(**transformer_configs[name])
    config = [k for k in transformer_configs if k in name.upper() or k in name]
🐛 Correctness Issue
Ambiguous model name matching logic.
The current substring matching can select the wrong config if one config name is a substring of another, causing runtime errors or incorrect model conversion.
Current Code (Diff):
- config = [k for k in transformer_configs if k in name.upper() or k in name]
+ config = [k for k in transformer_configs if k == name.upper() or k == name]
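For context, a minimal sketch of why the substring match is ambiguous; the config names below are hypothetical, not the repository's real table, and the longest-match tie-break at the end is just one way to disambiguate:

```python
# Hypothetical config table, not the repository's real one; one key is a
# substring of the other.
transformer_configs = {
    "7B": {"dim": 4096},
    "2-7B": {"dim": 4096, "rope_base": 10000.0},
}

name = "llama-2-7b"

# Substring matching: both keys hit, so the chosen config depends on dict order.
matches = [k for k in transformer_configs if k in name.upper() or k in name]
print(matches)  # ['7B', '2-7B']

# One common disambiguation is to prefer the longest matching key.
print(max(matches, key=len))  # '2-7B'
```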
    help="Path to input .safetensors file"
)
parser.add_argument(
    "--output", type=str, default="./checkpoints/model_state.pt",
🐛 Correctness Issue
Default output path assumes directory exists.
The default output path assumes './checkpoints' directory exists, which will cause file write failures if the directory doesn't exist.
Current Code (Diff):
- "--output", type=str, default="./checkpoints/model_state.pt",
+ "--output", type=str, default="./model_state.pt",
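An alternative to changing the default is to keep the './checkpoints/...' path and create the parent directory before saving; a minimal sketch, assuming the path comes from the parsed --output argument:

```python
import os

import torch

def save_checkpoint(state_dict: dict, output_path: str) -> None:
    # Create the parent directory (e.g. ./checkpoints) if it does not exist,
    # so the default path works on a fresh clone.
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    torch.save(state_dict, output_path)
```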
model_args_prefill = fast.ModelArgs(use_kernel=False)
model_args_decode = fast.ModelArgs(use_kernel=True)
tokenizer = Tokenizer("./tokenizer.model")
🐛 Correctness Issue
Hardcoded tokenizer path ignores parameter.
The code hardcodes the tokenizer path to './tokenizer.model' despite having an optional tokenizer_path parameter, which will cause failures if the tokenizer isn't in the expected location.
Current Code (Diff):
- tokenizer = Tokenizer("./tokenizer.model")
+ tokenizer = Tokenizer(tokenizer_path if tokenizer_path is not None else "./tokenizer.model")
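The same fallback can also be expressed once in a small helper so the loader body only ever sees a concrete path; a minimal sketch with an illustrative helper name:

```python
from typing import Optional

DEFAULT_TOKENIZER_PATH = "./tokenizer.model"

def resolve_tokenizer_path(tokenizer_path: Optional[str] = None) -> str:
    # Prefer the caller-supplied path; only fall back to the conventional default.
    return tokenizer_path if tokenizer_path is not None else DEFAULT_TOKENIZER_PATH

print(resolve_tokenizer_path())                   # ./tokenizer.model
print(resolve_tokenizer_path("/data/tok.model"))  # /data/tok.model
```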
eos_id = self.tokenizer.eot_id
for niter in range(1, gen_length):
    kv_seqlen.add_(kv_seqlen < max_seq_length)
🐛 Correctness Issue
Incorrect tensor operation.
The expression kv_seqlen.add_(kv_seqlen < max_seq_length) adds a boolean mask to kv_seqlen in place, relying on implicit bool-to-integer promotion. An explicit cast makes the intended clamped increment unambiguous and avoids dtype-promotion surprises.
Current Code (Diff):
- kv_seqlen.add_(kv_seqlen < max_seq_length)
+ kv_seqlen.add_((kv_seqlen < max_seq_length).to(torch.int))
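For reference, a small self-contained example of the clamped-increment pattern with an explicit cast; the tensor values are illustrative:

```python
import torch

max_seq_length = 4
kv_seqlen = torch.tensor([2, 3, 4])  # per-sequence KV lengths (illustrative)

# Bump each length by 1 only while it is still below max_seq_length.
# Casting the comparison to kv_seqlen's dtype avoids relying on implicit
# bool -> int promotion inside the in-place add.
kv_seqlen.add_((kv_seqlen < max_seq_length).to(kv_seqlen.dtype))
print(kv_seqlen)  # tensor([3, 4, 4])
```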
if use_sampling:
    temp = 0.7
    top_p = 0.95
    probs = torch.softmax(logits / temp, dim=-1)
    next_token = sample_utils.top_p(probs, top_p)
🐛 Correctness Issue
Hardcoded sampling parameters override GenArgs.
The code hardcodes temperature and top_p values instead of using the ones from GenArgs, making those parameters in GenArgs effectively unused.
Current Code (Diff):
- temp = 0.7
- top_p = 0.95
- probs = torch.softmax(logits / temp, dim=-1)
- next_token = sample_utils.top_p(probs, top_p)
+ temp = self.gen_args.temperature
+ top_p = self.gen_args.top_p
+ probs = torch.softmax(logits / temp, dim=-1)
+ next_token = sample_utils.top_p(probs, top_p)
if use_sampling:
    temp = 0.7
    top_p = 0.95
    probs = torch.softmax(logits / temp, dim=-1)
    next_token = sample_utils.top_p(probs, top_p)
🐛 Correctness Issue
Duplicate hardcoded sampling parameters.
The code duplicates the hardcoded temperature and top_p values in the generation loop, ignoring the GenArgs parameters.
Current Code (Diff):
- temp = 0.7
- top_p = 0.95
- probs = torch.softmax(logits / temp, dim=-1)
- next_token = sample_utils.top_p(probs, top_p)
+ temp = self.gen_args.temperature
+ top_p = self.gen_args.top_p
+ probs = torch.softmax(logits / temp, dim=-1)
+ next_token = sample_utils.top_p(probs, top_p)
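Both findings describe the same pattern: the sampling knobs should flow from GenArgs into the loop instead of being re-declared as literals. A minimal sketch of that flow; the field names follow the suggested diff, while the defaults and the function here are illustrative (the last line is a greedy stand-in for the repository's sample_utils.top_p):

```python
from dataclasses import dataclass

import torch

@dataclass
class GenArgs:
    temperature: float = 0.7
    top_p: float = 0.95

def sample_next_token(logits: torch.Tensor, gen_args: GenArgs) -> torch.Tensor:
    # Read the knobs from gen_args so CLI/config overrides actually take effect.
    probs = torch.softmax(logits / gen_args.temperature, dim=-1)
    # Greedy stand-in for the real top-p sampler.
    return torch.argmax(probs, dim=-1)
```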
)

import ctypes
bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
🐛 Correctness Issue
Unsafe Library Loading Without Error Handling.
Loading a shared library with a hardcoded path without error handling will crash the application if the library is missing or incompatible.
Current Code (Diff):
- bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ try:
+     bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ except OSError as e:
+     raise ImportError(f"Failed to load BitNet library: {e}. Please ensure 'bitnet_kernels/libbitnet.so' is available.")
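A slightly more defensive variant resolves the library relative to the source file rather than the working directory and fails with an actionable message; a sketch, with the error text being illustrative:

```python
import ctypes
import os

# Resolve the kernel library next to this file instead of relying on the
# current working directory, then fail loudly if it cannot be loaded.
_LIB_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                         "bitnet_kernels", "libbitnet.so")

try:
    bitnet_lib = ctypes.CDLL(_LIB_PATH)
except OSError as exc:
    raise ImportError(
        f"Failed to load BitNet kernels from {_LIB_PATH}: {exc}. "
        "Build the kernels before running GPU inference."
    ) from exc
```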
    return int8_weight


def interleave_weight_int8(qweight, nbits=2):\
🐛 Correctness Issue
Syntax Error: Trailing backslash in function definition.
The function definition has a trailing backslash that will cause a Python syntax error when the code is executed, breaking any dependent code.
Current Code (Diff):
- def interleave_weight_int8(qweight, nbits=2):\
+ def interleave_weight_int8(qweight, nbits=2):
🔄 Dependencies Affected
gpu/convert_checkpoint.py
Function: convert_int8_to_int2
Issue: This function calls convert_weight_int8_to_int2 which uses interleave_weight_int8, and will fail due to the syntax error
Suggestion: No change needed in this file, fix the syntax error in the original function
gpu/test.py
Function: Unknown
Issue: Line 44 uses convert_weight_int8_to_int2 which calls interleave_weight_int8, and will fail due to the syntax error
Suggestion: No change needed in this file, fix the syntax error in the original function
    """
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    mask = probs_sum - probs_sort > p
🐛 Correctness Issue
Incorrect mask logic for top-p sampling.
The condition probs_sum - probs_sort > p keeps the token whose probability pushes the cumulative sum past p, so the sampling pool can extend one token beyond the intended nucleus; cutting strictly at p excludes that boundary token and keeps generation within the requested probability mass.
Current Code (Diff):
- mask = probs_sum - probs_sort > p
+ mask = probs_sum > p
🔄 Dependencies Affected
gpu/generate.py
Function: generate_text
Issue: Text generation will produce incorrect results due to improper token filtering in top_p sampling
Suggestion: No change needed in the dependency, fix the implementation in sample_utils.py
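A toy example (numbers invented for illustration) makes the difference concrete: the two conditions disagree only about the token whose probability pushes the cumulative sum past p. The subtraction form keeps that boundary token, which is the convention used in Meta's Llama reference sampler, while the suggested form cuts strictly at p:

```python
import torch

p = 0.9
probs_sort = torch.tensor([0.5, 0.3, 0.15, 0.05])  # already sorted descending
probs_sum = torch.cumsum(probs_sort, dim=-1)        # [0.50, 0.80, 0.95, 1.00]

mask_keep_boundary = probs_sum - probs_sort > p     # [False, False, False, True]
mask_strict = probs_sum > p                         # [False, False, True, True]

# Keep-boundary: the 0.15 token stays in the pool; strict: it is zeroed out too.
print(mask_keep_boundary.tolist(), mask_strict.tolist())
```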
    time: float

    def show(self) -> str:
        tps = self.tokens / self.time
🐛 Correctness Issue
Division by Zero Risk.
The code divides by self.time without checking if it's zero, which would cause a runtime crash when displaying statistics.
Current Code (Diff):
- tps = self.tokens / self.time
+ tps = self.tokens / self.time if self.time > 0 else 0
torch.manual_seed(42)
np.random.seed(42)

bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
🐛 Correctness Issue
No error handling for C library loading.
Loading the C library without error handling will crash the application if the library file doesn't exist at the specified path.
Current Code (Diff):
- bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ try:
+     bitnet_lib = ctypes.CDLL('bitnet_kernels/libbitnet.so')
+ except OSError as e:
+     raise RuntimeError(f"Failed to load BitNet library: {e}")
N = input1.shape[0]
K = input1.shape[1] * 4

bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
🐛 Correctness Issue
No error handling for C library function call.
The C library function call has no error checking, which could lead to silent failures or crashes if the function encounters an error.
Current Code (Diff):
- bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
+ result = bitnet_lib.bitlinear_int8xint2(*[ctypes.c_void_p(input0.data_ptr()), ctypes.c_void_p(input1.data_ptr()), ctypes.c_void_p(ret.data_ptr()), ctypes.c_void_p(s.data_ptr()), ctypes.c_void_p(ws.data_ptr()), ctypes.c_int(M), ctypes.c_int(N), ctypes.c_int(K), ctypes.c_void_p(stream.cuda_stream)])
+ if result != 0: # Assuming 0 is success, adjust based on actual API
+     raise RuntimeError(f"BitNet library call failed with error code: {result}")
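Beyond checking a return value, declaring the foreign function's signature up front lets ctypes validate arguments and makes any status code visible; a sketch under the assumption that bitlinear_int8xint2 returns an int status (if it is actually void, set restype to None and rely on CUDA error checking instead):

```python
import ctypes

bitnet_lib = ctypes.CDLL("bitnet_kernels/libbitnet.so")

# Five pointers (input0, input1, ret, s, ws), three ints (M, N, K), and one
# stream pointer, matching the call site above.
bitnet_lib.bitlinear_int8xint2.argtypes = (
    [ctypes.c_void_p] * 5 + [ctypes.c_int] * 3 + [ctypes.c_void_p]
)
bitnet_lib.bitlinear_int8xint2.restype = ctypes.c_int  # assumed status code
```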
        # tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
        return tokens

    def encode_message(self, message: Message, return_target=False) -> List[int]:
🐛 Correctness Issue
Incorrect return type annotation.
The function is annotated to return List[int], but when return_target=True it returns a tuple (List[int], Optional[List[int]]); the annotation therefore misleads type checkers and callers.
Current Code (Diff):
- def encode_message(self, message: Message, return_target=False) -> List[int]:
+ def encode_message(self, message: Message, return_target=False) -> Union[List[int], tuple[List[int], Union[List[int], None]]]:
        return tokens, None

    def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> List[int]:
🐛 Correctness Issue
Incorrect return type annotation.
The function is annotated to return List[int] but can also return a tuple (List[int], List[int]) when return_target=True, which could cause type errors.
Current Code (Diff):
- def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> List[int]:
+ def encode_dialog_prompt(self, dialog: Dialog, completion=False, return_target=False) -> Union[List[int], tuple[List[int], List[int]]]:
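If the flag-dependent return type should stay visible to type checkers, typing.overload expresses it more precisely than a plain Union; a self-contained sketch in which Message is a placeholder type and the body is a stand-in for the real encoding:

```python
from typing import List, Literal, Optional, Tuple, Union, overload

Message = dict  # placeholder for the real Message type

@overload
def encode_message(message: Message, return_target: Literal[False] = ...) -> List[int]: ...
@overload
def encode_message(message: Message, return_target: Literal[True]) -> Tuple[List[int], Optional[List[int]]]: ...

def encode_message(
    message: Message, return_target: bool = False
) -> Union[List[int], Tuple[List[int], Optional[List[int]]]]:
    tokens: List[int] = [1, 2, 3]  # stand-in for the real token encoding
    return (tokens, None) if return_target else tokens
```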
    if not os.path.exists(server_path):
        server_path = os.path.join(build_dir, "bin", "llama-server")
    else:
        server_path = os.path.join(build_dir, "bin", "llama-server")
🐛 Correctness Issue
Missing server executable validation.
Both branches of the existence check assign the same path, so the script never actually verifies that the server executable exists before launching it; a missing binary only surfaces as a confusing subprocess failure.
Current Code (Diff):
- server_path = os.path.join(build_dir, "bin", "llama-server")
+ server_path = os.path.join(build_dir, "bin", "llama-server")
+
+ if not os.path.exists(server_path):
+     print(f"Error: Server executable not found at {server_path}")
+     sys.exit(1)
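A sketch of a check that actually guards the launch, including the .exe suffix on Windows; the helper name and message text are illustrative:

```python
import os
import platform
import sys

def resolve_server_path(build_dir: str) -> str:
    # Pick the platform-specific binary name, then fail fast with a clear
    # message if the server has not been built yet.
    exe_name = "llama-server.exe" if platform.system() == "Windows" else "llama-server"
    server_path = os.path.join(build_dir, "bin", exe_name)
    if not os.path.exists(server_path):
        print(f"Error: server executable not found at {server_path}. "
              "Build the project before running the server.")
        sys.exit(1)
    return server_path
```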
        exit(0)
    logging.info("Compiling the code using CMake.")
    run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), [])], log_step="generate_build_files")
    run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), []), "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"], log_step="generate_build_files")
🐛 Correctness Issue
Hardcoded compiler breaks cross-platform compatibility.
Forcing clang/clang++ as compilers will break builds on systems where these compilers aren't available or where specific compiler requirements exist.
Current Code (Diff):
- run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), []), "-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"], log_step="generate_build_files")
+ run_command(["cmake", "-B", "build", *COMPILER_EXTRA_ARGS[arch], *OS_EXTRA_ARGS.get(platform.system(), [])], log_step="generate_build_files")
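One way to reconcile this with the aarch64 fix described in the commit message (GCC rejecting Clang-only warning flags) is to pass the Clang overrides only when Clang is actually on PATH; a sketch, not the repository's current logic:

```python
import shutil

def clang_override_args() -> list:
    # Prefer Clang when available (it accepts the Clang-specific warning flags
    # forwarded by the submodules); otherwise leave the compiler choice to CMake.
    if shutil.which("clang") and shutil.which("clang++"):
        return ["-DCMAKE_C_COMPILER=clang", "-DCMAKE_CXX_COMPILER=clang++"]
    return []

cmake_cmd = ["cmake", "-B", "build", *clang_override_args()]
```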
This PR contains changes from a range of commits from the original repository.
Commit Range: 0e7dadb..6197e9f
Files Changed: 22 (13 programming files)
Programming Ratio: 59.1%
Commits included:
... and 3 more commits