Run koboldcpp.exe and select a model, OR run koboldcpp.exe from the command line (same as above) after changing into your folder (cd your-llamacpp-folder).

I can't seem to find documentation anywhere on the net, so here is what I've pieced together about KoboldCpp.

KoboldCpp is basically llama.cpp, and adds a versatile Kobold API endpoint, additional format support, backward compatibility, as well as a fancy UI with persistent stories, editing tools, save formats, memory, world info, author's note, characters, and scenarios. A web UI comes bundled together with KoboldCPP, and you can save the memory/story file from it. To help answer the commonly asked questions and issues regarding KoboldCpp and ggml, a comprehensive FAQ resource has been assembled. LoRa support (#96) is on the way; support is expected to come over the next few days. The Concedo-llamacpp entry on Hugging Face is only a placeholder model used for the llama.cpp-powered KoboldAI API emulator by Concedo.

Getting started is really easy: download the koboldcpp.exe file from GitHub, create a new folder on your PC, and launch Koboldcpp. Launching with no command line arguments displays a GUI containing a subset of configurable settings; otherwise, you will be prompted to manually select a ggml file. Alternatively, on Win10 you can open the KoboldAI folder in Explorer, Shift+Right click on empty space in the folder window, and pick 'Open PowerShell window here'. For more information, be sure to run the program with the --help flag.

Decide on your model. All Pygmalion base models and fine-tunes (models built off of the original) are supported; a fine-tune will inherit some NSFW behaviour from its base model and may have softer NSFW training still within it. GPT-2 is supported too (all versions, including legacy f16, the newer format + quantized, and Cerebras), with OpenBLAS acceleration only for the newer format. The current version of KoboldCPP supports 8k context, but it isn't intuitive how to set it up.

Some impressions from testing (CPU: Intel i7-12700, on a KoboldCpp 1.23 beta build): I've tried all the popular backends, and I've settled on KoboldCPP as the one that does what I want the best, though I've run into two problems with it that are just annoying enough to make me consider trying another option. In koboldcpp generation is a bit faster, but it has missing features compared to the text-generation webui, and before this update even the 30B was fast for me, so I'm not sure what happened. I thought it was supposed to use more RAM, but instead it goes full juice on my CPU and still ends up being that slow; one reported failure mode is that it pops up, dumps a bunch of text, then closes immediately. If you are chasing performance, compare timings against plain llama.cpp (just copy the output from the console when building and linking). As for samplers, instead of top_p I use a fork of Kobold AI with tail free sampling (tfs) support, and in my opinion it produces much better results than top_p. I have max tokens set at 200 and it uses up the full length every time, sometimes writing lines for me as well. Since my machine is at the lower end, the wait time doesn't feel that long if you can see the answer developing. I would also like to see koboldcpp's language model dataset for chat and scenarios.
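As a minimal sketch of the getting-started step above (assuming a Windows build; the folder path and model filename are placeholders, not from the original text):

    # launch the GUI by double-clicking koboldcpp.exe, or start it from PowerShell:
    cd C:\path\to\koboldcpp                                # hypothetical install folder
    .\koboldcpp.exe --help                                 # list all command line arguments
    .\koboldcpp.exe models\pygmalion-6b.ggmlv3.q5_1.bin    # load a ggml model directly

Passing a model file on the command line skips the file picker; with no arguments you get the launcher GUI instead.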
KoboldCPP is a roleplaying program that allows you to use GGML AI models, and performance is largely dependent on your CPU+RAM. To run, execute koboldcpp.exe and then connect with Kobold or Kobold Lite; double clicking KoboldCPP.exe is enough, or alternatively drag and drop a compatible ggml model on top of the .exe. From the command line you can do something like python koboldcpp.py --threads 2 --nommap --useclblast 0 0 models/nous-hermes-13b (fill in the exact filename of your quantized model). On Android/Termux, pkg install python first. It also has a lightweight dashboard for managing your own horde workers, and this community's purpose is to bridge the gap between the developers and the end-users.

Koboldcpp is an amazing solution that lets people run GGML models: it allows you to run those great models we have been enjoying for our own chatbots without having to rely on expensive hardware, as long as you have a bit of patience waiting for the replies. People report running 13B and 30B models on a PC with a 12 GB NVIDIA RTX 3060; another setup is an RX 6600 XT 8 GB GPU and a 4-core i3-9100F CPU with 16 GB system RAM using a 13B model (chronos-hermes-13b). The Author's Note appears in the middle of the text and can be shifted by selecting the strength. Once it reaches its token limit, it will print the tokens it had generated. Mythomax doesn't like the roleplay preset if you use it as-is; the parentheses in the response instruct seem to influence it to try to use them more.

Comparisons and caveats: Oobabooga's UI has gotten bloated and recent updates throw errors, with a 7B 4-bit GPTQ running out of memory; KoboldCpp works and oobabooga doesn't, so I choose not to look back. GPTQ-triton runs faster (one report cites 16 tokens per second on a 30B), but it also requires autotune. Note that some newly quantized file formats will NOT be compatible with koboldcpp, text-generation-webui, and other UIs and libraries yet. Thanks to u/ruryruy's invaluable help, I was able to recompile llama-cpp-python manually using Visual Studio and then simply replace the DLL in my Conda env. The build file is set up to add CLBlast and OpenBLAS too; you can remove those lines if you only want the plain build. Since the latest release added support for cuBLAS, is there any chance of adding CLBlast? Koboldcpp (which, as I understand, also uses llama.cpp) already has it, so it shouldn't be that hard.

Version notes and issues: KoboldCpp 1.43 is just an updated experimental release cooked for my own use and shared with the adventurous, or those who want more context size under Nvidia CUDA mmq, until LlamaCPP moves to a quantized KV cache that can also integrate with the accessory buffers. I finally managed to make the unofficial version work; it's a limited build that only supports the GPT-Neo Horni model, but otherwise contains most features of the official version. (P.S. I run koboldcpp on both PC and laptop and noticed a significant performance downgrade on the PC after updating.) When I use the working koboldcpp_cublas.dll, the console shows "Loading model: C:\Users\Matthew\Desktop\smarts\ggml-model-stablelm-tuned-alpha-7b-q4_0.bin". One reported failure assigns all layers to N/A | 0 | (Disk cache) and N/A | 0 | (CPU) and then returns: "RuntimeError: One of your GPUs ran out of memory when KoboldAI tried to load your model."
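Since the Kobold API endpoint comes up repeatedly above, here is a minimal sketch of calling it directly, assuming KoboldCpp is already running on its default port 5001 (the prompt and sampler values below are placeholder examples):

    # ask the running KoboldCpp server for a short completion over its KoboldAI-compatible API
    curl http://localhost:5001/api/v1/generate -H "Content-Type: application/json" -d "{\"prompt\": \"Once upon a time,\", \"max_length\": 80, \"temperature\": 0.7}"

The reply should come back as JSON containing the generated text, which is the same endpoint that front ends like SillyTavern talk to.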
Make sure Airoboros-7B-SuperHOT is run with the following parameters: --wbits 4 --groupsize 128 --model_type llama --trust-remote-code --api. KoboldCpp supports CLBlast and OpenBLAS acceleration for all versions, runs on Ubuntu LTS as well as Windows, and has both an NVIDIA CUDA and a generic OpenCL/ROCm build. It's a single self-contained distributable from Concedo that builds off llama.cpp, combining the various ggml projects with a WebUI and API: it offers the same functionality as KoboldAI, but uses your CPU and RAM instead of your GPU; it is very simple to set up on Windows (it must be compiled from source on macOS and Linux) and is slower than GPU APIs. There is also the Kobold Horde (see GitHub). From persistent stories and efficient editing tools to flexible save formats and convenient memory management, KoboldCpp has it all. w64devkit, the toolchain used for Windows builds, is a Dockerfile that builds from source a small, portable development suite for creating C and C++ applications on and for x64 Windows. Run the provided .bat as administrator if required.

This release brings an exciting new feature, --smartcontext: this mode provides a way of prompt context manipulation that avoids frequent context recalculation. The new implementation of context shifting is inspired by the upstream one, but because their solution isn't meant for the more advanced use cases people often do in Koboldcpp (Memory, character cards, etc.), we had to deviate. The FAQ covers everything from "how to extend context past 2048 with rope scaling", "what is smartcontext", "EOS tokens and how to unban them", "what's mirostat", and "using the command line" to sampler orders and types, stop sequences, KoboldAI API endpoints and more. And it works! See the linked (genius) comment for details. I have --useclblast 0 0 for my 3080, but your arguments might be different depending on your hardware configuration. One issue reports koboldcpp processing the prompt without BLAS much faster, even though the console prints "Attempting to use OpenBLAS library for faster prompt ingestion".

**So What is SillyTavern?** Tavern is a user interface you can install on your computer (and Android phones) that allows you to interact with text generation AIs and chat/roleplay with characters you or the community create. A recent changelog lists: custom --grammar support [for koboldcpp] by @kalomaze in #1161; a quick and dirty stat re-creator button by @city-unit in #1164; update readme.md by @city-unit in #1165; a custom CSS box in the UI Theme settings by @digiwombat in #1166; staging by @Cohee1207 in #1168; and a first contribution from @Hakirus in #1113.

Model notes: download the 3B, 7B, or 13B model from Hugging Face. RWKV is an RNN with transformer-level LLM performance. I did some testing (2 tests each, just in case), mostly 7B models at 8_0 quant; apparently it's good - very good! Launch Koboldcpp and load it with a Pygmalion model in ggml/ggjt format, then hit the Settings button. Sometimes even just bringing up a vaguely sensual keyword like belt, throat, tongue, etc. can get it going in a NSFW direction; recommendations are sought for models, preferably those focused around hypnosis, transformation, and possession. So OP might be able to try that. A common setup is to launch koboldcpp in streaming mode, load an 8k SuperHOT variant of a 4-bit quantized ggml model, and split it between the GPU and CPU (a sketch of such a command follows below).
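A minimal sketch of that kind of launch, assuming a hypothetical 4-bit SuperHOT-8k ggml file and a machine where only part of the model fits in VRAM (the filename, layer count, and rope scale are placeholders to adapt to your model):

    # stream tokens as they generate, use CLBlast on platform 0 / device 0, offload 20 layers to the GPU,
    # keep the rest on the CPU, and extend context to 8192 with linear rope scaling for a SuperHOT model
    koboldcpp.exe --stream --useclblast 0 0 --gpulayers 20 --smartcontext --contextsize 8192 --ropeconfig 0.25 10000 airoboros-7b-superhot-8k.ggmlv3.q4_K_M.bin

The 0.25 scale here assumes a linearly-scaled SuperHOT model (2048/8192); models trained differently need different rope values, so check the model card.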
It's a single self-contained distributable; development is very rapid, so there are no tagged versions as of now, and it uses your RAM and CPU but can also use GPU acceleration. Even KoboldCpp's Usage section says: "To run, execute koboldcpp.exe, and then connect with Kobold or Kobold Lite." CPU version: download and install the latest version of KoboldCPP. I have both Koboldcpp and SillyTavern installed from Termux (after a pkg upgrade). If you open up the web interface at localhost:5001 (or whatever), hit the Settings button and, at the bottom of the dialog box, for 'Format' select 'Instruct Mode'. You can select a model from the dropdown. Paste the summary after the last sentence. Each token is estimated to be about 3-4 characters.

SuperHOT is a new system that employs RoPE to expand context beyond what was originally possible for a model. I know this isn't really new, but I don't see it being discussed much either, and I think it has potential for storywriters. Pygmalion 2 7B and Pygmalion 2 13B are chat/roleplay models based on Meta's Llama 2. But currently there's a known issue with that and koboldcpp regarding the sampler order used in the proxy presets (a PR for the fix is waiting to be merged; until it's merged, manually changing the presets may be required). It seems that streaming works only in the normal story mode, but stops working once I change into chat mode. The only caveat is that, unless something's changed recently, koboldcpp won't be able to use your GPU if you're using a lora file.

Performance reports (from "A look at the current state of running large language models at home"): to comfortably run it locally, you'll need a graphics card with 16GB of VRAM or more. A general KoboldCpp question for a Vega VII on Windows 11: is 5% GPU usage normal? My video memory is full and it puts out like 2-3 tokens per second when using wizardLM-13B-Uncensored; I carefully followed the README, and I'm not sure if I should try a different kernel or distro, or even consider doing it in Windows (see also "Please Help", issue #297 on LostRuins/koboldcpp). Another report: low VRAM option enabled, offloading 27 layers to GPU, batch size 256, smart context off. One user notes that with llama.cpp (through koboldcpp), the context-related VRAM occupation growth becomes normal again in the present experimental KoboldCPP build. Better Apple Silicon support would be a very special present for those computer users.

Example invocations seen in the wild include koboldcpp --gpulayers 31 --useclblast 0 0 --smartcontext --psutil_set_threads and koboldcpp.exe --useclblast 0 0 --gpulayers 50 --contextsize 2048 (after which the console prints its "Welcome to KoboldCpp" banner). One guide has you copy its launch script into a file named "run", a Windows batch file judging by the timeout /t 2 >nul and echo. commands it contains; a rough sketch of such a script follows below.
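The original script is not reproduced in the source, so this is only a hedged sketch of what such a run script might look like; the model filename and flags are placeholders, and only the timeout /t 2 >nul and echo. lines come from the guide itself:

    @echo off
    echo Starting KoboldCpp...
    echo.
    rem wait two seconds without echoing a countdown prompt
    timeout /t 2 >nul
    rem launch with CLBlast acceleration and smartcontext; adjust flags for your hardware
    koboldcpp.exe --useclblast 0 0 --gpulayers 31 --smartcontext mythomax-l2-13b.ggmlv3.q5_1.bin
    pause

Saving something like this as run.bat gives you a one-click launcher with your preferred arguments baked in.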
Pyg 6b was great: I ran it through koboldcpp and then SillyTavern so I could make my characters how I wanted (there's also a good Pyg 6b preset in SillyTavern's settings). I've used gpt4-x-alpaca-native-13B-ggml the most for stories, but you can find other ggml models at Hugging Face. Mythalion 13B is a merge between Pygmalion 2 and Gryphe's MythoMax. You can use it to write stories, blog posts, play a text adventure game, use it like a chatbot and more! In some cases it might even help you with an assignment or programming task (but always double-check what it tells you).

It builds on llama.cpp, offering a lightweight and super fast way to run various LLaMA models. You can download the latest version from the releases page on GitHub; after finishing the download, move it into the folder you created earlier. A compatible clblast.dll will be required. A commit, "use weights_only in conversion script" (LostRuins#32), restricts malicious weights from executing arbitrary code by restricting the unpickler to only loading tensors, primitive types, and dictionaries. There is also an official KoboldCpp Colab notebook, and the hosted front end greets you with "Welcome to KoboldAI Lite!" (at the time of writing there were 27 total volunteers in the KoboldAI Horde, and 65 requests in queues). On startup the console prints something like: "Welcome to KoboldCpp - Version 1.18. For command line arguments, please refer to --help. Otherwise, please manually select ggml file: Attempting to use OpenBLAS library for faster prompt ingestion."

Offloading and performance: change --gpulayers 100 to the number of layers you want or are able to offload. You could run a 13B like that, but it would be slower than a model run purely on the GPU. My tokens per second is decent, but once you factor in the insane amount of time it takes to process the prompt every time I send a message, it drops to being abysmal. Oh, and one thing I noticed: the consistency and "always in French" understanding is vastly better on my Linux computer than on my Windows one. Looks like an almost 45% reduction in requirements. So, I found a PyTorch package that can run on Windows with an AMD GPU (pytorch-directml) and was wondering if it would work in KoboldAI. Koboldcpp on AMD GPUs/Windows, settings question: using the Easy Launcher, there are some setting names that aren't very intuitive. One full example invocation: koboldcpp.exe --blasbatchsize 2048 --contextsize 4096 --highpriority --nommap --ropeconfig 1.0 10000 --unbantokens --useclblast 0 0 --usemlock --model (followed by your model file).

Known issues: SillyTavern will "lose connection" with the API every so often, and sometimes the backend crashes halfway during generation. A reported failure chain: the API is down (causing issue 1); streaming isn't supported because it can't get the version (causing issue 2); and stop sequences aren't sent to the API, again because it can't get the version (causing issue 3). If you get inaccurate results or wish to experiment, you can set an override tokenizer for SillyTavern to use while forming a request to the AI backend (default: None). The way sampling works is that every possible token has a probability percentage attached to it, so please make them available during inference for text generation. You need to use the right platform and device id from clinfo!
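To illustrate that clinfo point: a minimal sketch of finding the right ids (the listing below is an invented example; your platforms, devices, and numbering will differ):

    # list OpenCL platforms and devices; note the platform index and device index you want
    clinfo -l
    # hypothetical output:
    #   Platform #0: AMD Accelerated Parallel Processing
    #    `-- Device #0: gfx1032 (RX 6600 XT)
    #   Platform #1: Intel(R) OpenCL Graphics
    #    `-- Device #0: Intel(R) UHD Graphics
    # pass those two numbers to koboldcpp as --useclblast <platform_id> <device_id>
    koboldcpp.exe --useclblast 0 0 --gpulayers 31 --smartcontext your-model.ggmlv3.q5_1.bin

Picking the wrong pair usually means CLBlast silently falls back to a device you didn't intend, so it is worth checking once per machine.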
The easy launcher which appears when running koboldcpp without arguments may not set this up automatically, as in my case. (Kobold also seems to generate only a specific amount of tokens.) Setting up Koboldcpp: download Koboldcpp and put the .exe in its own folder, or extract the .zip to a location where you wish to install KoboldAI; you will need roughly 20GB of free space for the installation (this does not include the models). It is a simple one-file way to run various GGML and GGUF models with KoboldAI's UI: llama.cpp (a lightweight and fast solution to running 4-bit quantized models) with the Kobold Lite UI, integrated into a single binary. It will only run GGML models, though. You can also run it using the command line, for example koboldcpp.exe --useclblast 0 1. So if you want GPU accelerated prompt ingestion, you need to add the --useclblast option with arguments for platform id and device; Koboldcpp can use even an RX 580 for processing prompts (but not generating responses) because it can use CLBlast. If you want to use a lora with koboldcpp (llama.cpp) 'and' your GPU, you'll need to go through the process of actually merging the lora into the base llama model and then creating a new quantized bin file from it. Hit Launch when your settings are ready.

When you download KoboldAI it runs in the terminal, and once it's on the last step you'll see a screen with purple and green text next to where it says __main__:general_startup. I'm not super technical but I managed to get everything installed and working (sort of). When I want to update SillyTavern I go into the folder and just run "git pull", but with Koboldcpp I can't do the same. When using the wizardlm-30b-uncensored bin model from Hugging Face with koboldcpp, I found out unexpectedly that adding useclblast and gpulayers results in much slower token output speed. It's possible to set up GGML streaming by other means, but it's also a major pain: you either have to deal with quirky and unreliable Unga and navigate through their bugs, compile llamacpp-for-python with CLBlast or CUDA compatibility yourself if you actually want adequate GGML performance, or use reliable KoboldCpp. I have an i7-12700H, with 14 cores and 20 logical processors.

Currently KoboldCPP is unable to stop inference when an EOS token is emitted, which causes the model to devolve into gibberish; Pygmalion 7B is now fixed on the dev branch of KoboldCPP, which has fixed the EOS issue. The first bot response will work, but the next responses will be empty unless I make sure the recommended values are set in SillyTavern (if that happens, try a different bot too). For the tokenizer override, one option is Koboldcpp's model API tokenizer. NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - this feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. Giving an example: let's say ctx_limit is 2048, your WI/CI is 512 tokens, and you set 'summary limit' to 1024 (instead of the fixed 1,000); then there is 'extra space' for another 512 tokens (2048 - 512 - 1024). On Android you can build and run everything from Termux: pkg install clang wget git cmake, among other steps (a rough build sketch follows below).
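As a rough sketch of that Termux route (the repository URL is the upstream LostRuins one; the model download URL and filename are placeholders, not a recommendation):

    # update packages and install a build toolchain plus python
    pkg upgrade
    pkg install clang wget git cmake python
    # fetch and build koboldcpp from source
    git clone https://github.com/LostRuins/koboldcpp
    cd koboldcpp
    make
    # grab a small quantized ggml model and start the server on the default port 5001
    wget https://huggingface.co/your-favourite/ggml-model/resolve/main/model.q4_0.bin
    python koboldcpp.py model.q4_0.bin --smartcontext

On a phone, stick to small quantized models; once the server is up you can point SillyTavern (or a browser on the same device) at localhost:5001.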
Mantella is a Skyrim mod which allows you to naturally speak to NPCs using Whisper (speech-to-text), LLMs (text generation), and xVASynth (text-to-speech). I think most people are downloading and running the models locally: this is how we will be locally hosting the LLaMA model, and in this tutorial we will demonstrate how to run a Large Language Model (LLM) on your local environment using KoboldCPP. KoboldAI is "a browser-based front-end for AI-assisted writing with multiple local & remote AI models", while KoboldCPP, on the other hand, is a fork of llama.cpp; KoboldCpp is an easy-to-use AI text-generation software for GGML and GGUF models. It exposes a Kobold-compatible REST API with a subset of the endpoints, and there is also an example that goes over how to use LangChain with that API. I've recently switched to KoboldCPP + SillyTavern (there is a whole community for discussing the SillyTavern fork of TavernAI), and yes, I'm running Kobold with GPU support on an RTX 2080. I got the GitHub link, but even there I don't understand what I need to do.

Model and file notes: for .bin files, a good rule of thumb is to just go for q5_1. While I had proper SFW runs on this model despite it being optimized against literotica, I can't say I had good runs on the horni-ln version. The SuperHOT approach was discovered and developed by kaiokendev, and to use the increased context length with KoboldCpp and (when supported) llama.cpp, you presently need a recent KoboldCpp release. A Min P test build of koboldcpp adds Min P sampling. A compatible libopenblas will be required; if it is missing, the console reports that it is attempting to run without OpenBLAS, while with CLBlast enabled it prints "Attempting to use CLBlast library for faster prompt ingestion". For more info, please check the koboldcpp repository. The koboldcpp.exe window is the actual command prompt window that displays this information. As for the context, I think you can just hit the Memory button right above the input area. There is also a reported bug where the Content-Length header is not sent on the text generation API endpoints.

On Android, install Termux (download it from F-Droid; the Play Store version is outdated) and run it. Remote access takes a bit of extra work, but basically you have to run SillyTavern on a PC/laptop and then edit its whitelist; you may also want to configure ssh to use a key (a hedged sketch of that follows below).
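A hedged sketch of the ssh side of that remote setup; the host address, user name, and the idea of tunnelling the API are assumptions for illustration, while port 5001 is KoboldCpp's default mentioned earlier:

    # generate a key once and copy it to the PC that runs KoboldCpp/SillyTavern
    ssh-keygen -t ed25519 -f ~/.ssh/kobold_key
    ssh-copy-id -i ~/.ssh/kobold_key.pub user@192.168.1.50
    # tell ssh to use that key for this host (lines for ~/.ssh/config):
    #   Host kobold-pc
    #       HostName 192.168.1.50
    #       User user
    #       IdentityFile ~/.ssh/kobold_key
    # forward the KoboldCpp API port so the remote device can reach it as localhost:5001
    ssh -N -L 5001:localhost:5001 kobold-pc

This avoids exposing the API directly on the network; whitelisting in SillyTavern remains a separate, simpler alternative.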
Switch to 'Use CuBLAS' instead of 'Use OpenBLAS' if you are on a CUDA GPU (which means NVIDIA graphics cards) for massive performance gains; AMD/Intel Arc users should go for CLBlast instead, as OpenBLAS is CPU only. With KoboldCpp, you get accelerated CPU/GPU text generation and a fancy writing UI, along with access to a wealth of features and tools that enhance your experience running local LLM applications. The best part is that it's self-contained and distributable, making it easy to get started. Some llama.cpp sources like ggml-metal.h (the Apple Metal backend) are included too, however work is still being done to find the optimal implementation there.

A typical workflow: download a suitable model (Mythomax is a good start), fire up KoboldCPP, load the model, then start SillyTavern and switch the connection mode to KoboldAI. In the Mantella mod it is done by loading a model -> online sources -> Kobold API, and entering localhost:5001 there; here is a video example of the mod fully working using only offline AI tools. Welcome to KoboldAI on Google Colab, TPU Edition: KoboldAI is a powerful and easy way to use a variety of AI-based text generation experiences, and you can learn how to use the API and its features on its documentation page. 13B Llama-2 models are now giving writing as good as the old 33B Llama-1 models, and reportedly you can drop the temperature as low as 0.3 and still get meaningful output. In one test a fictional character, a 35-year-old housewife, appeared. Anyway, when I entered the prompt "tell me a story" the response in the web UI was just "Okay", but meanwhile in the console (after a really long time) I could see the rest of the output being generated.

Threads matter: psutil selects 12 threads for me, which is the number of physical cores on my CPU, though I have also manually tried setting threads to 8 (the number of performance cores). I just ran some tests and was able to massively increase the speed of generation by increasing the thread count. One reported bug: when trying to connect to koboldcpp using the KoboldAI API, SillyTavern crashes/exits. Take the following steps for basic 8k context usage; the command below will run a new kobold web service on port 5001.
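A hedged sketch of such a launch, assuming an NVIDIA card (hence CuBLAS), a 13B GGML model whose filename is a placeholder, and rope values that you should adjust for the particular model:

    # serve the Kobold API and Lite UI on port 5001 with CUDA acceleration and 8k context
    koboldcpp.exe --usecublas --gpulayers 31 --threads 8 --smartcontext --contextsize 8192 --ropeconfig 1.0 32000 --port 5001 --model mythomax-l2-13b.ggmlv3.q5_1.bin

Whether you pick linear scaling (a fractional first rope value) or a larger rope base like this depends on how the model was extended to 8k, so check its model card before settling on numbers.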
I get around the same performance as CPU-only (a 32-core 3970X vs a 3090): about 4-5 tokens per second for a 30B model.