Krystian Safjan's Bloghttps://www.safjan.com/2024-02-22T00:00:00+01:00Data Scientist | Researcher | Team Leader<br><br> working at Ernst & Young and writing about <a href="/category/machine-learning.html">Data Science and Visualization</a>, on <a href="/category/machine-learning.html">Machine Learning, Deep Learning</a> and <a href="/tag/nlp/">NLP</a>. There are also some <a href="/category/howto.html">howto</a> posts on tools and workflows.Open Source LLM Observability Tools and Platforms2024-02-22T00:00:00+01:002024-02-22T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-02-22:/open-source-llm-observability-tools-and-platforms/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<p><a id="llm-observability-in-the-context-of-llmops-for-generative-ai"></a></p>
<h2>LLM Observability in the Context of LLMOps for Generative AI</h2>
<p>AI is transforming the world, and one area where it has made significant strides is generative models, particularly Large Language Models (LLMs) such as GPT-3, which are built on the transformer architecture. However, as impressive as these models are, managing, monitoring, and understanding their behavior and output remains a challenge. Enter LLMOps, a new field focused on the management and deployment of LLMs; a key aspect of it is LLM Observability. </p>
<ul>
<li><a href="#llm-observability-in-the-context-of-llmops-for-generative-ai">LLM Observability in the Context of LLMOps for Generative AI</a></li>
<li><a href="#what-is-llm-observability">What is LLM Observability?</a></li>
<li><a href="#expected-functionalities-of-an-llm-observability-solution">Expected Functionalities of an LLM Observability Solution</a></li>
<li><a href="#open-source-llm-observability-tools-and-platforms">Open Source LLM Observability Tools and Platforms</a></li>
<li><a href="#other---related">Other - related</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="what-is-llm-observability"></a></p>
<h2>What is LLM Observability?</h2>
<p>LLM Observability is the ability to understand, monitor, and infer the internal state of an LLM from its external outputs. It encompasses several areas including model health monitoring, performance tracking, debugging, and evaluating model fairness and safety. </p>
<p>In the context of LLMOps, LLM Observability is critical. LLMs are complex and can be unpredictable, producing outputs that range from harmless to potentially harmful or biased. It's therefore essential to have the right tools and methodologies for observing and understanding these models' behaviors in real-time, during training, testing, and after deployment.</p>
<p><a id="expected-functionalities-of-an-llm-observability-solution"></a></p>
<h2>Expected Functionalities of an LLM Observability Solution</h2>
<ol>
<li>
<p><strong>Model Performance Monitoring</strong>: An observability solution should be able to track and monitor the performance of an LLM in real-time. This includes tracking metrics like accuracy, precision, recall, and F1 score, as well as more specific metrics like perplexity or token costs in the case of language models.</p>
</li>
<li>
<p><strong>Model Health Monitoring</strong>: The solution should be capable of monitoring the overall health of the model, identifying and alerting on anomalies or potentially problematic patterns in the model's behavior.</p>
</li>
<li>
<p><strong>Debugging and Error Tracking</strong>: If something does go wrong, the solution should provide debugging and error tracking functionalities, helping developers identify, trace, and fix issues.</p>
</li>
<li>
<p><strong>Fairness, Bias, and Safety Evaluation</strong>: Given the potential for bias and ethical issues in AI, any observability solution should include features for evaluating fairness and safety, helping ensure that the model's outputs are unbiased and ethically sound.</p>
</li>
<li>
<p><strong>Interpretability</strong>: LLMs can often be "black boxes", producing outputs without clear reasoning. A good observability solution should help make the model's decision-making process more transparent, providing insights into why a particular output was produced.</p>
</li>
<li>
<p><strong>Integration with Existing LLMOps Tools</strong>: Finally, the solution should be capable of integrating with existing LLMOps tools and workflows, from model development and training to deployment and maintenance.</p>
</li>
</ol>
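<p>To make the first two functionalities concrete, here is a minimal sketch of the idea (all names are illustrative, not taken from any particular tool): a wrapper that records per-call latency and rough token counts, the kind of raw signal an observability backend would aggregate and alert on:</p>

```python
import time
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    """One observed LLM call: latency plus rough token counts."""
    latency_s: float
    prompt_tokens: int
    completion_tokens: int


@dataclass
class LLMMonitor:
    """Collects per-call metrics; a real tool would export these to a backend."""
    records: list = field(default_factory=list)

    def observe(self, llm_fn, prompt: str) -> str:
        start = time.perf_counter()
        completion = llm_fn(prompt)  # the model call being observed
        self.records.append(CallRecord(
            latency_s=time.perf_counter() - start,
            prompt_tokens=len(prompt.split()),        # crude whitespace tokenization
            completion_tokens=len(completion.split()),
        ))
        return completion

    def avg_latency(self) -> float:
        return sum(r.latency_s for r in self.records) / len(self.records)
```

<p>A stub model (e.g. <code>lambda p: p.upper()</code>) is enough to see the mechanics: the wrapper is transparent to callers while metrics accumulate on the side, which is the basic pattern the tools listed below build on.</p>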
<blockquote>
<p>LLM Observability is a crucial aspect of LLMOps for generative AI. It provides the <strong>visibility</strong> and <strong>control</strong> needed <strong>to effectively manage, deploy, and maintain Large Language Models</strong>, ensuring they <strong>perform as expected, are free from bias, and are safe to use</strong>.</p>
</blockquote>
<p><a id="open-source-llm-observability-tools-and-platforms"></a></p>
<h2>Open Source LLM Observability Tools and Platforms</h2>
<ol>
<li><a href="https://github.com/aavetis/azure-openai-logger">Azure OpenAI Logger</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/aavetis/azure-openai-logger.svg?logo=github"> - "Batteries included" logging solution for your Azure OpenAI instance.</li>
<li><a href="https://github.com/deepchecks/deepchecks">Deepchecks</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/deepchecks/deepchecks.svg?logo=github"> - Tests for Continuous Validation of ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.</li>
<li><a href="https://github.com/evidentlyai/evidently">Evidently</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/evidentlyai/evidently.svg?logo=github"> - Evaluate and monitor ML models from validation to production.</li>
<li><a href="https://github.com/Giskard-AI/giskard">Giskard</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/Giskard-AI/giskard.svg?logo=github"> - Testing framework dedicated to ML models, from tabular to LLMs. Detect risks of biases, performance issues and errors in 4 lines of code.</li>
<li><a href="https://github.com/whylabs/whylogs">whylogs</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/whylabs/whylogs.svg?logo=github"> - The open standard for data logging</li>
<li><a href="https://github.com/lunary-ai/lunary">lunary</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/lunary-ai/lunary.svg?logo=github"> - The production toolkit for LLMs. observability, prompt management, and evaluations.</li>
<li><a href="https://github.com/traceloop/openllmetry">openllmetry</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/traceloop/openllmetry.svg?logo=github"> - Open-source observability for your LLM application, based on OpenTelemetry</li>
<li><a href="https://github.com/Arize-ai/phoenix">phoenix (Arize Ai)</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/Arize-ai/phoenix.svg?logo=github"> - AI Observability & Evaluation - Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.</li>
<li><a href="https://github.com/langfuse/langfuse">langfuse</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/langfuse/langfuse.svg?logo=github"> - Open source LLM engineering platform. observability, metrics, evals, prompt management SDKs + integrations for Typescript, Python</li>
<li><a href="https://github.com/whylabs/langkit">LangKit</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/whylabs/langkit.svg?logo=github"> - An open-source toolkit for monitoring Large Language Models (LLMs). Extracts signals from prompts & responses, ensuring safety & security. Features include text quality, relevance metrics, & sentiment analysis. Comprehensive tool for LLM observability.</li>
<li><a href="https://github.com/AgentOps-AI/agentops">agentops</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/AgentOps-AI/agentops.svg?logo=github"> - Python SDK for agent evals and observability</li>
<li><a href="https://github.com/pezzolabs/pezzo">pezzo</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/pezzolabs/pezzo.svg?logo=github"> - Open-source, developer-first LLMOps platform designed to streamline prompt design, version management, instant delivery, collaboration, troubleshooting, observability and more.</li>
<li><a href="https://github.com/fiddler-labs/fiddler-auditor">Fiddler AI</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/fiddler-labs/fiddler-auditor.svg?logo=github"> - Evaluate, monitor, analyse, and improve machine learning and generative models from pre-production to production. Ship more ML and LLMs into production, and monitor ML and LLM metrics like hallucination, PII, and toxicity.</li>
<li><a href="https://github.com/Theodo-UK/OmniLog">OmniLog</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/Theodo-UK/OmniLog.svg?logo=github"> - Observability tool for your LLM prompts.</li>
</ol>
<p><a id="other---related"></a></p>
<h2>Other - related</h2>
<ul>
<li><a href="https://github.com/great-expectations/great_expectations">Great Expectations</a> - Always know what to expect from your data.</li>
<li><a href="https://github.com/AgentOps-AI/tokencost">AgentOps-AI/tokencost</a> - Easy token price estimates for LLMs</li>
<li><a href="https://github.com/YANG-DB/observability-prompots">observability prompts</a> - LLM observability related prompts</li>
<li><a href="https://github.com/AstronomerAmber/LLM_Observability">LLM Observability</a> </li>
<li><a href="https://github.com/BoundaryML/baml">baml</a> - A programming language to build strongly-typed LLM functions. Testing and observability included</li>
<li><a href="https://github.com/fluxninja/aperture">aperture</a> - Rate limiting, caching, and request prioritization for modern workloads</li>
</ul>
<p><a id="references"></a></p>
<h2>References</h2>
<ul>
<li><a href="https://towardsdatascience.com/llm-monitoring-and-observability-c28121e75c2f">LLM Monitoring and Observability — A Summary of Techniques and Approaches for Responsible AI | by Josh Poduska | Towards Data Science</a></li>
<li><a href="https://www.aporia.com/learn/how-to-monitor-large-language-models/">Monitoring LLMs: Metrics, challenges, & hallucinations</a></li>
<li><a href="https://github.com/mattcvincent/intro-llm-observability">mattcvincent/intro-<em>llm</em>-<em>observability</em></a> - Intro to LLM Observability</li>
<li><a href="https://www.33rdsquare.com/what-is-perplexity-ai/">Demystifying Perplexity: An AI Expert‘s Comprehensive Guide - 33rd Square</a></li>
<li><a href="https://huggingface.co/spaces/evaluate-metric/perplexity">Perplexity - a Hugging Face Space by evaluate-metric</a></li>
</ul>The Most Powerful Mac Productivity and Automation Apps2024-01-24T00:00:00+01:002024-01-24T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-24:/the-most-powerful-mac-productivity-and-automation-apps/<ol>
<li><a href="https://www.alfredapp.com/">Alfred</a>: A productivity app for Mac OS X, which boosts your efficiency with hotkeys, keywords, text expansion, and more. </li>
<li><a href="https://folivora.ai/">BetterTouchTool</a>: Allows you to configure many types of gestures for your Mac’s Trackpad, Magic Mouse, and Keyboard.</li>
<li><a href="https://www.noodlesoft.com/">Hazel</a>: A system preference pane that works silently in the background, automatically filing, organizing, and cleaning up your desktop.</li>
<li><a href="https://support.apple.com/guide/automator/welcome/mac">Automator</a>: A built-in Mac utility for automating tasks. You can create workflows, watch folders, and set up automated actions.</li>
<li><a href="https://www.keyboardmaestro.com/">Keyboard Maestro</a>: Enhances the power of your keyboard by creating macros that can automate virtually anything on your Mac. </li>
<li><a href="https://qsapp.com/">QuickSilver</a>: A light, fast, and free Mac application launcher that also replaces your task switcher.</li>
<li><a href="https://culturedcode.com/things/">Things</a>: Task management software that makes it easy to stay organized and get things done.</li>
<li><a href="https://ulysses.app/">Ulysses</a>: A feature-rich text editor for writers that allows you to manage and organize all your writing in a single app. </li>
<li><a href="https://www.folivora.ai/bettersnaptool">BetterSnapTool</a>: Allows users to quickly and easily manage their window positions and sizes by either dragging them to one of the screen's corners or to the top, left or right side of the screen.</li>
<li><a href="https://www.macbartender.com/">Bartender</a>: Lets you organize your menu bar apps by hiding them, rearranging them, or moving them to the Bartender Bar. </li>
<li><a href="http://magnet.crowdcafe.com/">Magnet</a>: Keeps your workspace organized and allows you to snap application windows in different halves or quarters of your screen.</li>
</ol>Avoid using curl -u “username:secret”!2024-01-20T00:00:00+01:002024-01-20T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-20:/avoid-using-curl-u-usernamesecret/<p>When invoking an endpoint guarded by Basic Authentication, you might resort to the -u username:password feature in curl.</p>
<p><code>curl -u "jane@examplewebsite.com:mySecretGuard" http://api.myawesomeapp.com/information</code></p>
<p>However, this approach is not the most efficient or secure.</p>
<p>In executing this command, the credentials are archived in your shell history, posing a considerable security threat.</p>
<p>On the bright side, there's a straightforward solution to this issue!</p>
<p>Now you can generate a file in your home directory titled <code>.netrc</code> as shown below:</p>
<div class="highlight"><pre><span></span><code><span class="n">machine</span><span class="w"> </span><span class="n">api</span><span class="p">.</span><span class="n">myawesomeapp</span><span class="p">.</span><span class="n">com</span><span class="w"> </span>
<span class="w"> </span><span class="n">login</span><span class="w"> </span><span class="n">jane</span><span class="nv">@examplewebsite</span><span class="p">.</span><span class="n">com</span><span class="w"> </span>
<span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="n">mySecretGuard</span><span class="w"> </span>
</code></pre></div>
<p>Afterwards, when running the curl command, just include -n and the credentials will be fetched from the file you just created.</p>
<p><code>curl -n http://api.myawesomeapp.com/information</code></p>
<p>To give you more context, curl is a command-line tool for getting or sending data using URL syntax. It supports various protocols, including but not limited to HTTP, HTTPS, FTP, and FTPS. Curl is widely used for making API requests.</p>
<p>In addition, the <code>.netrc</code> file is a special file that stores login and initialisation information used by the auto-login process. It generally resides in the user's home directory. This file can contain information like the name of the machine to which to connect, and any necessary usernames and passwords.</p>
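<p>As an aside, <code>.netrc</code> is understood by more than curl: Python's standard <code>netrc</code> module parses the same format, which is convenient for scripts that share credentials with command-line tools. A small sketch using the illustrative host and credentials from this post (written to a temporary file here rather than the real <code>~/.netrc</code>):</p>

```python
import netrc
import tempfile

# The same machine/login/password format that curl's -n option reads.
content = (
    "machine api.myawesomeapp.com\n"
    "  login jane@examplewebsite.com\n"
    "  password mySecretGuard\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(content)
    path = f.name

# authenticators() returns a (login, account, password) tuple for the machine.
login, account, password = netrc.netrc(path).authenticators("api.myawesomeapp.com")
print(login, password)  # jane@examplewebsite.com mySecretGuard
```

<p>For the real <code>~/.netrc</code>, keep permissions tight (e.g. <code>chmod 600 ~/.netrc</code>): the file stores a plaintext password, and some tools refuse to read a netrc that other users can access.</p>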
<p>On a final note, remember that this method works only with the curl command. Other command-line tools may require different approaches to secure authentication. Always prioritise data security by opting for methods that safeguard your login credentials.</p>HTML5 interactive elements2024-01-04T00:00:00+01:002024-01-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-04:/html5-interactive-elements/<h1>HTML5 Interactive Elements: An Overview and Usage Guide</h1>
<p>HyperText Markup Language (HTML) is the standard markup language for documents designed to be rendered in a web browser. Over the years, HTML has evolved to keep up with the growing need for better structure and interactivity. </p>
<p>HTML5, the latest version, introduces several interactive tags or elements, which makes building interactive, dynamic web content easier without having to resort to JavaScript or CSS. Let's dive into these interactive elements and have a look at some examples to understand their usage better.</p>
<h2>The <code><details></code> and <code><summary></code> Elements</h2>
<p>The <code><details></code> and <code><summary></code> tags allow us to create an interactive widget that the user can open or close. The <code><summary></code> tag is a child of the <code><details></code> tag, representing the summary or brief description of the content in <code><details></code>.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">details</span><span class="p">></span>
<span class="p"><</span><span class="nt">summary</span><span class="p">></span>The Solar System<span class="p"></</span><span class="nt">summary</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span>The Solar System includes the Sun, the Earth (where you are now!) and all the other planets.<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="p"></</span><span class="nt">details</span><span class="p">></span>
</code></pre></div>
<details>
<summary>The Solar System</summary>
<p>The Solar System includes the Sun, the Earth (where you are now!) and all the other planets.</p>
</details>
<h2>The <code><dialog></code> Element</h2>
<p>The <code><dialog></code> element presents content in a dialogue box or a window. You can toggle the visibility of the <code><dialog></code> by changing the 'open' attribute.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">dialog</span> <span class="na">open</span><span class="p">></span>
This is a dialog box!<span class="p"><</span><span class="nt">br</span><span class="p">></span>
<span class="p"><</span><span class="nt">button</span> <span class="na">onclick</span><span class="o">=</span><span class="s">"this.parentElement.close()"</span><span class="p">></span>Close<span class="p"></</span><span class="nt">button</span><span class="p">></span>
<span class="p"></</span><span class="nt">dialog</span><span class="p">></span>
</code></pre></div>
<p><dialog open>
This is a dialog box!<br>
<button onclick="this.parentElement.close()">Close</button>
</dialog></p>
<h2>The <code><datalist></code> Element</h2>
<p>The <code><datalist></code> element permits the creation of pre-defined options for an <code><input></code> element. Users can either select one of the options or type their own value.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">label</span> <span class="na">for</span><span class="o">=</span><span class="s">"browsers"</span><span class="p">></span>Choose a browser from the list:<span class="p"></</span><span class="nt">label</span><span class="p">></span>
<span class="p"><</span><span class="nt">input</span> <span class="na">list</span><span class="o">=</span><span class="s">"browsers"</span> <span class="na">name</span><span class="o">=</span><span class="s">"browser"</span> <span class="na">id</span><span class="o">=</span><span class="s">"browser"</span><span class="p">></span>
<span class="p"><</span><span class="nt">datalist</span> <span class="na">id</span><span class="o">=</span><span class="s">"browsers"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Chrome"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Firefox"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Internet Explorer"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Opera"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Safari"</span><span class="p">></span>
<span class="p"></</span><span class="nt">datalist</span><span class="p">></span>
</code></pre></div>
<p><label for="browsers">Choose a browser from the list:</label>
<input list="browsers" name="browser" id="browser">
<datalist id="browsers">
<option value="Chrome">
<option value="Firefox">
<option value="Internet Explorer">
<option value="Opera">
<option value="Safari">
</datalist>
<h2>The <code><progress></code> Element</h2>
<p>The <code><progress></code> element serves to represent the progress of a task. Use the <code>value</code> attribute to specify the current progress and the <code>max</code> attribute to indicate the progress bar's maximum value.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">progress</span> <span class="na">value</span><span class="o">=</span><span class="s">"70"</span> <span class="na">max</span><span class="o">=</span><span class="s">"100"</span><span class="p">></</span><span class="nt">progress</span><span class="p">></span>
</code></pre></div>
<progress value="70" max="100"></progress>
<h2>The <code><meter></code> Element</h2>
<p>The <code><meter></code> tag is used to represent a scalar measurement within a known range, or a fractional value. This could be disk usage, the relevance of a query result, or any other form of gauge.</p>
<div class="highlight"><pre><span></span><code>Disk usage: <span class="p"><</span><span class="nt">meter</span> <span class="na">value</span><span class="o">=</span><span class="s">"0.6"</span><span class="p">></span>60%<span class="p"></</span><span class="nt">meter</span><span class="p">></span>
</code></pre></div>
Disk usage: <meter value="0.6">60%</meter>
<h2>The <code><output></code> Element</h2>
<p>The <code><output></code> tag is a container for calculation results. To link the output element with other elements, you can use the <code>for</code> attribute.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">form</span> <span class="na">oninput</span><span class="o">=</span><span class="s">"x.value=parseInt(a.value)+parseInt(b.value)"</span><span class="p">></span>
0<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"range"</span> <span class="na">id</span><span class="o">=</span><span class="s">"a"</span> <span class="na">value</span><span class="o">=</span><span class="s">"50"</span><span class="p">></span>100 +
0<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"range"</span> <span class="na">id</span><span class="o">=</span><span class="s">"b"</span> <span class="na">value</span><span class="o">=</span><span class="s">"50"</span><span class="p">></span>100 =
<span class="p"><</span><span class="nt">output</span> <span class="na">name</span><span class="o">=</span><span class="s">"x"</span> <span class="na">for</span><span class="o">=</span><span class="s">"a b"</span><span class="p">></</span><span class="nt">output</span><span class="p">></span>
<span class="p"></</span><span class="nt">form</span><span class="p">></span>
</code></pre></div>
<form oninput="x.value=parseInt(a.value)+parseInt(b.value)">
0<input type="range" id="a" value="50">100 +
0<input type="range" id="b" value="50">100 =
<output name="x" for="a b"></output>
</form>
<h2>The <code><canvas></code> Element</h2>
<p>The <code><canvas></code> tag allows for dynamic and scriptable rendering of shapes and bitmap images. It's a low-level, procedural model that updates a bitmap.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">canvas</span> <span class="na">id</span><span class="o">=</span><span class="s">"myCanvas"</span> <span class="na">width</span><span class="o">=</span><span class="s">"200"</span> <span class="na">height</span><span class="o">=</span><span class="s">"100"</span> <span class="na">style</span><span class="o">=</span><span class="s">"border:1px solid #000000;"</span><span class="p">></span>
<span class="p"></</span><span class="nt">canvas</span><span class="p">></span>
</code></pre></div>
<p>You can then use JavaScript to interact with this element:</p>
<div class="highlight"><pre><span></span><code><span class="kd">var</span><span class="w"> </span><span class="nx">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s2">"myCanvas"</span><span class="p">);</span>
<span class="kd">var</span><span class="w"> </span><span class="nx">ctx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nx">c</span><span class="p">.</span><span class="nx">getContext</span><span class="p">(</span><span class="s2">"2d"</span><span class="p">);</span>
<span class="nx">ctx</span><span class="p">.</span><span class="nx">fillStyle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"#FF0000"</span><span class="p">;</span>
<span class="nx">ctx</span><span class="p">.</span><span class="nx">fillRect</span><span class="p">(</span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">80</span><span class="p">,</span><span class="w"> </span><span class="mf">80</span><span class="p">);</span>
</code></pre></div>
<canvas id="myCanvas" width="200" height="100" style="border:1px solid #000000;">
</canvas>
<script>
var c = document.getElementById("myCanvas");
var ctx = c.getContext("2d");
ctx.fillStyle = "#FF0000";
ctx.fillRect(0, 0, 80, 80);
</script>
</p>entr - run arbitrary command when files change2024-01-01T00:00:00+01:002024-01-01T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-01:/entr-run-arbitrary-command-when-files-change/<p><code>entr</code> is a UNIX utility which runs arbitrary commands when files change. It helps in automating tasks during development such as rebuilding projects, running tests, or syncing files.</p>
<p>Here's a simple usage example:</p>
<div class="highlight"><pre><span></span><code>ls *.c | entr make
</code></pre></div>
<p>In the above example, <code>ls *.c</code> lists all C files in the directory. This list is piped (<code>|</code>) into <code>entr</code>. When any of these files changes, <code>entr</code> executes the <code>make</code> command.</p>
<p>Some key features of <code>entr</code> include:</p>
<ul>
<li>It frees up developers to focus on the code by automating rebuild tasks.</li>
<li>It doesn't require a configuration file or a list of tasks to run. It just reruns the command you provide it each time a file changes.</li>
<li>You can use it with any command that needs to operate on a file. This might be shell commands, like <code>ls</code> or <code>echo</code>, or any other CLI tool you have in your system. </li>
</ul>
<p>Useful options for <code>entr</code> include:</p>
<ul>
<li><code>-r</code> : To restart a long running process like a server when a file changes.</li>
<li><code>-p</code> : Postpone execution until files are updated.</li>
<li><code>-s</code> : Evaluate the first argument using the interpreter specified by the SHELL environment variable.</li>
<li><code>-d</code> : Track directories recursively and include files that are created after the utility starts.</li>
</ul>
<p>Please note that <code>entr</code> requires a list of files as input. It does not discover files on its own; it expects to receive a list of files from stdin, which is usually supplied with command-line utilities like <code>ls</code>, <code>find</code> or <code>git ls-files</code>.</p>Tverski Similarity Metrics2023-12-10T00:00:00+01:002023-12-10T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-10:/tverski-similarity-metrics/<p>Tversky similarity and <a href="https://www.safjan.com/jaro-winkler-similarity/">Jaro-Winkler similarity</a> are two different similarity metrics used to compare strings or sequences. They are designed for specific purposes and have different mathematical formulas and applications.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#tversky-similarity">Tversky Similarity</a></li>
<li><a href="#formula">Formula</a></li>
<li><a href="#python-example">Python Example</a></li>
<li><a href="#jaro-winkler-similarity-for-reference">Jaro-Winkler Similarity (for reference)</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="tversky-similarity"></a></p>
<h2>Tversky Similarity</h2>
<p><strong>Tversky similarity is a metric used to compare sets</strong>, typically in the context of information retrieval, retrieval evaluation, and recommendation systems. It was introduced by Amos Tversky in his 1977 work on features of similarity. Tversky similarity takes into account the <strong>number of common elements</strong> between two sets as well as the <strong>differences in elements between them</strong>. It has two parameters, alpha and beta, which control the balance between precision and recall.</p>
<p>Let's dive into the mathematical formula, explanation, and Python examples for Tversky similarity.</p>
<p><a id="formula"></a></p>
<h3>Formula</h3>
<p>Tversky similarity measures the similarity between two sets A and B, considering the trade-off between false positives and false negatives. The formula for Tversky similarity is:</p>
<div class="math">$$
Tversky(A, B) = \frac{|A \cap B|}{|A \cap B| + \alpha |A - B| + \beta |B - A|}
$$</div>
<p>Where:
- <span class="math">\(|A \cap B|\)</span> is the size of the intersection of sets A and B.
- <span class="math">\(|A - B|\)</span> is the size of the set difference of A minus B.
- <span class="math">\(|B - A|\)</span> is the size of the set difference of B minus A.
- <span class="math">\(\alpha\)</span> and <span class="math">\(\beta\)</span> are parameters that control the trade-off between precision and recall. When <span class="math">\(\alpha = \beta = 1\)</span>, the Tversky similarity becomes the Jaccard similarity.</p>
<p><a id="python-example"></a></p>
<h3>Python Example</h3>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">tversky_similarity</span><span class="p">(</span><span class="n">set_a</span><span class="p">,</span> <span class="n">set_b</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="p">):</span>
<span class="n">intersection</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">set_a</span><span class="o">.</span><span class="n">intersection</span><span class="p">(</span><span class="n">set_b</span><span class="p">))</span>
<span class="n">a_minus_b</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">set_a</span><span class="o">.</span><span class="n">difference</span><span class="p">(</span><span class="n">set_b</span><span class="p">))</span>
<span class="n">b_minus_a</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">set_b</span><span class="o">.</span><span class="n">difference</span><span class="p">(</span><span class="n">set_a</span><span class="p">))</span>
<span class="n">similarity</span> <span class="o">=</span> <span class="n">intersection</span> <span class="o">/</span> <span class="p">(</span><span class="n">intersection</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">a_minus_b</span> <span class="o">+</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">b_minus_a</span><span class="p">)</span>
<span class="k">return</span> <span class="n">similarity</span>
<span class="n">set1</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"apple"</span><span class="p">,</span> <span class="s2">"banana"</span><span class="p">,</span> <span class="s2">"cherry"</span><span class="p">}</span>
<span class="n">set2</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"banana"</span><span class="p">,</span> <span class="s2">"cherry"</span><span class="p">,</span> <span class="s2">"date"</span><span class="p">,</span> <span class="s2">"elderberry"</span><span class="p">}</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="n">beta</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="n">similarity</span> <span class="o">=</span> <span class="n">tversky_similarity</span><span class="p">(</span><span class="n">set1</span><span class="p">,</span> <span class="n">set2</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Tversky Similarity:"</span><span class="p">,</span> <span class="n">similarity</span><span class="p">)</span>
</code></pre></div>
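<p>As a quick sanity check of the function, the example values above can be worked out by hand (this standalone sketch mirrors the snippet, with <code>alpha = beta = 0.5</code>):</p>

```python
def tversky_similarity(set_a, set_b, alpha, beta):
    # Size of the overlap and of each side's unique elements
    intersection = len(set_a & set_b)
    a_minus_b = len(set_a - set_b)
    b_minus_a = len(set_b - set_a)
    return intersection / (intersection + alpha * a_minus_b + beta * b_minus_a)

set1 = {"apple", "banana", "cherry"}
set2 = {"banana", "cherry", "date", "elderberry"}

# |A ∩ B| = 2, |A - B| = 1, |B - A| = 2, so with alpha = beta = 0.5:
# 2 / (2 + 0.5*1 + 0.5*2) = 2 / 3.5 ≈ 0.5714
print(round(tversky_similarity(set1, set2, 0.5, 0.5), 4))  # 0.5714
```

<p>With <code>alpha = beta = 1</code> the same call returns 2/5 = 0.4, the Jaccard similarity of the two sets.</p>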
<p><a id="jaro-winkler-similarity-for-reference"></a></p>
<h2>Jaro-Winkler Similarity (for reference)</h2>
<p>Jaro-Winkler similarity is a metric used to compare two strings, often used in record linkage and fuzzy string matching tasks. It builds on the Jaro distance developed by Matthew A. Jaro, which William E. Winkler later extended. Jaro-Winkler similarity calculates a score between 0 and 1, where 1 indicates a perfect match and 0 indicates no similarity. It considers the number of matching characters between two strings and the positions of those matching characters, and gives more weight to a common prefix, making it particularly useful for comparing names and short strings. For more information about Jaro-Winkler similarity see: <a href="https://www.safjan.com/jaro-winkler-similarity/">Jaro-Winkler Similarity</a>.</p>
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>The main differences between Tversky similarity and Jaro-Winkler similarity are:</p>
<ul>
<li><strong>Application Domain:</strong> Tversky similarity is used to compare sets, while Jaro-Winkler similarity is used to compare strings.</li>
<li><strong>Parameters:</strong> Tversky similarity has parameters alpha and beta to control precision and recall, while Jaro-Winkler similarity does not have such parameters.</li>
<li><strong>Target Data:</strong> Tversky similarity works with sets of items, while Jaro-Winkler similarity works with individual strings.</li>
<li><strong>Use Cases:</strong> Tversky similarity is commonly used in information retrieval and recommendation systems, while Jaro-Winkler similarity is used in fuzzy string matching and record linkage tasks.</li>
</ul>
<p>X::<a href="https://www.safjan.com/jaro-winkler-similarity/">Jaro-Winkler Similarity</a></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>GitHub Search Techniques2023-12-07T00:00:00+01:002023-12-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-07:/github-search-techniques/<ol>
<li>
<p><strong>Search By Name</strong>: Use "in:name" along with your search term to find repositories with that name. Example: "Ruby-Projects in:name".</p>
</li>
<li>
<p><strong>Search By Description</strong>: Use "in:description" along with your search term to find repositories with that term in their description. Example …</p></li></ol><ol>
<li>
<p><strong>Search By Name</strong>: Use "in:name" along with your search term to find repositories with that name. Example: "Ruby-Projects in:name".</p>
</li>
<li>
<p><strong>Search By Description</strong>: Use "in:description" along with your search term to find repositories with that term in their description. Example: "machine learning in:description".</p>
</li>
<li>
<p><strong>Search By Readme</strong>: Use "in:readme" along with your search term to find repositories with that term in their README file. Example: "learn ruby in:readme".</p>
</li>
<li>
<p><strong>Search By Topic</strong>: Use "topic:" followed by a topic name to find repositories tagged with that topic. Example: "topic:mobile-development".</p>
</li>
<li>
<p><strong>Search By Organization</strong>: Use "org:" along with your search term to find repositories from a specific organization. Example: "org:Microsoft".</p>
</li>
<li>
<p><strong>Search By License</strong>: Use "license:" along with your search term to find open-source repositories that match a certain license. Example: "license:Apache-2.0".</p>
</li>
<li>
<p><strong>Search By Stars</strong>: Use "stars:>" followed by a number to find repositories with more than that number of stars (use "stars:>=" to include the number itself). Example: "stars:>1000".</p>
</li>
<li>
<p><strong>Search By Date</strong>: Use "created:" or "pushed:" followed by a date in the format "YYYY-MM-DD" to find repositories created or updated after a certain date. Example: "created:>2023-06-01".</p>
</li>
<li>
<p><strong>Search By Forks</strong>: Use "forks:>" followed by a number to find repositories that have been forked more than that number of times. Example: "forks:>1000".</p>
</li>
<li>
<p><strong>Search By Language</strong>: Use "language:" with your search term to find repositories in a specific programming language. Example: "language:ruby".</p>
</li>
<li>
<p><strong>Search by Last Push</strong>: Use "pushed:>" followed by a date to find repositories updated after a certain date. Example: "pushed:>2023-03-01 rails".</p>
</li>
</ol>
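<p>These qualifiers also work programmatically. As a rough sketch (the query terms are arbitrary examples), a combined query can be percent-encoded for GitHub's repository search API endpoint:</p>

```python
from urllib.parse import quote

# Combine several qualifiers into a single search query
query = "language:ruby stars:>1000 pushed:>2023-03-01"

# Percent-encode it for the GitHub search API; fetch with any HTTP client
url = "https://api.github.com/search/repositories?q=" + quote(query, safe="")
print(url)
```

<p>The same encoded query string works in the browser address bar or with <code>curl</code>.</p>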
<p>These qualifiers can be combined freely, turning the task of finding the right repository among millions into a quick and productive one.</p>Databricks Curriculum - From Zero to Hero2023-12-04T00:00:00+01:002023-12-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-04:/databricks-curriculum-from-zero-to-hero/<h2>Stage 1: Beginner</h2>
<h3>Topic 1: Introduction to Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> None</li>
<li><strong>Enables:</strong> Understanding of what Databricks is and what it can do.</li>
<li>
<p><strong>Reasoning:</strong> As a starting point, you need to understand what Databricks is and why it's used.</p>
</li>
<li>
<p>Understand the concept of Databricks …</p></li></ul><h2>Stage 1: Beginner</h2>
<h3>Topic 1: Introduction to Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> None</li>
<li><strong>Enables:</strong> Understanding of what Databricks is and what it can do.</li>
<li>
<p><strong>Reasoning:</strong> As a starting point, you need to understand what Databricks is and why it's used.</p>
</li>
<li>
<p>Understand the concept of Databricks</p>
</li>
<li>Learn about the history and evolution of Databricks</li>
<li>Understand the benefits and use-cases of Databricks</li>
<li>Explore the architecture of Databricks</li>
</ul>
<h3>Topic 2: Setting up Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Introduction to Databricks</li>
<li><strong>Enables:</strong> Ability to setup and navigate the Databricks environment.</li>
<li>
<p><strong>Reasoning:</strong> Before you can use Databricks, you need to know how to set it up and navigate the platform.</p>
</li>
<li>
<p>Create a Databricks account</p>
</li>
<li>Understand the Databricks workspace</li>
<li>Learn how to create a Databricks cluster</li>
<li>Learn how to create notebooks and libraries</li>
<li>Understand how to manage and monitor clusters</li>
</ul>
<h3>Topic 3: Introduction to Apache Spark</h3>
<ul>
<li><strong>Prerequisites:</strong> Setting up Databricks</li>
<li><strong>Enables:</strong> Understanding of Apache Spark and its importance in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> Databricks is built on Apache Spark, so understanding Spark is crucial.</p>
</li>
<li>
<p>Understand the concept of Apache Spark</p>
</li>
<li>Learn about the history and evolution of Apache Spark</li>
<li>Understand the architecture of Apache Spark</li>
<li>Explore the core components of Spark: Spark SQL, Spark Streaming, MLlib, and GraphX</li>
<li>Understand how Spark integrates with Databricks</li>
</ul>
<h3>Topic 4: Basic Data Processing with Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Introduction to Apache Spark</li>
<li><strong>Enables:</strong> Ability to perform basic data processing tasks in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> Data processing is a key function of Databricks.</p>
</li>
<li>
<p>Understand the concept of data processing</p>
</li>
<li>Learn how to load and inspect data in Databricks</li>
<li>Understand the basic operations on data such as filtering, aggregation, and transformation</li>
<li>Learn how to visualize data in Databricks</li>
<li>Understand how to save and export processed data</li>
</ul>
<h2>Stage 2: Intermediate</h2>
<h3>Topic 5: DataFrames and SQL in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Basic Data Processing with Databricks</li>
<li><strong>Enables:</strong> Ability to use DataFrames and SQL for data manipulation in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> DataFrames and SQL are essential tools for data manipulation in Databricks.</p>
</li>
<li>
<p>Understand the concept of DataFrames in Spark</p>
</li>
<li>Learn how to create DataFrames from different data sources</li>
<li>Perform operations on DataFrames such as select, filter, and aggregate</li>
<li>Understand the concept of SQL in Spark</li>
<li>Learn how to perform SQL queries on DataFrames</li>
<li>Understand how to convert between DataFrames and SQL</li>
</ul>
<h3>Topic 6: ETL Processes in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> DataFrames and SQL in Databricks</li>
<li><strong>Enables:</strong> Understanding and implementation of ETL processes in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> ETL (Extract, Transform, Load) processes are a key part of data processing in Databricks.</p>
</li>
<li>
<p>Understand the concept of ETL (Extract, Transform, Load)</p>
</li>
<li>Learn how to extract data from different sources in Databricks</li>
<li>Understand how to transform data using Spark transformations</li>
<li>Learn how to load data into different destinations</li>
<li>Perform a complete ETL process on a sample dataset</li>
</ul>
<h3>Topic 7: Machine Learning with Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> ETL Processes in Databricks</li>
<li><strong>Enables:</strong> Ability to use Databricks for machine learning tasks.</li>
<li>
<p><strong>Reasoning:</strong> Machine learning is a powerful tool for data analysis, and Databricks provides robust support for machine learning tasks.</p>
</li>
<li>
<p>Understand the concept of machine learning</p>
</li>
<li>Learn about the machine learning library in Spark (MLlib)</li>
<li>Understand the machine learning workflow: data preparation, model training, model evaluation, and model deployment</li>
<li>Learn how to prepare data for machine learning</li>
<li>Train and evaluate a machine learning model on a sample dataset</li>
</ul>
<h2>Stage 3: Advanced</h2>
<h3>Topic 8: Stream Processing in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Machine Learning with Databricks</li>
<li><strong>Enables:</strong> Ability to handle real-time data streams in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> Real-time data processing is a critical capability in many data-intensive applications.</p>
</li>
<li>
<p>Understand the concept of stream processing</p>
</li>
<li>Learn about Spark Streaming and its integration with Databricks</li>
<li>Understand how to ingest real-time data streams</li>
<li>Learn how to perform transformations and actions on data streams</li>
<li>Understand how to output data streams to various destinations</li>
</ul>
<h3>Topic 9: Advanced Spark Programming in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Stream Processing in Databricks</li>
<li><strong>Enables:</strong> Mastery of advanced Spark programming techniques in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> To fully leverage the power of Databricks, you need to be proficient in advanced Spark programming techniques.</p>
</li>
<li>
<p>Deepen understanding of Spark's core concepts</p>
</li>
<li>Learn about Spark's advanced features such as Spark's Catalyst Optimizer, Tungsten Execution Engine, and GraphX for graph processing</li>
<li>Understand how to optimize Spark applications for performance</li>
<li>Learn how to debug and troubleshoot Spark applications</li>
<li>Understand how to manage and monitor Spark applications in Databricks</li>
</ul>
<h3>Topic 10: Databricks for Data Science</h3>
<ul>
<li><strong>Prerequisites:</strong> Advanced Spark Programming in Databricks</li>
<li><strong>Enables:</strong> Ability to use Databricks as a tool for advanced data science tasks.</li>
<li>
<p><strong>Reasoning:</strong> Databricks is a powerful tool for data science, and mastering its use for these tasks will enable you to tackle complex data science problems.</p>
</li>
<li>
<p>Understand how Databricks can be used for advanced data science tasks</p>
</li>
<li>Learn about Databricks' integration with popular data science libraries and tools</li>
<li>Understand how to perform exploratory data analysis in Databricks</li>
<li>Learn how to build, evaluate, and tune advanced machine learning models</li>
<li>Understand how to deploy machine learning models in Databricks</li>
</ul>
<p>This curriculum provides a comprehensive path from beginner to advanced user of Databricks. By following this path, you will gain a deep understanding of Databricks and be able to use it effectively for a wide range of data processing and data science tasks.</p>Databricks - key concepts2023-12-04T00:00:00+01:002023-12-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-04:/databricks-key-concepts/<script type="module"> import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'; mermaid.initialize({ startOnLoad: true }); </script>
<pre class="mermaid">
mindmap
Databricks
Databricks Workspace
Databricks Runtime
Databricks File System (DBFS)
Databricks Clusters
Databricks Notebooks
Databricks Jobs
Databricks Tables
</pre>
<p>Here are some of the …</p><script type="module"> import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'; mermaid.initialize({ startOnLoad: true }); </script>
<pre class="mermaid">
mindmap
Databricks
Databricks Workspace
Databricks Runtime
Databricks File System (DBFS)
Databricks Clusters
Databricks Notebooks
Databricks Jobs
Databricks Tables
</pre>
<p>Here are some of the key features and components of Databricks:</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#databricks-workspace">Databricks Workspace</a></li>
<li><a href="#databricks-runtime">Databricks Runtime</a></li>
<li><a href="#databricks-file-system-dbfs">Databricks File System (DBFS)</a></li>
<li><a href="#databricks-clusters">Databricks Clusters</a></li>
<li><a href="#databricks-notebooks">Databricks Notebooks</a></li>
<li><a href="#databricks-jobs">Databricks Jobs</a></li>
<li><a href="#databricks-tables">Databricks Tables</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="databricks-workspace"></a></p>
<h2>Databricks Workspace</h2>
<p>This is the collaborative environment where you can write code, create visualizations, and share your work with others. It supports several languages including Python, SQL, R, and Scala.
Read more: <a href="https://docs.databricks.com/en/administration-guide/workspace/index.html#what-is-a-workspace">Create and manage your Databricks workspaces | Databricks on AWS</a></p>
<p><a id="databricks-runtime"></a></p>
<h2>Databricks Runtime</h2>
<p>This is the set of core components that run on the clusters in Databricks. It includes Apache Spark but also includes other enhancements maintained by Databricks like performance optimizations, security, and integration with other tools like Delta Lake and MLflow.
Read more: <a href="https://www.databricks.com/glossary/what-is-databricks-runtime">What is Databricks Runtime?</a></p>
<p><a id="databricks-file-system-dbfs"></a></p>
<h2>Databricks File System (DBFS)</h2>
<p>This is a distributed file system installed on Databricks clusters. It allows you to store data and share it across all nodes in a cluster.
Read more: <a href="https://docs.databricks.com/en/dbfs/index.html">What is the Databricks File System (DBFS)?</a></p>
<p><a id="databricks-clusters"></a></p>
<h2>Databricks Clusters</h2>
<p>These are the compute resources that run your code. You can create clusters of different sizes and types depending on your workload.
Read more: <a href="https://learn.microsoft.com/en-us/azure/databricks/clusters/">Compute - Azure Databricks</a></p>
<p><a id="databricks-notebooks"></a></p>
<h2>Databricks Notebooks</h2>
<p>These are collaborative documents that contain code, visualizations, and text. They're great for exploratory data analysis, data science, and machine learning workflows.
Read more: <a href="https://docs.databricks.com/en/notebooks/index.html">Introduction to Databricks notebooks</a></p>
<p><a id="databricks-jobs"></a></p>
<h2>Databricks Jobs</h2>
<p>These are the tasks or computations you run on Databricks. You can schedule jobs to run periodically, or run them on demand.
Read more: <a href="https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html">Create and run Databricks Jobs</a></p>
<p><a id="databricks-tables"></a></p>
<h2>Databricks Tables</h2>
<p>These are the structured data sources that you can query using SQL or data frame APIs in Python, R, and Scala.
Read more: <a href="https://www.databricks.com/product/delta-live-tables">Delta Live Tables</a></p>Semantic Type Detection2023-12-01T00:00:00+01:002023-12-01T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-01:/semantic-type-detection/<p>Semantic type detection is an important task in table representation learning, as it involves labeling table columns with standardized semantic types. This can help with <strong>understanding the contents of a table</strong> and can be used for various applications such as data discovery …</p><p>Semantic type detection is an important task in table representation learning, as it involves labeling table columns with standardized semantic types. This can help with <strong>understanding the contents of a table</strong> and can be used for various applications such as data discovery, data validation, and data integration. By accurately detecting the semantic types of columns, machine learning <strong>models can better understand the relationships between columns</strong> and <strong>improve their performance</strong> on tasks like table comprehension and data discovery. Additionally, semantic type detection can help with data integration, as it can help map columns from different sources that may have different naming conventions or formats.</p>
<p>X::<a href="https://www.safjan.com/table-representation-learning/">Table Representation Learning</a></p>Table Representation Learning2023-12-01T00:00:00+01:002023-12-01T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-01:/table-representation-learning/<p>Table representation learning is an exciting field that focuses on understanding the structure and relationships within tabular data. This can involve <strong>learning embeddings for individual columns</strong> or <strong>entire tables</strong>, and can be used for various applications such as data discovery, data validation …</p><p>Table representation learning is an exciting field that focuses on understanding the structure and relationships within tabular data. This can involve <strong>learning embeddings for individual columns</strong> or <strong>entire tables</strong>, and can be used for various applications such as data discovery, data validation, and data integration.</p>
<p>One key aspect of table representation learning is <strong>understanding the semantics of column</strong>s, which can be used to <strong>generate metadata</strong> and help with tasks like <strong>table comprehension</strong> and <strong>data discovery</strong>.</p>
<p>By accurately representing columns and their relationships, table representation learning can help improve machine learning models and enable more complex analysis of tabular data.</p>
<p>X::<a href="https://www.safjan.com/semantic-type-detection/">Semantic Type Detection</a></p>Using Mermaid Diagrams in Pelican Blog Post2023-11-28T00:00:00+01:002023-11-28T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-28:/mermaid-in-pelican-post/<p>Sometimes, you might want to embed the mermaid diagram in your blogpost written in markdown. Here is how to do it.</p>
<h2>Embed the HTML code (recommended)</h2>
<p>In your markdown file, you can embed HTML code loading mermaid code and initialising it, then …</p><p>Sometimes, you might want to embed the mermaid diagram in your blogpost written in markdown. Here is how to do it.</p>
<h2>Embed the HTML code (recommended)</h2>
<p>In your markdown file, you can embed HTML code that loads mermaid and initialises it, then include the mermaid diagram you want.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">script</span> <span class="na">type</span><span class="o">=</span><span class="s">"module"</span><span class="p">></span><span class="w"> </span><span class="k">import</span><span class="w"> </span><span class="nx">mermaid</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s1">'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'</span><span class="p">;</span><span class="w"> </span><span class="nx">mermaid</span><span class="p">.</span><span class="nx">initialize</span><span class="p">({</span><span class="w"> </span><span class="nx">startOnLoad</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">});</span><span class="w"> </span><span class="p"></</span><span class="nt">script</span><span class="p">></span>
Here is a mermaid diagram:
<span class="p"><</span><span class="nt">pre</span> <span class="na">class</span><span class="o">=</span><span class="s">"mermaid"</span><span class="p">></span>
graph TD
A[Client] --> B[Load Balancer]
B --> C[Server01]
B --> D[Server02]
<span class="p"></</span><span class="nt">pre</span><span class="p">></span>
</code></pre></div>
<h2>Extension</h2>
<p>There is also a Markdown extension, though it may not work with recent versions:</p>
<p><a href="https://github.com/Lee-W/md_mermaid">Lee-W/md_mermaid</a> - mermaid extension to add support for mermaid graph inside markdown file. NOTE: you need Markdown<3.2 (e.g. 3.1.1)</p>Store Output of the Command Into Array in Bash2023-11-13T00:00:00+01:002023-11-13T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-13:/store-output-of-the-command-into-array-in-bash/<p>Both <code>mapfile</code> and <code>read -a</code> can be used to store the output of a command or a list of values into an array. However, the <code>mapfile</code> command is generally preferred when reading lines from a file, while <code>read -a</code> is well-suited for …</p><p>Both <code>mapfile</code> and <code>read -a</code> can be used to store the output of a command or a list of values into an array. However, the <code>mapfile</code> command is generally preferred when reading lines from a file, while <code>read -a</code> is well-suited for reading space-separated values from a string.</p>
<p>Let's assume that we want to store all top-level directories located in the projects folder, i.e. keep each project directory name as an array element.</p>
<div class="highlight"><pre><span></span><code><span class="nv">projects</span><span class="o">=(</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">"</span>/projects/*<span class="o">)</span>
<span class="c1"># Using 'find' command with '-print0' to handle directory names with special characters</span>
<span class="k">while</span><span class="w"> </span><span class="nv">IFS</span><span class="o">=</span><span class="w"> </span><span class="nb">read</span><span class="w"> </span>-r<span class="w"> </span>-d<span class="w"> </span><span class="s1">$'\0'</span><span class="w"> </span>line<span class="p">;</span><span class="w"> </span><span class="k">do</span>
<span class="w"> </span><span class="nv">projects</span><span class="o">+=(</span><span class="s2">"</span><span class="nv">$line</span><span class="s2">"</span><span class="o">)</span>
<span class="k">done</span><span class="w"> </span><<span class="w"> </span><<span class="o">(</span>find<span class="w"> </span><span class="s2">"</span><span class="si">${</span><span class="nv">projects</span><span class="p">[@]</span><span class="si">}</span><span class="s2">"</span><span class="w"> </span>-maxdepth<span class="w"> </span><span class="m">0</span><span class="w"> </span>-type<span class="w"> </span>d<span class="w"> </span>-print0<span class="o">)</span>
</code></pre></div>
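<p>As mentioned in the introduction, <code>mapfile</code> is often the cleaner tool for reading lines into an array. A sketch of the same task using <code>mapfile</code> (this assumes bash 4.4+ for the <code>-d</code> option; <code>PROJECTS_DIR</code> is a hypothetical override added so the snippet is self-contained):</p>

```shell
# Read null-delimited directory names from find straight into an array
base="${PROJECTS_DIR:-$HOME/projects}"
mapfile -d '' -t projects < <(find "$base" -mindepth 1 -maxdepth 1 -type d -print0)
printf 'found %d project(s)\n' "${#projects[@]}"
```

<p>The <code>-t</code> flag strips the trailing delimiter from each entry, so the array elements are clean directory paths.</p>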
<p>In the provided code, the <code>read</code> command is used together with some parameters. Here is a brief explanation:</p>
<ul>
<li>
<p><code>-a</code> : This option is used when we want <code>read</code> to split its input into words and store them in an array. Note that the snippet above does not use <code>-a</code>; instead, each line read in the loop is appended to the array manually.</p>
</li>
<li>
<p><code>-r</code> : This option prevents backslash escapes from being interpreted. It helps you to read the strings "as is".</p>
</li>
<li>
<p><code>-d $'\0'</code> : This tells <code>read</code> to continue until it encounters a null byte (<code>\0</code>), which is the delimiter used by <code>find . -print0</code>.</p>
</li>
</ul>
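<p>To see these flags in isolation, here is a minimal self-contained sketch (the sample strings are made up for the demonstration):</p>

```shell
# -a: split a whitespace-separated line into array elements
read -r -a words <<< "alpha beta gamma"
echo "${#words[@]} ${words[1]}"   # 3 beta

# -d '': read up to the first null byte (as produced by find -print0)
IFS= read -r -d '' first < <(printf 'one\0two\0')
echo "$first"   # one
```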
<p>So <code>read -r -d $'\0' line</code> reads input separated by null characters into the variable <code>line</code>. This is done inside a <code>while</code> loop, which repeats the read for each directory returned by <code>find</code>; each one is then appended to the <code>projects</code> array.</p>
<p>The while loop structure <code>while IFS= read -r -d $'\0' line; do</code> is commonly used in shell scripting to read lines from a file (or in this case, results from a command substitution) in a safe manner that preserves whitespace and special characters.</p>
<p><code>IFS=</code> is used to temporarily clear the Internal Field Separator variable, which is used by <code>read</code> to split the input line into separate fields. By clearing it, we ensure that <code>read</code> treats each line as a whole, even if it includes spaces.</p>
<p>In this script, the <code>find</code> command is used with the <code>-print0</code> option to output names using a null character as a delimiter, which helps in dealing with directory names that include spaces or other special characters. The <code>-maxdepth 0</code> option ensures that only the directories themselves (not their subdirectories) are listed. The <code>-type d</code> filter ensures that only directories are returned.</p>
<p>The <code>while</code> loop with <code>IFS= read -r -d $'\0'</code> handles the null-delimited output from <code>find</code>. Within the loop, each line is appended to the <code>projects</code> array, so when the loop finishes the array holds one entry per directory.</p>The Importance of Adding a `py.typed` File to Your Typed Package2023-11-13T00:00:00+01:002023-11-13T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-13:/the-importance-of-adding-py-typed-file-to-your-typed-package/<p>In Python programming, type checking is an important practice that helps ensure the correctness of your code. The <code>mypy</code> type checker is a powerful tool that uses type annotations to verify your code. However, it might not recognize the type …</p><p>In Python programming, type checking is an important practice that helps ensure the correctness of your code. The <code>mypy</code> type checker is a powerful tool that uses type annotations to verify your code. However, it will not recognize the type hints provided by your package unless you include a <code>py.typed</code> file. This is a common oversight that can lead to incorrectly published packages.</p>
<h2>Understanding <code>py.typed</code></h2>
<p><strong>The <code>py.typed</code> file is a marker file that indicates to type checkers like <code>mypy</code> that your package comes with type annotations.</strong> Without this file, the type checker won't use the type hints provided by your package, leading to potential type errors. This requirement is outlined in <a href="https://www.python.org/dev/peps/pep-0561/#packaging-type-information">PEP-561</a> and the <a href="https://mypy.readthedocs.io/en/stable/installed_packages.html#making-pep-561-compatible-packages">mypy documentation</a>.</p>
<h2>Adding <code>py.typed</code> to Your Package</h2>
<p>Adding a <code>py.typed</code> file to your package is straightforward. Simply create an empty file named <code>py.typed</code> in your package directory and include it in your distribution.</p>
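<p>Assuming your package lives in a directory named <code>mypackage</code> (a placeholder name), creating the marker is a one-liner:</p>

```shell
# "mypackage" is a placeholder for your actual package directory.
mkdir -p mypackage          # the directory already exists in a real project
touch mypackage/py.typed    # the empty marker file
```
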
<p>If you're using <a href="https://python-poetry.org/">poetry</a>, you can add the following lines under the <code>[tool.poetry]</code> section of <code>pyproject.toml</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">packages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="n">include</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s2">"mypackage"</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="n">include</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s2">"mypackage/py.typed"</span><span class="p">},</span>
<span class="p">]</span>
</code></pre></div>
<p>For those using <code>setup.py</code>, you can add <code>package_data</code> to the <code>setup</code> call:</p>
<div class="highlight"><pre><span></span><code><span class="n">setup</span><span class="p">(</span>
<span class="n">package_data</span><span class="o">=</span><span class="p">{</span><span class="s2">"mypackage"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"py.typed"</span><span class="p">]},</span>
<span class="p">)</span>
</code></pre></div>
<p>After adding the <code>py.typed</code> file, release <a href="https://github.com/whtsky/pixelmatch-py/commit/9c6297cedd10232ffbe23cc54a4e46e76d1fa13a">a new version of your package</a>. This ensures that the type information shipped with your package is picked up as expected.</p>
<h2>Conclusion</h2>
<p>If you're a Python package maintainer, don't forget to include a <code>py.typed</code> file in your typed package. This simple step can make a significant difference in ensuring the correctness of your code and the usability of your package. It's a small effort that goes a long way in maintaining the quality and reliability of your Python package.</p>
<p><strong>Credits</strong> to <a href="https://dev.to/whtsky">Wu Haotian</a> for the article <a href="https://dev.to/whtsky/don-t-forget-py-typed-for-your-typed-python-package-2aa3">Don't forget <code>py.typed</code> for your typed Python package - DEV Community</a>, from which I learned about this mechanism.</p>In the Python project made with Poetry shall I add poetry.lock to the git repo or ignore it?2023-11-12T00:00:00+01:002023-11-12T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-12:/python-project-with-Poetry-add-poetry-lock-to-the-git-repo-or-ignore-it/
<p>In a Python project managed with Poetry, you should definitely add the <code>poetry.lock</code> file to your Git repository. The <code>poetry.lock</code> file ensures that all project dependencies are specified with fixed versions, providing deterministic builds across different environments.</p>
<p>By …</p>
<p>In a Python project managed with Poetry, you should definitely add the <code>poetry.lock</code> file to your Git repository. The <code>poetry.lock</code> file ensures that all project dependencies are specified with fixed versions, providing deterministic builds across different environments.</p>
<p>By including the <code>poetry.lock</code> file in your repository, you ensure that anyone cloning or checking out your project will have the exact same versions of the dependencies installed. This guarantees that they will have a consistent development environment and can reproduce the same build and execution results.</p>
<p>Including the <code>poetry.lock</code> file also serves as documentation for the specific versions of the dependencies used in your project. This information can be helpful for troubleshooting and debugging purposes.</p>
<p>When working with Poetry, you can also add the <code>pyproject.toml</code> file to your Git repository. This file contains the project metadata and the dependencies specified in a readable format, giving a high-level overview of your project's requirements.</p>
<p>Including both the <code>poetry.lock</code> and <code>pyproject.toml</code> files ensures that others can easily set up and work with your project while maintaining consistency across different development environments.</p>Git change remote origin (replace with new)2023-11-11T00:00:00+01:002023-11-11T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-11:/Git-change-remote-origin-replace-with-new/<h2>Git - Replace remote origin</h2>
<p>To change the remote origin in Git and replace it with a new one, you can use the following steps:</p>
<p><strong>Verify the existing remote origin</strong></p>
<p>Check the current remote URL for the origin repository by running the command …</p><h2>Git - Replace remote origin</h2>
<p>To change the remote origin in Git and replace it with a new one, you can use the following steps:</p>
<p><strong>Verify the existing remote origin</strong></p>
<p>Check the current remote URL for the origin repository by running the command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>remote<span class="w"> </span>-v<span class="w"> </span>
</code></pre></div>
<p>This command will display the fetch and push URLs for all the remotes.</p>
<p><strong>Remove the existing remote origin</strong></p>
<p>In order to replace the remote origin, you need to remove the current one. Run the command:</p>
<p><code>git remote remove origin</code>.</p>
<p>This will remove the old origin from your local Git repository.</p>
<p><strong>Add the new remote origin</strong></p>
<p>Once you have removed the existing remote origin, you can add the new one by running the command: <code>git remote add origin <new_remote_url></code>. Replace <code><new_remote_url></code> with the URL of the new remote repository you want to set as the origin.</p>
<p><strong>Verify the changes</strong>
You can ensure that the new remote origin is set correctly by running</p>
<p><code>git remote -v</code></p>
<p><strong>Push the branch to the new origin</strong></p>
<p>Finally, you can push your branch to the new remote origin using:</p>
<p><code>git push -u origin <branch_name></code>.</p>
<p>Replace <code><branch_name></code> with the name of the branch you want to push.</p>
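<p>Put together, the whole sequence looks like this. The snippet below is a self-contained sketch that uses a local bare repository as a stand-in for the new remote URL; in practice you would use your hosting provider's HTTPS or SSH URL:</p>

```shell
#!/usr/bin/env bash
set -e
# Demo setup: a throwaway local repo, plus a bare repo standing in for
# the new remote URL (hypothetical; normally an HTTPS/SSH URL).
new_remote=$(mktemp -d)/new.git
git init --bare -q "$new_remote"
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit --allow-empty -q -m "initial commit"
git remote add origin https://old.example/repo.git  # the outdated origin

git remote -v                        # 1. verify the existing origin
git remote remove origin             # 2. remove it
git remote add origin "$new_remote"  # 3. add the new origin
git remote -v                        # 4. verify the change
git push -q -u origin HEAD           # 5. push the current branch and set upstream
```

<p>Alternatively, <code>git remote set-url origin <new_remote_url></code> replaces the URL in a single step, without removing and re-adding the remote.</p>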
<h2>When you might need to perform this operation</h2>
<p>There are several situations where you might want to change the remote origin (replace it with a new one) in Git. Some common examples include:</p>
<ol>
<li>
<p>Changing the repository hosting provider: If you are migrating your codebase from one hosting provider to another (e.g., from GitHub to GitLab), you would need to update the remote origin URL to point to the new provider.</p>
</li>
<li>
<p>Moving the repository from a personal account to an organization account: If you initially created a repository under your personal account and later decide to move it to an organization account, you would change the remote origin to point to the new organization repository.</p>
</li>
<li>
<p>Renaming the repository: If you decide to change the name of your repository, you may want to update the remote origin URL to reflect the new name.</p>
</li>
<li>
<p>Collaborating with multiple repositories: In some cases, you might want to work with multiple remote repositories, perhaps to collaborate with different teams or maintain several mirrored repositories. Changing the remote origin allows you to switch between these repositories easily.</p>
</li>
<li>
<p>Fixing an incorrect or outdated remote origin: If you accidentally set the wrong remote origin URL or if the previous URL has become outdated, you can change it to point to the correct one.</p>
</li>
</ol>
<p>Remember, changing the remote origin should be done with caution, especially in collaborative environments, as it affects the repository's remote connections. Make sure to communicate the changes to your team and consider any implications before making the switch.</p>SPLADE sparse vectors - explanation, properties2023-11-10T00:00:00+01:002023-12-08T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-10:/splade-sparse-vectors/
<h2>TL; DR</h2>
<p>SPLADE is a neural retrieval model which learns query/document <strong>sparse</strong> expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use …</p>
<h2>TL; DR</h2>
<p>SPLADE is a neural retrieval model which learns query/document <strong>sparse</strong> expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use of inverted index, explicit lexical match, interpretability... They also seem to be better at generalizing on out-of-domain data (BEIR benchmark).</p>
<h2>Intro</h2>
<p>I learned about SPLADE from the article <a href="https://www.pinecone.io/learn/splade/">SPLADE for Sparse Vector Search Explained | Pinecone</a>. Below are the key concepts from that article (an LLM-generated summary).</p>
<p>The article discusses the evolution of search and recommendation systems, focusing on the shift from traditional "bag of words" methods to modern vector search. It explains how big tech companies like Google, Netflix, and Amazon use vector search to power their systems.</p>
<p>The traditional <strong>bag of words</strong> methods transform documents into a set of words, populating a sparse "frequency vector". While these methods are <strong>efficient and interpretable</strong>, they are <strong>not perfect</strong> due to their <strong>reliance on exact term matching,</strong> which doesn't align with human nature.</p>
<p><strong>Dense embedding</strong> models offer a solution by allowing search based on <strong>semantic meaning</strong>. However, they require <strong>vast amounts of data for fine-tuning</strong> and don't perform well in niche domains where data is scarce.</p>
<p>The article introduces a solution to these problems: <strong>a merger of sparse and dense retrieval through hybrid search and learnable sparse embeddings</strong>. It focuses on <strong>SPLADE</strong> (Sparse Lexical and Expansion model), a <strong>model that uses a pretrained language model like BERT to enhance sparse vector embedding.</strong></p>
<h2>How it works</h2>
<p>The idea behind the <strong>Sp</strong>arse <strong>L</strong>exical <strong>a</strong>n<strong>d</strong> <strong>E</strong>xpansion models is that a pretrained language model like BERT can identify connections between words/sub-words (called <em>word-pieces</em> or “terms” in this article) and use that knowledge to enhance our sparse vector embedding.</p>
<p>This works in two ways. First, it allows us to weigh the relevance of different terms: a common term like <em>the</em> will carry less relevance than a less common word like <em>orangutan</em>. Second, it enables <em>term expansion</em>: the inclusion of alternative but relevant terms beyond those found in the original sequence.</p>
<p><img alt="Term expansion allows us to identify relevant but different terms and use them in the sparse vector retrieval step." src="https://cdn.sanity.io/images/vr8gru94/production/17f0aac1f34b4475121744b672156a611dd8aed6-1029x331.png"></p>
<p>Term expansion allows us to identify relevant but different terms and use them in the sparse vector retrieval step.</p>
<p>The most significant advantage of SPLADE is not necessarily that it can <em>do</em> term expansion but instead that it can <em>learn</em> term expansions. Traditional methods required rule-based term expansion which is time-consuming <em>and</em> fundamentally limited. Whereas SPLADE can use the best language models to learn term expansions and even tweak them based on the sentence context.</p>
<p>The article also discusses the pros and cons of sparse and dense vectors, the concept of two-stage retrieval, and the drawbacks of this approach. It then delves into the workings of SPLADE, explaining how it builds sparse embeddings and how it can be implemented using Hugging Face and PyTorch or the official SPLADE library.</p>
<p>The article concludes by acknowledging the <strong>limitations of SPLADE</strong>, such as its <strong>slower retrieval speed compared to other sparse methods</strong>, and suggests solutions to these problems. It also highlights the potential of mixing both dense and sparse representations using hybrid search indexes to make vector search more accurate and accessible.</p>
<p>X::<a href="https://www.safjan.com/tfidf-with-examples/">TF-IDF with examples</a></p>
<h2>References</h2>
<ul>
<li><a href="https://github.com/naver/splade">GitHub - naver/splade: SPLADE: sparse neural search (SIGIR21, SIGIR22)</a></li>
</ul>
<p>[1] T. Formal, B. Piwowarski, S. Clinchant, <a href="https://arxiv.org/abs/2107.05720">SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking</a> (2021), SIGIR 21</p>
<p>[2] T. Formal, C. Lassance, B. Piwowarski, S. Clinchant, <a href="https://export.arxiv.org/abs/2109.10086">SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval</a> (2021)</p>
<ul>
<li>https://www.linkedin.com/posts/prithivirajdamodaran_%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F-activity-7164581754270400512-Aa87?utm_source=share&utm_medium=member_desktop</li>
</ul>TF-IDF with examples2023-11-10T00:00:00+01:002023-11-10T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-10:/tfidf-with-examples/<p>See also: <a href="https://www.safjan.com/splade-sparse-vectors/">SPLADE sparse vectors - explanation, properties</a></p>
<p><strong>TF-IDF</strong> stands for <strong>Term Frequency-Inverse Document Frequency</strong>. It's a numerical statistic used to reflect how important a word is to a document in a collection or corpus. It's often used in information retrieval and text mining …</p><p>See also: <a href="https://www.safjan.com/splade-sparse-vectors/">SPLADE sparse vectors - explanation, properties</a></p>
<p><strong>TF-IDF</strong> stands for <strong>Term Frequency-Inverse Document Frequency</strong>. It's a numerical statistic used to reflect how important a word is to a document in a collection or corpus. It's often used in information retrieval and text mining.</p>
<p>TF-IDF is composed of two parts:</p>
<ol>
<li>
<p><strong>Term Frequency (TF)</strong>: This measures the frequency of a word in a document. It's the ratio of the number of times a word appears in a document compared to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own term frequency.</p>
</li>
<li>
<p><strong>Inverse Document Frequency (IDF)</strong>: This measures the importance of the word in the entire corpus. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low. Thus, words that occur frequently across many documents will have a lower IDF, and rare words will have a high IDF.</p>
</li>
</ol>
<p>The TF-IDF value is calculated by multiplying these two metrics: TF and IDF.</p>
<h2>Minimal example</h2>
<h3>High TF-IDF</h3>
<p>Consider a document containing 100 words wherein the word 'cat' appears 3 times.</p>
<p>The term frequency (TF) for 'cat' is then (3 / 100) = 0.03.</p>
<p>Now, assume we have 10 million documents and the word 'cat' appears in one thousand of these. Then, the inverse document frequency (IDF) is calculated as log(10,000,000 / 1,000) = 4.</p>
<p>So, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.</p>
<h3>Low TF-IDF</h3>
<p>Now, let's consider a common word like 'the'. Assume it appears 20 times in a document of 100 words. So, TF for 'the' is (20/100) = 0.2.</p>
<p>Assume 'the' appears in 1 million out of 10 million documents. So, IDF for 'the' is log(10,000,000 / 1,000,000) = 1.</p>
<p>The TF-IDF weight for 'the' is 0.2 * 1 = 0.2.</p>
<p>Even though 'the' appeared more times than 'cat' in the document, the TF-IDF weight for 'cat' is higher than 'the'. This is because IDF gives a higher weight to words that are less frequent in the corpus, making 'cat' more important than 'the' in the context of our corpus.</p>
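<p>Both worked examples can be verified with a few lines of Python (using the base-10 logarithm, which is what the numbers above assume):</p>

```python
import math

def tf(count: int, total_terms: int) -> float:
    """Term frequency: occurrences of the term over total terms in the document."""
    return count / total_terms

def idf(n_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency, base-10 log as in the examples above."""
    return math.log10(n_docs / docs_with_term)

def tfidf(count, total_terms, n_docs, docs_with_term):
    return tf(count, total_terms) * idf(n_docs, docs_with_term)

# 'cat': 3 occurrences in a 100-word document, appears in 1,000 of 10M docs
print(round(tfidf(3, 100, 10_000_000, 1_000), 4))       # 0.12
# 'the': 20 occurrences in the same document, appears in 1M of 10M docs
print(round(tfidf(20, 100, 10_000_000, 1_000_000), 4))  # 0.2
```
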
<h2>The formula</h2>
<p>Written out formally, the components are defined as follows.</p>
<p>The term frequency <span class="math">\(TF\)</span> is calculated as:</p>
<div class="math">$$
TF(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(f_{t, d}\)</span> is the frequency of term <span class="math">\(t\)</span> in document <span class="math">\(d\)</span></li>
<li>The denominator is the sum of frequencies of all terms in document <span class="math">\(d\)</span></li>
</ul>
<p>The inverse document frequency <span class="math">\(IDF\)</span> is calculated as:</p>
<div class="math">$$
IDF(t, D) = \log \frac{|D|}{|\{d \in D: t \in d\}|}
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(|D|\)</span> is the total number of documents in the corpus</li>
<li>The denominator is the number of documents where the term <span class="math">\(t\)</span> appears</li>
</ul>
<p>Finally, the TF-IDF is calculated as:</p>
<div class="math">$$
TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(t\)</span> is the term</li>
<li><span class="math">\(d\)</span> is the document</li>
<li><span class="math">\(D\)</span> is the corpus (set of all documents)</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Growth Hacking Methodology2023-11-07T00:00:00+01:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-07:/growth-hacking-methodology/<p>Growth Hacking is a marketing strategy primarily used by startups and small businesses, which focuses on rapid growth within a short time frame. It involves experimenting with and implementing creative, low-cost strategies to acquire and retain customers.</p>
<p>Here are some key points …</p><p>Growth Hacking is a marketing strategy primarily used by startups and small businesses, which focuses on rapid growth within a short time frame. It involves experimenting with and implementing creative, low-cost strategies to acquire and retain customers.</p>
<p>Here are some key points about Growth Hacking:</p>
<ol>
<li>
<p><strong>Experimentation</strong>: Growth hacking involves constant experimentation across various channels and product development paths to identify the most effective ways to grow a business.</p>
</li>
<li>
<p><strong>Creativity</strong>: Growth hackers often use unconventional marketing strategies to get maximum growth. This could include viral marketing, social media, targeted advertising, SEO, email marketing, and more.</p>
</li>
<li>
<p><strong>Data-Driven</strong>: Growth hacking is heavily reliant on data analysis. Growth hackers track and analyze user data to understand behavior, test hypotheses, and make informed decisions.</p>
</li>
<li>
<p><strong>Agility</strong>: Growth hacking requires agility and adaptability. Growth hackers must be willing to pivot quickly, change strategies, and try new things based on what the data is telling them.</p>
</li>
<li>
<p><strong>Product Development</strong>: Growth hacking isn't just about marketing. It often involves tweaking the product itself to make it more appealing or to encourage users to spread the word about it.</p>
</li>
<li>
<p><strong>Customer Retention</strong>: While much of growth hacking focuses on customer acquisition, it's also about customer retention. Growth hackers look for ways to increase customer loyalty and encourage repeat business.</p>
</li>
<li>
<p><strong>Viral Loops</strong>: Growth hackers often aim to create viral loops, where existing users naturally attract new users, creating a self-perpetuating cycle of growth.</p>
</li>
</ol>
<p>An example of a successful growth hack is Dropbox's referral program. They offered extra storage space to users who referred their friends, which led to a significant increase in user sign-ups. This is a classic example of a growth hack – a simple, cost-effective solution that led to substantial growth.</p>
<h2>References</h2>
<ol>
<li>
<p>Book: "Growth Hacker Marketing: A Primer on the Future of PR, Marketing, and Advertising" by Ryan Holiday. This book is a good starting point for understanding the concept of growth hacking.</p>
</li>
<li>
<p>Book: "Hacking Growth: How Today's Fastest-Growing Companies Drive Breakout Success" by Sean Ellis and Morgan Brown. Sean Ellis is the person who coined the term "growth hacking," and this book provides a deep dive into the methodology.</p>
</li>
<li>
<p><a href="https://growthhackers.com/">GrowthHackers</a> - An online community where growth hackers share case studies, articles, and resources.</p>
</li>
<li>
<p><a href="https://www.quicksprout.com/growth-hacking/">The Growth Hacking Starter Guide - Real Examples</a> - An online guide that provides a comprehensive overview of growth hacking.</p>
</li>
<li>
<p><a href="https://neilpatel.com/what-is-growth-hacking/">Growth Hacking Made Simple: Definition</a> by Neil Patel</p>
</li>
<li>
<p><a href="https://growthrocks.com/blog/growth-hacking-books/">Top 17 Growth Hacking Books to Read in 2022</a></p>
</li>
</ol>
<p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a>
X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a></p>Product Led Growth2023-11-07T00:00:00+01:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-07:/product-led-growth/<p>Product Led Growth (PLG) is a business methodology in which the product itself serves as the primary driver of customer acquisition, conversion, and expansion. It's a model that prioritizes product usage as the key growth driver, rather than traditional marketing or sales …</p><p>Product Led Growth (PLG) is a business methodology in which the product itself serves as the primary driver of customer acquisition, conversion, and expansion. It's a model that prioritizes product usage as the key growth driver, rather than traditional marketing or sales efforts.</p>
<p>Here are some key points about Product Led Growth:</p>
<ol>
<li>
<p><strong>User-Centric</strong>: PLG focuses on the user experience. The product is designed to be so user-friendly and intuitive that it sells itself. The aim is to create a product that users love and can't live without.</p>
</li>
<li>
<p><strong>Viral Growth</strong>: PLG often relies on viral growth. This means that current users recommend the product to others, creating a network effect. This can be facilitated by incorporating features that naturally encourage sharing or collaboration.</p>
</li>
<li>
<p><strong>Freemium or Free Trial Models</strong>: Many PLG companies offer a freemium model or free trial to attract users. This allows users to try the product and see its value before deciding to pay for premium features.</p>
</li>
<li>
<p><strong>Self-Service</strong>: PLG products are typically self-service, meaning users can sign up, use, and even upgrade the product without needing to interact with a sales team.</p>
</li>
<li>
<p><strong>Data-Driven</strong>: PLG companies use data to understand user behavior, identify opportunities for improvement, and make informed decisions. They often use metrics like daily active users (DAU), monthly active users (MAU), and net promoter score (NPS) to measure success.</p>
</li>
<li>
<p><strong>Customer Success Focus</strong>: In a PLG model, customer success is crucial. Companies need to ensure users are getting maximum value from the product, which often involves providing educational resources, responsive support, and regular product updates.</p>
</li>
</ol>
<p>Examples of successful PLG companies include Slack, Dropbox, and Zoom. These companies have created products that users love, leading to rapid, organic growth.</p>
<p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a></p>
<p>X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a></p>
<p>X::<a href="https://www.safjan.com/design-thinking/">Design Thinking</a></p>
<h2>References</h2>
<ul>
<li><a href="https://www.productled.org/foundations/what-is-product-led-growth">What is product-led growth?</a></li>
<li><a href="https://www.appcues.com/blog/pirate-metric-saas-growth">Why activation is the most important pirate metric for SaaS growth | Appcues Blog</a></li>
</ul>RAG-Fusion - Enhancing Information Retrieval in Large Language Models2023-11-06T00:00:00+01:002023-11-06T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-06:/rag-fusion-enhancing-information-retrieval-in-large-language-models/<p>In the realm of Large Language Models (LLMs) such as ChatGPT, a new technique known as Retrieval Augmented Generation (RAG) is gaining prominence. This technique is designed to enhance a user's input by incorporating additional information from an external source. This supplementary …</p><p>In the realm of Large Language Models (LLMs) such as ChatGPT, a new technique known as Retrieval Augmented Generation (RAG) is gaining prominence. This technique is designed to enhance a user's input by incorporating additional information from an external source. This supplementary data is then leveraged by the LLM to enrich the response it generates. In this blog post, we will delve deeper into the core concept of RAG-fusion, which revolves around multiple query generation and re-ranking of results. For other methods that can improve RAG performance see my other <a href="https://www.safjan.com/techniques-to-boost-rag-performance-in-production/">Techniques to Boost RAG Performance in Production</a>.</p>
<h2>What is RAG-fusion?</h2>
<p><strong>The principle behind RAG-fusion is to generate multiple versions of the user's original query using a LLM, and then re-rank the results to select the most relevant retrieved parts.</strong></p>
<blockquote>
<p>NOTE: The term RAG in the name of the technique might be a bit misleading, since "RAG-fusion" concerns only the first part of RAG: the retrieval process.</p>
</blockquote>
<p>How does it work? For instance, the prompt template for this task might look something like this: "Generate multiple search queries related to: {original_query}", where <code>{original_query}</code> is a placeholder for the user's original query. This step enables the model to explore different perspectives and interpretations of the original query, thereby broadening the range of potential responses.</p>
<h2>Re-ranking: A Crucial Step</h2>
<p>The next vital step in the RAG-fusion process is re-ranking. This step is critical in determining the most pertinent answers to the user's query. The re-ranking process, often referred to as Reciprocal Rank Fusion (RRF), involves collecting ranked search outcomes from multiple strategies.</p>
<p>Each document is assigned a reciprocal rank score. These scores are then merged to create a new ranking. The underlying principle here is that documents that consistently appear in top positions across diverse search strategies are likely more pertinent and should, therefore, receive a higher rank in the consolidated result.</p>
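<p>The scoring just described can be sketched in a few lines of Python. The constant <code>k=60</code> is the value commonly used for RRF; the document IDs and rankings below are made up for illustration:</p>

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears
    in, so documents ranked high across many lists score best.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Three retrieval runs (e.g. three variants of the user query) over docs A-D
runs = [["A", "B", "C", "D"],
        ["B", "A", "D", "C"],
        ["A", "C", "B", "D"]]
print(reciprocal_rank_fusion(runs))  # ['A', 'B', 'C', 'D']
```

<p>Document A tops the fused ranking because it is ranked first in two of the three runs, which matches the intuition that consistency across strategies signals relevance.</p>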
<p><img alt="RAG Fusion" src="https://miro.medium.com/v2/resize:fit:1400/1*tDALPmWxwAPf7UADpZwjWQ@2x.jpeg">
Figure 1. RAG-fusion process flow for ranking four documents A, B, C, D against three retrieval sources (e.g., three variants of the original user query). Source of image: <a href="https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1">Forget RAG, the Future is RAG-Fusion article by Adrian H. Raudaschl</a></p>
<h2>Why RAG-fusion Matters?</h2>
<p>RAG-fusion boosts the LLM's ability to generate accurate, contextually relevant responses. By considering multiple interpretations of the original query and re-ranking the results, it ensures that the model's output is aligned as closely as possible with the user's intent.</p>
<p>RAG-fusion is a powerful technique that brings together the strengths of large language models and advanced information retrieval strategies. By employing multiple query generation and re-ranking, it takes a leap towards making AI-powered systems more responsive, accurate, and user-friendly.</p>
<p><strong>NOTE 1:</strong> For more methods that can improve RAG performance, see my other article <a href="https://www.safjan.com/techniques-to-boost-rag-performance-in-production/">Techniques to Boost RAG Performance in Production</a>.
<strong>NOTE 2:</strong> This technique is also referred to as query rewriting. You can find a section on it in the LlamaIndex documentation (<a href="https://docs.llamaindex.ai/en/stable/examples/query_transformations/query_transform_cookbook.html">Query Transformation Cookbook</a>).</p>
<p>See also: <a href="https://www.safjan.com/understanding-retrieval-augmented-generation-rag-empowering-llms/">Understanding Retrieval-Augmented Generation (RAG) empowering LLMs</a></p>
<p><strong>Edits:</strong>
2024-02-01 - added reference to the LlamaIndex Query Transform Cookbook</p>
<h2>References</h2>
<ul>
<li><a href="https://github.com/Raudaschl/rag-fusion/tree/master">GitHub - Raudaschl/rag-fusion</a> - exemplary implementation</li>
<li><a href="https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1">Forget RAG, the Future is RAG-Fusion | by Adrian H. Raudaschl | Oct, 2023 | Towards Data Science</a></li>
<li>RAG-fusion in LangChain: <a href="https://python.langchain.com/docs/templates/rag-fusion">usage</a>, template <a href="https://github.com/langchain-ai/langchain/tree/master/templates/rag-fusion">code</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/query_transformations/query_transform_cookbook.html">Query Transformation Cookbook</a></li>
</ul>What Is the Key Difference Between PCA and SVD?2023-11-06T00:00:00+01:002023-11-06T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-06:/what-is-the-key-difference-between-pca-and-svd/<p>Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two matrix factorization methods used in machine learning and data analysis for dimensionality reduction. Though they are used for similar purposes, there are some key differences between the two. The key difference …</p><p>Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two matrix factorization methods used in machine learning and data analysis for dimensionality reduction. Though they are used for similar purposes, there are some key differences between the two. The key difference between Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) lies in their respective applications and the matrices they operate on.</p>
<h2>Dealing with the data</h2>
<ul>
<li><strong>PCA</strong> primarily deals with the covariance structure of the data. It's a statistical procedure that transforms the coordinates of a dataset into a new coordinate system. In the new system, the first axis corresponds to the first principal component that accounts for the maximum variance in the data. The second axis, perpendicular to the first, aligns with the direction of the second largest variance, and so on. PCA effectively tries to find orthogonal axes (the principal components) along which the variance of the data is maximized.</li>
<li><strong>SVD</strong>, on the other hand, does not rely on a covariance matrix. It is a factorization of the original data matrix that decomposes it into three matrices. This can be done without computing the covariance matrix, and it even makes it possible to work with missing data.</li>
</ul>
<h2>Computations</h2>
<ul>
<li>
<p>Both PCA and SVD involve eigen-decomposition. For PCA, the eigen-decomposition is on the covariance matrix of the data which is a square symmetric matrix of size <code>d x d</code> (where <code>d</code> is the number of features). This could be an issue if <code>d</code> is large, since calculating the covariance matrix and performing subsequent eigen-decomposition could be computationally expensive.</p>
</li>
<li>
<p>In contrast, SVD performs the decomposition on the data matrix itself (of size <code>n x d</code> where <code>n</code> is the number of observations and <code>d</code> is the number of features), theoretically making the computation more efficient, especially when <code>d</code> is much larger than <code>n</code>.</p>
</li>
</ul>
<p>In summary, while the two techniques are related (PCA can actually be solved using SVD), they approach the problem of dimensionality reduction differently. PCA focuses on the covariance structure and tries to maximize variance along orthogonal axes, while SVD focuses on matrix factorization and can handle cases where data is missing. However, from an application perspective, they are generally used interchangeably.</p>
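<p>The relationship is easy to verify numerically: the eigenvalues of the covariance matrix equal the squared singular values of the centered data matrix divided by <code>n - 1</code>. A minimal check with NumPy (random data, purely illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 observations, d = 3 features
Xc = X - X.mean(axis=0)                # PCA requires centered data

# Route 1: eigen-decomposition of the d x d covariance matrix (classic PCA)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending order

# Route 2: SVD of the n x d data matrix itself (no covariance needed)
singular_values = np.linalg.svd(Xc, compute_uv=False)
variances = singular_values**2 / (len(Xc) - 1)

# Both routes yield identical principal-component variances
assert np.allclose(eigvals, variances)
```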
<p><strong>PCA is a specific application of SVD, primarily used for dimensionality reduction, while SVD is a more general matrix decomposition technique with broader applications in linear algebra and data analysis.</strong></p>Choosing technology for the LLM knowledge graph2023-11-05T00:00:00+01:002023-11-05T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-05:/choosing-technology-for-the-lmm-knowledge-graph/<p>There are several technologies that can be used to implement a knowledge graph, depending on the specific requirements of your project. Here are three commonly used technologies for implementing knowledge graphs:</p>
<ol>
<li>
<p><a href="https://en.wikipedia.org/wiki/Resource_Description_Framework"><strong>Resource Description Framework (RDF)</strong></a> is a widely adopted standard for …</p></li></ol><p>There are several technologies that can be used to implement a knowledge graph, depending on the specific requirements of your project. Here are three commonly used technologies for implementing knowledge graphs:</p>
<ol>
<li>
<p><a href="https://en.wikipedia.org/wiki/Resource_Description_Framework"><strong>Resource Description Framework (RDF)</strong></a> is a widely adopted standard for representing data in the form of triples (subject-predicate-object). It provides a flexible and extensible way to model graph data. RDF-based technologies like <a href="https://db-engines.com/en/article/RDF+Stores">RDF stores</a> or <a href="https://en.wikipedia.org/wiki/Triplestore">triplestores</a> (e.g., <a href="https://jena.apache.org/">Apache Jena</a>, <a href="https://virtuoso.openlinksw.com/">Virtuoso</a>, <a href="https://www.stardog.com/">Stardog</a>) are commonly used to store and query knowledge graphs.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Graph_database"><strong>Graph Databases</strong></a> are purpose-built to store, manage, and query graph data efficiently. These databases are optimized for traversing relationships between entities and provide fast graph-based queries. Examples of popular graph databases include <a href="https://neo4j.com/">Neo4j</a>, <a href="https://aws.amazon.com/neptune/">Amazon Neptune</a>, and <a href="https://janusgraph.org/">JanusGraph</a>.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Triplestore"><strong>Triplestores</strong></a> are specialized databases designed specifically for RDF data. They store and query data using the RDF data model. Triplestores like <a href="https://jena.apache.org/">Apache Jena</a>, <a href="https://virtuoso.openlinksw.com/">Virtuoso</a>, and <a href="https://www.allegrograph.com/">AllegroGraph</a> provide features for storing and querying large-scale RDF knowledge graphs effectively.</p>
</li>
</ol>
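<p>To make the triple model concrete, here is a deliberately naive in-memory store with wildcard matching, in the spirit of a triplestore query. This is illustrative only; real systems such as Apache Jena or Neo4j index and query far more efficiently.</p>

```python
class TinyTripleStore:
    """A toy subject-predicate-object store; None acts as a wildcard."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        # None behaves like a SPARQL variable: it matches anything.
        pattern = (subject, predicate, obj)
        return [
            triple for triple in self.triples
            if all(p is None or p == t for p, t in zip(pattern, triple))
        ]
```

<p>For example, after adding <code>("GPT-4", "is_a", "LLM")</code>, the call <code>store.query(predicate="is_a")</code> retrieves every "is_a" relationship in the graph.</p>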
<p>Implementing a knowledge graph using these technologies typically involves defining a schema or ontology that describes the entities, their properties, and the semantic relationships between them. The triples or statements representing the data are then stored and indexed by the chosen technology for efficient retrieval and querying.</p>Prompt Discovery2023-11-04T00:00:00+01:002023-11-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-04:/prompt-discovery/<p>Learn prompt discovery to uncover the most effective prompts and combinations thereof to achieve specific tasks, while also considering factors like response quality, model performance, and computational efficiency</p><p>Prompt discovery, in the context of large language models and prompt engineering, refers to the systematic process of identifying, optimizing, and fine-tuning prompts that elicit desired responses from the language model. It involves a blend of linguistic, computational, and experimental techniques to formulate prompts that yield accurate and contextually relevant outputs from the model.</p>
<blockquote>
<p>The goal of prompt discovery is to uncover the most effective prompts and combinations thereof to achieve specific tasks, while also considering factors like response quality, model performance, and computational efficiency.</p>
</blockquote>
<p>In highly technical terms, prompt discovery encompasses several complex problems and activities:</p>
<ol>
<li>
<p><strong>Prompt Formulation</strong>: This involves crafting prompts that are clear, unambiguous, and tailored to the desired task. Different phrasings and structures might lead to variations in model behavior, so prompt engineers need to experiment with syntax and semantics to achieve optimal results.</p>
</li>
<li>
<p><strong>Prompt Permutations</strong>: Researchers need to explore various permutations of prompts by altering wording, adding context, or using different query types. Systematically generating and testing different prompt variations is a crucial part of prompt discovery to identify which specific formulations generate the desired outputs.</p>
</li>
<li>
<p><strong>Fine-tuning Parameters</strong>: Discovering the ideal fine-tuning parameters for each prompt and model combination is a complex optimization problem. Researchers must experiment with factors like learning rates, batch sizes, and optimization algorithms to fine-tune the model for specific prompts.</p>
</li>
<li>
<p><strong>Benchmarking and Comparison</strong>: Comparing response quality across different prompt permutations, models, and settings is essential. This involves devising appropriate evaluation metrics to quantitatively assess the performance of the model in response to different prompts and making informed decisions based on these metrics.</p>
</li>
<li>
<p><strong>Generalization and Transfer Learning</strong>: Investigating the extent to which prompts can be generalized across tasks or domains is a challenging problem. Researchers need to explore how prompts can be adapted or transferred to different tasks without sacrificing performance.</p>
</li>
<li>
<p><strong>Exploration of Novel Prompts</strong>: As the field evolves, prompt engineers must continuously come up with innovative prompt formulations that push the boundaries of the model's capabilities. This might involve experimenting with new query structures, linguistic constructs, or contextual cues.</p>
</li>
</ol>
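<p>Step 2 (prompt permutations) lends itself to simple automation. A sketch that enumerates variants by crossing interchangeable instruction phrasings with context prefixes; all strings here are invented for illustration.</p>

```python
from itertools import product

def permute_prompts(task, instructions, contexts):
    """Cross every instruction phrasing with every context prefix."""
    return [
        f"{context}{instruction.format(task=task)}"
        for context, instruction in product(contexts, instructions)
    ]

variants = permute_prompts(
    "summarize this article",
    instructions=["Please {task}.", "Your job is to {task}."],
    contexts=["", "You are a helpful editor. "],
)
```

<p>Each variant can then be sent to the model and scored with the benchmarking metrics discussed in step 4.</p>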
<p><img alt="process" src="/images/prompt_discovery/prompt_discovery_process.png"></p>
<p><em><strong>Figure 1:</strong> Flowchart illustrating the steps in prompt discovery. Starting with prompt formulation, it progresses through prompt permutations, fine-tuning parameters, benchmarking and comparison, generalization and transfer learning, to the exploration of novel prompts.</em></p>
<p>For prompt discovery, a range of tools, both existing and potentially developed in the future, can be instrumental:</p>
<ol>
<li>
<p><strong>Automated Prompt Generation</strong>: AI-assisted tools that automatically generate prompt variations based on input specifications could expedite the discovery process.</p>
</li>
<li>
<p><strong>Prompt Optimization Algorithms</strong>: Advanced optimization algorithms tailored for prompt discovery, including genetic algorithms or reinforcement learning approaches, could efficiently explore the prompt space.</p>
</li>
<li>
<p><strong>Interactive Prompt Testing Environments</strong>: User-friendly interfaces that allow prompt engineers to interactively test and fine-tune prompts with real-time model feedback can facilitate rapid iteration.</p>
</li>
<li>
<p><strong>Prompt Benchmarking Platforms</strong>: Comprehensive platforms for benchmarking prompt performance across various tasks, models, and settings could aid in making informed prompt selection decisions.</p>
</li>
<li>
<p><strong>Semantic Analysis Tools</strong>: Tools that provide detailed semantic analysis of prompt-response pairs can help identify patterns and nuances in model behavior, guiding prompt formulation.</p>
</li>
<li>
<p><strong>Natural Language Understanding Frameworks</strong>: Advanced NLU frameworks that provide insights into model comprehension and reasoning processes can inform prompt design for better results.</p>
</li>
<li>
<p><strong>Transfer Learning Techniques</strong>: Techniques that enable efficient transfer of knowledge from one prompt to another could support prompt generalization across tasks.</p>
</li>
<li>
<p><strong>Continuous Model Monitoring</strong>: Real-time monitoring tools that track model performance in response to different prompts can aid in prompt discovery over time.</p>
</li>
</ol>
<p><img alt="mindmap" src="/images/prompt_discovery/prompt_discovery_mindmap.png"></p>
<p><em><strong>Figure 2:</strong> Mindmap illustrating the key aspects of prompt discovery. It includes formulation, permutations, fine-tuning, benchmarking, generalization, novel prompts, and the different tools involved in the process.</em></p>
<p>In summary, prompt discovery is a process that involves intricate prompt formulation, thorough benchmarking, fine-tuning, and adaptation. The tools mentioned above, along with future advancements, will play a vital role in shaping the efficiency and effectiveness of prompt discovery efforts.</p>Techniques to Boost RAG Performance in Production2023-11-01T00:00:00+01:002023-11-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-01:/techniques-to-boost-rag-performance-in-production/<p>This article discusses several advanced techniques that can be applied at different stages of the RAG pipeline to enhance its performance in a production setting.</p><p>Retrieval-Augmented Generation (RAG) is a powerful tool in the domain of machine learning, offering significant potential for improving the quality of text generation in various applications. However, optimizing its performance can be a challenging task. For the introductory text on RAG see my other <a href="https://safjan.com/understanding-retrieval-augmented-generation-rag-empowering-llms/">article</a>. This article discusses several advanced techniques that can be applied at different stages of the RAG pipeline to enhance its performance in a production setting.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#leveraging-hybrid-search">Leveraging Hybrid Search</a></li>
<li><a href="#utilizing-summaries-for-data-chunks">Utilizing Summaries for Data Chunks</a></li>
<li><a href="#applying-query-transformations">Applying Query Transformations</a></li>
<li><a href="#query-compression">Query Compression</a></li>
<li><a href="#optimal-chunking-strategy">Optimal Chunking Strategy</a></li>
<li><a href="#fine-tuning-embedding-models">Fine-tuning Embedding Models</a></li>
<li><a href="#enriching-metadata">Enriching Metadata</a></li>
<li><a href="#employing-re-ranking">Employing Re-ranking</a></li>
<li><a href="#addressing-the-lost-in-the-middle-problem">Addressing the 'Lost in the Middle' Problem</a></li>
<li><a href="#meta-data-filtering">Meta-data Filtering</a></li>
<li><a href="#query-routing">Query Routing</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="leveraging-hybrid-search"></a></p>
<h2>Leveraging Hybrid Search</h2>
<p>Hybrid search, a fusion of semantic search and keyword search, can be employed to retrieve pertinent data from a vector store. This method often yields superior results across a range of use cases. It essentially combines the strength of keyword search (precision) and semantic search (recall), providing a more comprehensive search solution.</p>
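<p>In its simplest form, hybrid search is a convex combination of two pre-normalized score lists. The weighting parameter <code>alpha</code> below is a tunable assumption; production systems typically use rank-based fusion or a vector database's built-in hybrid mode instead.</p>

```python
def hybrid_rank(doc_ids, keyword_scores, semantic_scores, alpha=0.5):
    """Rank documents by alpha * semantic + (1 - alpha) * keyword.

    Both score dicts are assumed to be normalized to [0, 1].
    """
    fused = {
        d: alpha * semantic_scores[d] + (1 - alpha) * keyword_scores[d]
        for d in doc_ids
    }
    return sorted(doc_ids, key=fused.get, reverse=True)
```

<p>Raising <code>alpha</code> favors semantic recall; lowering it favors keyword precision.</p>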
<p><a id="utilizing-summaries-for-data-chunks"></a></p>
<h2>Utilizing Summaries for Data Chunks</h2>
<p>An efficient way to enhance the quality of generation and reduce the number of tokens in the input is by summarizing the chunks of data and storing these summaries in the vector store. This technique is especially useful when dealing with data that includes numerous filler words. By summarizing the chunks, we can eliminate these superfluous elements, thereby refining the quality of the input data.
</p>
<p><a id="applying-query-transformations"></a></p>
<h2>Applying Query Transformations</h2>
<p>Query transformations can significantly enhance the quality of responses. For instance, if a system does not find relevant context for a query, the LLM can rephrase the query and try again. See the <a href="https://www.safjan.com/rag-fusion-enhancing-information-retrieval-in-large-language-models/">RAG-Fusion - Enhancing Information Retrieval in Large Language Models</a>.</p>
<p>Similarly, the <a href="http://boston.lti.cs.cmu.edu/luyug/HyDE/HyDE.pdf">HyDE</a> strategy generates a hypothetical response to a query and uses both for embedding lookup, which has been found to dramatically enhance performance.</p>
<p>Another technique involves breaking down complex queries into sub-queries, a process that LLMs tend to handle better. This approach can be integrated into the RAG system to decompose a query into multiple simpler questions.</p>
<p><a id="query-compression"></a></p>
<h2>Query Compression</h2>
<p>Query compression, (see a tool like <a href="https://www.microsoft.com/en-us/research/project/llmlingua/longllmlingua/">LongLLMLingua</a>) is a technique for improving RAG's performance in long context scenarios where large language models often face challenges such as increased computational and financial costs, longer latency, and inferior performance. By enhancing the density and optimizing the position of key information in the input prompt, LongLLMLingua improves LLMs' perception of key information, which in turn, reduces computational load, decreases latency, and improves performance. This strategy ensures that vital information is not lost or diluted in lengthy contexts, thereby enhancing the relevance and quality of the generated output.
</p>
<p><a id="optimal-chunking-strategy"></a></p>
<h2>Optimal Chunking Strategy</h2>
<p>There are multiple strategies that can be applied to chunking; see <a href="https://safjan.com/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/#from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques">Chunking strategies</a>. One key aspect is controlling the chunk overlap. Semantic retrieval may pose a challenge when a selected chunk has meaningful context in adjacent chunks that could be missed. To mitigate this, an overlap of chunks can be implemented, whereby neighboring chunks are also passed to the Language Model (LLM) for generation. This guarantees that the surrounding context is incorporated, thus enhancing the output's quality.</p>
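<p>A sliding-window chunker with overlap can be sketched as follows (sizes are in words for simplicity; real pipelines usually count tokens):</p>

```python
def chunk_with_overlap(words, chunk_size=200, overlap=50):
    """Split a word list into chunks sharing `overlap` words with their neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```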
<p><a id="fine-tuning-embedding-models"></a></p>
<h2>Fine-tuning Embedding Models</h2>
<p>While off-the-shelf embedding models such as BERT and Ada may suffice for many use cases, they might not adequately represent specific domains in the vector space, leading to suboptimal retrieval quality. In such instances, it would be advantageous to fine-tune an embedding model using domain-specific data to significantly improve retrieval quality.</p>
<p><a id="enriching-metadata"></a></p>
<h2>Enriching Metadata</h2>
<p>The provision of metadata like source information about the chunks being processed can enhance the LLM's comprehension of the context, leading to a better output generation. This additional layer of information can provide the LLM with a more holistic understanding of the data, enabling it to generate more accurate and relevant responses.</p>
<p><a id="employing-re-ranking"></a></p>
<h2>Employing Re-ranking</h2>
<p>Semantic search may yield top-k results that are too similar to each other. To ensure a wider array of snippets, it is beneficial to <a href="https://www.sbert.net/examples/applications/retrieve_rerank/README.html">re-rank</a> the results based on other factors such as metadata and keyword matches. This diversification of snippets can lead to a more nuanced and comprehensive context for the LLM to generate responses. A re-ranker can be based on a cross-encoder.</p>
<p><a id="addressing-the-lost-in-the-middle-problem"></a></p>
<h2>Addressing the 'Lost in the Middle' Problem</h2>
<p>LLMs tend not to assign equal weight to all tokens in the input, often overlooking tokens located in the middle. This phenomenon, known as the <a href="https://arxiv.org/abs/2307.03172">'lost in the middle' problem</a>, can be addressed by reordering the context snippets to place the most vital snippets at the beginning and end of the input, with less important snippets situated in the middle.</p>
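<p>The reordering can be done with a simple interleave that pushes the highest-ranked snippets to both edges of the context. This mirrors the behavior of tools such as LangChain's <code>LongContextReorder</code>, though the sketch below is an independent illustration:</p>

```python
def reorder_for_middle_loss(snippets_best_first):
    """Place top-ranked snippets at the start and end, weakest in the middle."""
    front, back = [], []
    for i, snippet in enumerate(snippets_best_first):
        # Alternate snippets between the front and the back of the context
        (front if i % 2 == 0 else back).append(snippet)
    return front + back[::-1]
```

<p>Given snippets ranked <code>s1..s5</code> (best first), the result is <code>[s1, s3, s5, s4, s2]</code>: the two most important snippets sit at the edges of the prompt.</p>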
<p><a id="meta-data-filtering"></a></p>
<h2>Meta-data Filtering</h2>
<p>Meta-data, such as date tags, can be added to your chunks to improve retrieval. For example, filtering by recency can be beneficial when querying email history. Recent emails may not necessarily be the most similar from an embedding standpoint, but they are more likely to be relevant.</p>
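<p>For the email example, a recency filter over chunk metadata might look like this; the <code>metadata["date"]</code> layout is an assumed convention, not a specific framework's schema:</p>

```python
from datetime import datetime, timedelta

def filter_recent(chunks, days=30, now=None):
    """Keep only chunks whose metadata date is within the last `days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [c for c in chunks if c["metadata"]["date"] >= cutoff]
```

<p>The surviving chunks can then be ranked by embedding similarity as usual, so recency acts as a hard pre-filter rather than a similarity signal.</p>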
<p><a id="query-routing"></a></p>
<h2>Query Routing</h2>
<p>Having multiple indexes and routing queries to the appropriate index can be beneficial. For instance, different indexes could handle summarization questions, pointed questions, and date-sensitive questions. Trying to optimize one index for all these behaviors may compromise its effectiveness.</p>
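<p>A minimal keyword-based router illustrates the idea; real systems often use an LLM or a trained classifier to pick the route, and the route names here are invented:</p>

```python
def route_query(query, routes, default="general"):
    """Send the query to the index whose trigger keywords match it best."""
    q = query.lower()
    best_route, best_hits = default, 0
    for route, keywords in routes.items():
        hits = sum(keyword in q for keyword in keywords)
        if hits > best_hits:
            best_route, best_hits = route, hits
    return best_route

routes = {
    "summaries": ["summarize", "overview", "tl;dr"],
    "dates": ["when", "date", "recent", "latest"],
}
```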
<p>The performance of RAG in production can be significantly improved by applying a range of techniques, including hybrid search, chunk summarization, overlapping chunks, fine-tuned embedding models, metadata enhancement, re-ranking, addressing the 'lost in the middle' problem, query transformations, meta-data filtering, and query routing. These strategies will help to optimize the RAG pipeline, ensuring higher quality output and improved overall performance.</p>
<p><a id="references"></a></p>
<h2>References</h2>
<ol>
<li><a href="https://llmstack.ai/blog/retrieval-augmented-generation">Retrieval Augmented Generation (RAG): What, Why and How? | LLMStack</a></li>
<li><a href="https://arxiv.org/abs/2307.03172">[2307.03172] Lost in the Middle: How Language Models Use Long Contexts</a></li>
<li><a href="https://towardsdatascience.com/10-ways-to-improve-the-performance-of-retrieval-augmented-generation-systems-5fa2cee7cd5c">10 Ways to Improve the Performance of Retrieval Augmented Generation Systems | by Matt Ambrogi | Sep, 2023 | Towards Data Science</a></li>
<li>Hypothetical Document Embeddings (HyDE) - <a href="http://boston.lti.cs.cmu.edu/luyug/HyDE/HyDE.pdf">Precise Zero-Shot Dense Retrieval without Relevance Labels</a></li>
<li><a href="https://www.sbert.net/examples/applications/retrieve_rerank/README.html">Retrieve & Re-Rank — Sentence-Transformers documentation</a></li>
<li><a href="https://blog.llamaindex.ai/improving-rag-effectiveness-with-retrieval-augmented-dual-instruction-tuning-ra-dit-01e73116655d">Improving RAG effectiveness with Retrieval-Augmented Dual Instruction Tuning (RA-DIT) | by Emanuel Ferreira | Oct, 2023 | LlamaIndex Blog</a></li>
<li><a href="https://medium.com/towards-generative-ai/improving-rag-retrieval-augmented-generation-answer-quality-with-re-ranker-55a19931325">Improving RAG (Retrieval Augmented Generation) Answer Quality with Re-ranker | by Shivam Solanki | Towards Generative AI | Medium</a></li>
<li>SingleStore (db), finetuning embeddings model, CacheGPT, Nemo-Guardrails, <a href="https://madhukarkumar.medium.com/secrets-to-optimizing-rag-llm-apps-for-better-accuracy-performance-and-lower-cost-da1014127c0a">Secrets to Optimizing RAG LLM Apps for Better Performance, Accuracy and Lower Costs! | by Madhukar Kumar | madhukarkumar | Sep, 2023 | Medium</a></li>
<li><a href="https://github.com/run-llama/finetune-embedding">run-llama/finetune-embedding: Fine-Tuning Embedding for RAG with Synthetic Data</a></li>
<li><a href="https://github.com/zilliztech/GPTCache">zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index.</a></li>
<li><a href="https://github.com/NVIDIA/NeMo-Guardrails">NVIDIA/NeMo-Guardrails: NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.</a></li>
<li>library to evaluate the context retrieved from your enterprise corpus of data (how do you know if the context being retrieved is accurate) <a href="https://github.com/explodinggradients/ragas">GitHub - explodinggradients/ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines</a></li>
<li>LangSmith, introduced by LangChain - a highly effective tool for monitoring and examining the responses between the app and the LLM.</li>
<li><a href="https://arxiv.org/abs/2310.15123">[2310.15123] Branch-Solve-Merge Improves Large Language Model Evaluation and Generation</a></li>
</ol>Python Expertise Level - Self-Assessment2023-10-17T00:00:00+02:002023-10-17T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-10-17:/python-expertise-level-self-assessment/<p>Sometimes you need to assess your own or a candidate's level of expertise in Python programming. I have created some statements that roughly correspond to the various levels of expertise. Note that knowing programming language techniques contributes to expertise but does not make …</p><p>Sometimes you need to assess your own or a candidate's level of expertise in Python programming. I have created some statements that roughly correspond to the various levels of expertise. Note that knowing programming language techniques contributes to expertise but does not automatically make someone a great programmer. Knowledge of algorithms and data structures, programming patterns, and software architectures are some other important factors, to mention a few.</p>
<p>That said, I still find this simple classification of Python programmers into three categories useful: beginners, advanced, and experts.</p>
<h2>Beginners</h2>
<ol>
<li>Familiar with basic Python syntax and data types (strings, integers, lists, dictionaries).</li>
<li>Can write simple functions and use control flow statements (if, for, while).</li>
<li>Understands the concept of variables and variable scope.</li>
<li>Can use basic Python libraries like <code>math</code> and <code>random</code>.</li>
<li>Knows how to handle errors and exceptions using try/except blocks.</li>
<li>Can read from and write to files.</li>
<li>Understands the basics of object-oriented programming: classes, objects, methods.</li>
<li>Can use basic string and list methods for manipulation.</li>
<li>Knows how to use basic Python data structures like lists, tuples, and dictionaries.</li>
<li>Can write simple Python scripts to automate tasks.</li>
</ol>
<h2>Advanced</h2>
<ol>
<li>Understands and uses generators and decorators.</li>
<li>Can write complex functions and classes with multiple methods and attributes.</li>
<li>Understands and uses list comprehensions and lambda functions.</li>
<li>Can use regular expressions for pattern matching in strings (note: this is more a regex skill than a Python one)</li>
<li>Understands and uses context managers for resource management.</li>
<li>Can use advanced Python data structures like sets and frozensets.</li>
<li>Understands and uses Python's memory management and optimization techniques.</li>
<li>Can use Python's built-in functions like <code>map()</code>, <code>filter()</code>, <code>reduce()</code>.</li>
<li>Understands and uses Python's module and package system.</li>
</ol>
<h2>Experts</h2>
<ol>
<li>Understands and uses metaclasses and descriptors.</li>
<li>Can write and understand asynchronous code using <code>asyncio</code>.</li>
<li>Understands and uses Python's concurrency and parallelism features.</li>
<li>Can use Python's C API to extend Python with C/C++ code.</li>
<li>Understands and uses Python's dynamic typing system to its full extent.</li>
<li>Can write and understand complex decorators and context managers.</li>
<li>Proficient in Python's debugging and profiling, using tools like <code>pdb</code> for debugging and <code>cProfile</code> for profiling to optimize their code.</li>
<li>Have a deep understanding of Python's Global Interpreter Lock (GIL) and how it affects multithreaded programs.</li>
</ol>
<p>There is <a href="https://news.ycombinator.com/item?id=38032092">HN</a> discussion on this note.</p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-10-30: remove from Experts: 7. Understands and uses Python's garbage collection system.</li>
<li>2023-10-30: remove from Experts: Have a good understanding of Python's internals, such as bytecode, the Python interpreter's execution model, and how Python's data types are implemented at the C level.</li>
<li>2023-10-30: remove from Advanced: Can use advanced Python libraries like <code>numpy</code>, <code>pandas</code>, <code>matplotlib</code> not a python std lib.</li>
<li>added note:</li>
</ul>Understanding the Differences in Language Models - Transformers vs. Markov Models2023-10-07T00:00:00+02:002023-10-07T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-10-07:/understanding-differences-gpt-transformers-markov-models/<p>This article explores distinguishing details of Markov Models and Transformer-based models like GPT, focusing on how they predict the next character in a sequence. It explores the fundamental differences between these models, with a particular emphasis on how the self-attention mechanism in Transformer models makes a difference compared to the fixed context length in Markov models.</p><p>In the field of machine learning and natural language processing (NLP), different models have been developed to understand and generate human language. Two such models that have gained significant attention are the Markov Models and the Transformer-based models like GPT (<a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer">Generative Pretrained Transformer</a>). While both types of models can predict the next character in a sequence, they differ significantly in their underlying mechanisms and capabilities. This article aims to delve into the intricacies of these models, with a particular focus on how the self-attention mechanism in Transformer models makes a difference compared to the fixed context length in Markov models.</p>
<h2>Markov Models: A Brief Overview</h2>
<p><a href="https://en.wikipedia.org/wiki/Markov_model">Markov Models</a>, named after the Russian mathematician <a href="https://en.wikipedia.org/wiki/Andrey_Markov">Andrey Markov</a>, are a class of models that predict future states based solely on the current state, disregarding all past states. This property is known as the Markov Property, and it is the fundamental assumption that underlies all Markov models.</p>
<p>In the context of language modeling, a Markov Model might predict the next word or character in a sentence based on the current word or character. For instance, given the word "The", a Markov Model might predict that the next word is "cat" based on the probability distribution of words that follow "The" in its training data.</p>
<p>The main limitation of Markov Models is their lack of memory. Since they only consider the current state, they are unable to capture long-term dependencies in a sequence. For example, in the sentence "I grew up in France... I speak fluent ___", a Markov Model might struggle to fill in the blank correctly because the relevant context ("France") is several words back.</p>
<p><img alt="Markov Chain text generation" src="/images/transformers_vs_markov/markov_model_text_generation.png"></p>
<p><strong>Figure 1.</strong> <em>Markov Model might predict the next word based on the probability distribution of words in its training data. Image Source: <a href="https://jaroslawwiosna.github.io/markov-chain-text/">markov-chain-text | Modern C++ Markov chain text generator</a> by Jarosław Wiosna</em></p>
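<p>To make the Markov Property concrete, here is a minimal sketch of a first-order (bigram) word-level Markov model. The corpus and function names are illustrative, not a production implementation; the point is that the prediction depends only on the current word:</p>

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Count word transitions: for each word, how often each next word follows it."""
    words = text.split()
    model = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, word):
    """Predict the most likely next word given ONLY the current word (Markov Property)."""
    followers = model.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

corpus = "the cat sat on the mat and the cat slept"
model = build_bigram_model(corpus)
print(predict_next(model, "the"))  # prints "cat" ("cat" follows "the" twice, "mat" once)
```

<p>Note that the model keeps no memory of anything before the current word, which is exactly why the "I grew up in France... I speak fluent ___" example above defeats it.</p>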
<h2>Transformer Models: An Introduction</h2>
<p>Transformer models, on the other hand, were introduced in the seminal paper <a href="https://arxiv.org/abs/1706.03762">"Attention is All You Need"</a> by Vaswani et al. (2017). They represent a significant departure from previous sequence-to-sequence models, eschewing recurrent and convolutional layers in favor of self-attention mechanisms.</p>
<p>GPT, developed by OpenAI, is a prominent example of a Transformer model. It is a generative model that can generate human-like text by predicting the next word in a sequence. Unlike Markov Models, GPT considers the entire context of a sequence when making predictions, allowing it to capture long-term dependencies.</p>
<h2>The Power of Self-Attention</h2>
<p>The key innovation of Transformer models is the self-attention mechanism. This mechanism allows the model to weigh the importance of different words in the context when predicting the next word. For instance, in the sentence "The cat, which was black and white, jumped over the ___", the model might assign more importance to "cat" and "jumped" when predicting the next word.</p>
<p>Self-attention is calculated using the dot product of the query and key vectors, which are learned representations of the input. The resulting attention scores are then used to weight the value vectors, which are also learned representations of the input. This weighted sum forms the output of the self-attention layer.</p>
<p>The self-attention mechanism allows Transformer models to consider the entire context of a sequence, rather than just the current state. This is a significant advantage over Markov Models, which are limited by their fixed context length.</p>
<p><img alt="Transformer model - Context and Attention" src="/images/transformers_vs_markov/transformers_context_and_atention.png"></p>
<p><strong>Figure 2.</strong> <em>The self-attention mechanism allows Transformer models to consider the entire context of a sequence, rather than just the current state. Image Source: <a href="https://dzone.com/articles/a-deep-dive-into-the-transformer-architecture-the">A Deep Dive Into the Transformer Architecture – The Development of Transformer Models</a> by Kevin Hooke</em></p>
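<p>The computation described above can be sketched with NumPy. This is a single attention head without masking or multi-head machinery, and the dimensions and random weight matrices are illustrative stand-ins for learned parameters:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of input vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # learned projections: queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # dot product of queries and keys, scaled
    weights = softmax(scores, axis=-1)     # attention weights; each row sums to 1
    return weights @ V, weights            # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                            # 5 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = self_attention(X, Wq, Wk, Wv)
```

<p>Each row of <code>weights</code> shows how much each token attends to every other token in the sequence, which is how the model can pull in context from arbitrarily far back.</p>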
<h2>Fixed Context Length vs. Variable Context Length</h2>
<p>Markov Models, due to their inherent design, have a fixed context length. They only consider the current state when making predictions, which limits their ability to capture long-term dependencies. This can lead to less accurate predictions, especially in complex sequences where the relevant context might be several states back.</p>
<p>Transformer models, on the other hand, have a variable context length. Thanks to the self-attention mechanism, they can consider the entire context of a sequence when making predictions. This allows them to capture long-term dependencies and make more accurate predictions.</p>
<p>Moreover, the self-attention mechanism allows Transformer models to dynamically adjust the context length based on the input. For instance, in a sentence with many irrelevant words, the model might focus on a few key words, effectively reducing the context length. This dynamic context length is another advantage of Transformer models over Markov Models.</p>
<h2>Conclusion</h2>
<p>While both Markov Models and Transformer models like GPT can predict the next character in a sequence, they differ significantly in their underlying mechanisms and capabilities. Markov Models, with their fixed context length, are limited in their ability to capture long-term dependencies. Transformer models, with their self-attention mechanism, can consider the entire context of a sequence, allowing them to capture long-term dependencies and make more accurate predictions.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<h2>References</h2>
<ol>
<li>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). <a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a>. In Advances in neural information processing systems (pp. 5998-6008).</li>
<li>Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners</a>. OpenAI Blog.</li>
<li>Bishop, C. M. (2006). <a href="https://www.springer.com/gp/book/9780387310732">Pattern Recognition and Machine Learning</a>. Springer.</li>
<li>Alammar, J. (2018). <a href="http://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>. Jay Alammar's Blog.</li>
<li>Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). <a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners</a>. In Advances in Neural Information Processing Systems.</li>
<li>Chollet, F. (2018). <a href="https://www.manning.com/books/deep-learning-with-python">Deep Learning with Python</a>. Manning Publications Co.</li>
<li>Jurafsky, D., & Martin, J. H. (2019). <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a>. Stanford University.</li>
<li>Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019). <a href="https://arxiv.org/abs/1808.04444">Character-Level Language Modeling with Deeper Self-Attention</a>. In Proceedings of the AAAI Conference on Artificial Intelligence.</li>
<li>Goodfellow, I., Bengio, Y., & Courville, A. (2016). <a href="http://www.deeplearningbook.org/">Deep Learning</a>. MIT press.</li>
<li>Manning, C. D., & Schütze, H. (1999). <a href="https://mitpress.mit.edu/books/foundations-statistical-natural-language-processing">Foundations of Statistical Natural Language Processing</a>. MIT Press.</li>
<li>Jurafsky, D., & Martin, J. H. (2009). <a href="https://www.pearson.com/us/higher-education/program/Jurafsky-Speech-and-Language-Processing-An-Introduction-to-Natural-Language-Processing-Computational-Linguistics-and-Speech-Recognition-2nd-Edition/PGM319216.html">Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition</a>. Prentice Hall.</li>
<li>Jelinek, F. (1997). <a href="https://mitpress.mit.edu/books/statistical-methods-speech-recognition">Statistical Methods for Speech Recognition</a>. MIT Press.</li>
<li>Russell, S., & Norvig, P. (2016). <a href="http://aima.cs.berkeley.edu/">Artificial Intelligence: A Modern Approach</a>. Pearson.</li>
<li>Charniak, E. (1993). <a href="https://mitpress.mit.edu/books/statistical-language-learning">Statistical Language Learning</a>. MIT Press.</li>
<li>Lin, T. (2015). <a href="https://towardsdatascience.com/markov-chains-and-text-generation-162fd4ec8f26">Markov Chains and Text Generation</a>. Towards Data Science Blog.</li>
<li>Goodman, J. (2001). <a href="https://www.microsoft.com/en-us/research/publication/a-bit-of-progress-in-language-modeling/">A bit of progress in language modeling</a>. Microsoft Research.</li>
<li>Rosenfeld, R. (2000). <a href="https://www.cs.cmu.edu/~roni/papers/SLM-hlt01.pdf">Two Decades of Statistical Language Modeling: Where Do We Go From Here?</a>. Proceedings of the IEEE.</li>
<li>Nazarko, K. (2021). <a href="https://towardsdatascience.com/text-generation-gpt-2-lstm-markov-chain-9ea371820e1e">Word-level text generation using GPT-2, LSTM and Markov Chain</a>. Towards Data Science Blog.</li>
<li>Adyatama, A. (2020). <a href="https://algotech.netlify.app/blog/text-generating-with-markov-chains/">Text Generation with Markov Chains</a>. Algoritma Technical Blog.</li>
</ol>How Agile Can Kill Creativity in Data Science team?2023-09-29T00:00:00+02:002023-09-29T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-29:/how-agile-can-kill-creativity-in-data-science-team/<p>Discover the delicate balance between Agile methodologies and imagination in the domain of data science and analytics. Uncover the impact of Agile approaches on creativity within data science teams. Explore how these practices shape the innovative landscape of data science and analytics.</p><p>Agile methodologies can provide numerous benefits to data science and analytics teams, such as quicker delivery, enhanced collaboration, and increased customer satisfaction. However, if not implemented effectively, Agile may unintentionally impede creativity in these teams. Here are a few ways Agile can potentially hinder creativity in data science/analytics.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#potential-problems">Potential problems</a><ul>
<li><a href="#tight-deadlines-and-sprints">Tight deadlines and sprints</a></li>
<li><a href="#focus-on-deliverables">Focus on deliverables</a></li>
<li><a href="#lack-of-autonomy">Lack of autonomy</a></li>
<li><a href="#constant-and-sudden-changes">Constant and sudden changes</a></li>
<li><a href="#overemphasis-on-standardized-processes">Overemphasis on standardized processes</a></li>
</ul>
</li>
<li><a href="#mitigation">Mitigation</a><ul>
<li><a href="#complementary-practices">Complementary practices</a></li>
<li><a href="#frameworks-tailored-for-data-science-projects">Frameworks tailored for data science projects</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="potential-problems"></a></p>
<h2>Potential problems</h2>
<p><a id="tight-deadlines-and-sprints"></a></p>
<h3>Tight deadlines and sprints</h3>
<p>Agile typically operates on tight timelines with fixed sprints. This can limit the time available for exploration, experimentation, and creative thinking. The emphasis on adhering to strict schedules may discourage innovative approaches that require more time to develop.</p>
<p><a id="focus-on-deliverables"></a></p>
<h3>Focus on deliverables</h3>
<p>Agile methodologies often prioritize delivering functioning solutions over long-term exploration. This focus on short-term goals can discourage team members from taking the time to explore complex problems creatively, resulting in a more practical, rather than innovative, approach.</p>
<p><a id="lack-of-autonomy"></a></p>
<h3>Lack of autonomy</h3>
<p>In some Agile implementations, teams may be closely supervised or required to adhere to preset workflows. This kind of micromanagement limits individual creativity, as team members may not have the freedom to experiment, propose alternative solutions, or take calculated risks.</p>
<p><a id="constant-and-sudden-changes"></a></p>
<h3>Constant and sudden changes</h3>
<p>Agile projects often involve iterative development with frequent changes in priorities and requirements. While this adaptability is beneficial in many cases, it can disrupt the creative process and impede the ability to think deeply about problems. Constantly switching gears may hinder the exploration of unconventional solutions.</p>
<p><a id="overemphasis-on-standardized-processes"></a></p>
<h3>Overemphasis on standardized processes</h3>
<p>Agile frameworks provide standardized processes and practices that ensure consistency and predictability. While these are essential for efficient project management, a strict adherence to these processes can stifle creativity as it may discourage deviation from the prescribed methods.</p>
<p><a id="mitigation"></a></p>
<h2>Mitigation</h2>
<p><a id="complementary-practices"></a></p>
<h3>Complementary practices</h3>
<p>To prevent the potential negative impact on creativity, Agile methodologies should be complemented with the following practices:</p>
<ul>
<li>Allow dedicated time for <strong>exploration</strong> and <strong>learning</strong> outside of fixed sprints.</li>
<li>Encourage <strong>cross-functional collaboration</strong> and knowledge sharing to foster creativity.</li>
<li>Provide opportunities for <strong>innovation-driven initiatives</strong> alongside project-driven ones.</li>
<li>Support a <strong>psychologically safe environment</strong> that allows for experimentation and failure.</li>
<li><strong>Recognize</strong> and reward <strong>creative thinking</strong> and experimentation within the team.</li>
</ul>
<p><a id="frameworks-tailored-for-data-science-projects"></a></p>
<h3>Frameworks tailored for data science projects</h3>
<p>Data science teams need to adapt Agile practices to suit their specific needs and contexts, and to balance the trade-offs between speed, flexibility, and quality. They can adopt or modify a framework that is tailored for data science projects, such as the <strong><a href="https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview">Team Data Science Process</a></strong> (TDSP) or the <strong><a href="https://www.datascience-pm.com/agile-data-science/">Agile Data Science Process</a></strong>. These frameworks provide guidance on how to structure, execute, and manage data science projects using Agile principles and practices.</p>
<p>By adjusting Agile practices to accommodate these considerations, data science/analytics teams can create a balance between efficient project management and fostering creativity and innovation.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><a id="references"></a></p>
<h2>References</h2>
<ol>
<li><a href="https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview">What is the Team Data Science Process? - Azure Architecture Center | Microsoft Learn</a></li>
<li><a href="https://www.datascience-pm.com/agile-data-science/">Agile Data Science - Data Science Process Alliance</a></li>
<li><a href="https://eugeneyan.com/writing/data-science-and-agile-what-works-and-what-doesnt/">Data Science and Agile (What Works, and What Doesn't)</a> (read about poor resource planning)</li>
<li><a href="https://www.geeksforgeeks.org/agile-methodology-advantages-and-disadvantages/">Agile Methodology Advantages and Disadvantages - GeeksforGeeks</a></li>
</ol>The Right Way to Job-Hop2023-09-29T00:00:00+02:002023-09-29T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-29:/the-right-way-to-job-hop/<p>NOTE: The advice below was extracted from the podcast transcript using an LLM.</p>
<p>Based on the podcast "The right way to Job-hop" <a href="https://stackoverflow.blog/2022/10/11/the-right-way-to-job-hop-ai-generated-pokemon-ep-495/">transcript</a>, here are some key pieces of advice on how to do "job hopping" the right way:</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#key-pieces-of-advice">Key Pieces …</a></li></ul><p>NOTE: The advice below was extracted from the podcast transcript using an LLM.</p>
<p>Based on the podcast "The right way to Job-hop" <a href="https://stackoverflow.blog/2022/10/11/the-right-way-to-job-hop-ai-generated-pokemon-ep-495/">transcript</a>, here are some key pieces of advice on how to do "job hopping" the right way:</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#key-pieces-of-advice">Key Pieces of Advice</a></li>
<li><a href="#follow-new-tech-trends">Follow New Tech Trends</a></li>
<li><a href="#dont-stay-too-long-in-one-place">Don't Stay Too Long in One Place</a></li>
<li><a href="#use-job-hopping-to-gain-a-variety-of-experience">Use Job Hopping to Gain a Variety of Experience</a></li>
<li><a href="#leave-a-job-for-a-good-reason">Leave a Job for a Good Reason</a></li>
<li><a href="#stay-at-a-job-for-at-least-a-year">Stay at a Job for at Least a Year</a></li>
<li><a href="#be-prepared-to-explain-your-job-hopping">Be Prepared to Explain Your Job Hopping</a></li>
<li><a href="#consider-remote-opportunities">Consider Remote Opportunities</a></li>
<li><a href="#focus-on-increasing-your-personal-wealth">Focus on Increasing Your Personal Wealth</a></li>
<li><a href="#ensure-youre-moving-up-with-each-job-change">Ensure You're Moving Up With Each Job Change</a></li>
<li><a href="#ask-the-right-questions-during-interviews">Ask the Right Questions During Interviews</a></li>
<li><a href="#additional-pieces-of-advice">Additional pieces of advice</a></li>
<li><a href="#understand-the-impact-of-job-hopping-on-your-resume">Understand the Impact of Job Hopping on Your Resume</a></li>
<li><a href="#avoid-short-stints">Avoid Short Stints</a></li>
<li><a href="#use-job-hopping-as-a-negotiation-tool">Use Job Hopping as a Negotiation Tool</a></li>
<li><a href="#consider-the-company-culture">Consider the Company Culture</a></li>
<li><a href="#be-transparent-and-honest">Be Transparent and Honest</a></li>
<li><a href="#keep-learning-and-updating-your-skills">Keep Learning and Updating Your Skills</a></li>
<li><a href="#maintain-professional-relationships">Maintain Professional Relationships</a></li>
<li><a href="#consider-the-impact-on-your-long-term-career-goals">Consider the Impact on Your Long-Term Career Goals</a></li>
<li><a href="#take-advantage-of-remote-work-opportunities">Take Advantage of Remote Work Opportunities</a></li>
<li><a href="#always-leave-on-good-terms-dont-burn-bridges">Always Leave on Good Terms, Don't Burn Bridges</a></li>
<li><a href="#evaluate-the-companys-stability">Evaluate the Company's Stability</a></li>
<li><a href="#consider-the-impact-on-your-work-life-balance">Consider the Impact on Your Work-Life Balance</a></li>
<li><a href="#take-time-to-reflect-on-each-job-change">Take Time to Reflect on Each Job Change</a></li>
<li><a href="#be-prepared-for-potential-negative-perceptions">Be Prepared for Potential Negative Perceptions</a></li>
<li><a href="#dont-job-hop-just-for-the-sake-of-it">Don't Job Hop Just for the Sake of It</a></li>
<li><a href="#consider-the-benefits-and-drawbacks">Consider the Benefits and Drawbacks</a></li>
<li><a href="#keep-your-skills-up-to-date">Keep Your Skills Up to Date</a></li>
<li><a href="#network-effectively">Network Effectively</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="key-pieces-of-advice"></a></p>
<h2>Key Pieces of Advice</h2>
<p><a id="follow-new-tech-trends"></a></p>
<h3>Follow New Tech Trends</h3>
<p>The tech industry is characterized by rapid and constant evolution. As such, it's crucial to stay abreast of emerging technologies and trends. By doing so, you can identify opportunities to gain experience in these new areas, which can enhance your skill set and make you more marketable. Job hopping can be a strategic way to follow these trends, allowing you to move between companies that are at the forefront of these changes, thereby ensuring your skills remain relevant and in-demand.</p>
<p><a id="dont-stay-too-long-in-one-place"></a></p>
<h3>Don't Stay Too Long in One Place</h3>
<p>Unlike many other industries where longevity in a role is often rewarded, the tech industry values adaptability and diverse experience. Given the high demand for tech skills, employers are often willing to offer competitive compensation packages to attract talent, even if the candidate has a history of changing jobs frequently. Therefore, don't hesitate to change jobs every few years if it means advancing your career, gaining new skills, or improving your compensation.</p>
<p><a id="use-job-hopping-to-gain-a-variety-of-experience"></a></p>
<h3>Use Job Hopping to Gain a Variety of Experience</h3>
<p>Job hopping can provide a wealth of diverse experiences. By moving between different companies, roles, and projects, you can acquire a broad range of skills and insights. This variety can not only enhance your professional development and accelerate your career progression but also make you a more attractive candidate to potential employers who value such diverse experience.</p>
<p><a id="leave-a-job-for-a-good-reason"></a></p>
<h3>Leave a Job for a Good Reason</h3>
<p>While job hopping is more accepted in the tech industry, it's still important to have a valid reason for leaving each job. This could be to pursue a new opportunity, acquire new skills, seek a higher salary, or aim for a promotion. Leaving a job without a good reason could raise concerns for potential employers, who may question your commitment or reliability. Therefore, always ensure you can articulate your reasons for job changes in a positive and professional manner.</p>
<p><a id="stay-at-a-job-for-at-least-a-year"></a></p>
<h3>Stay at a Job for at Least a Year</h3>
<p>While frequent job changes can be beneficial, it's advisable to stay at each job for at least a year. This duration allows you sufficient time to fully understand your role, contribute meaningfully to the company, and leave a positive impression. It also demonstrates to future employers that you can commit to a role and see projects through to completion.</p>
<p><a id="be-prepared-to-explain-your-job-hopping"></a></p>
<h3>Be Prepared to Explain Your Job Hopping</h3>
<p>If your resume shows frequent job changes, be prepared to explain this during interviews. Honesty is key here. Focus on the positive aspects of job hopping, such as the diverse skills and experiences you've gained, the opportunities you've had to work on different projects or with different technologies, and how these experiences have contributed to your professional growth.</p>
<p><a id="consider-remote-opportunities"></a></p>
<h3>Consider Remote Opportunities</h3>
<p>The rise of remote work has significantly expanded job opportunities. You can now work for companies based in different cities, states, or even countries without having to relocate. This can make job hopping more convenient and less disruptive to your personal life, while also opening up a wider range of potential job opportunities.</p>
<p><a id="focus-on-increasing-your-personal-wealth"></a></p>
<h3>Focus on Increasing Your Personal Wealth</h3>
<p>While loyalty to an employer is important, it's also crucial to focus on your personal financial growth. If changing jobs can help you achieve higher compensation, whether through a higher salary, better benefits, or equity options, then it's a move worth considering. Remember, your primary professional obligation is to your own career development and financial stability.</p>
<p><a id="ensure-youre-moving-up-with-each-job-change"></a></p>
<h3>Ensure You're Moving Up With Each Job Change</h3>
<p>Each job change should represent a step forward in your career. Whether it's a higher role, more responsibilities, or the opportunity to work with new technologies, each move should contribute to your career progression. This upward trajectory can demonstrate to potential employers your ambition, your ability to take on new challenges, and your commitment to professional growth.</p>
<p><a id="ask-the-right-questions-during-interviews"></a></p>
<h3>Ask the Right Questions During Interviews</h3>
<p>When interviewing for a new job, it's important to ask questions that can help you understand the company's culture and whether it aligns with your values and career goals. This can help you avoid accepting a job that isn't a good fit for you. Ask about the company's values, their approach to work-life balance, opportunities for professional development, and their expectations for the role you're applying for. This can give you a clearer picture of what it would be like to work for the company and help you make an informed decision.</p>
<p><a id="additional-pieces-of-advice"></a></p>
<h2>Additional pieces of advice</h2>
<p><a id="understand-the-impact-of-job-hopping-on-your-resume"></a></p>
<h3>Understand the Impact of Job Hopping on Your Resume</h3>
<p>It's important to recognize that the perception of frequent job changes can vary across industries. In the tech sector, it's generally accepted and can even be seen as a sign of adaptability and a desire to acquire diverse skills. However, in other industries, it might raise questions about your stability or commitment. Therefore, when crafting your resume and cover letter, tailor them to address any potential concerns. Highlight the skills and experiences you've gained through job hopping and how they've contributed to your professional growth.</p>
<p><a id="avoid-short-stints"></a></p>
<h3>Avoid Short Stints</h3>
<p>While job hopping can offer numerous benefits, extremely short stints (like three to nine months) at multiple companies can raise red flags for potential employers. It might suggest that you struggle to commit to a role or adapt to a new environment. Aim to stay at each job for at least a year, which shows that you can contribute meaningfully to a company and see projects through to completion.</p>
<p><a id="use-job-hopping-as-a-negotiation-tool"></a></p>
<h3>Use Job Hopping as a Negotiation Tool</h3>
<p>Job hopping can serve as a powerful negotiation tool. If you receive a job offer with a higher salary or better benefits from another company, you can use this as leverage to negotiate better terms with your current employer. This strategy can help you maximize your earning potential and benefits without necessarily having to change jobs.</p>
<p><a id="consider-the-company-culture"></a></p>
<h3>Consider the Company Culture</h3>
<p>Before deciding to hop to a new job, take the time to understand the company's culture. If the company values loyalty and long-term commitment, frequent job hopping might be viewed negatively. Conversely, if the company values diverse experiences and skills, job hopping might be seen as a positive attribute. Understanding a company's culture can help you make informed decisions about job hopping.</p>
<p><a id="be-transparent-and-honest"></a></p>
<h3>Be Transparent and Honest</h3>
<p>During interviews, be transparent and honest about your reasons for job hopping. If you're leaving a job due to dissatisfaction, explain your reasons professionally and constructively. This can demonstrate to potential employers that you're thoughtful about your career decisions and are not simply leaving jobs on a whim.</p>
<p><a id="keep-learning-and-updating-your-skills"></a></p>
<h3>Keep Learning and Updating Your Skills</h3>
<p>The tech industry is characterized by rapid and continuous evolution. Therefore, it's crucial to keep learning and updating your skills to stay relevant. This commitment to continuous learning can make you more attractive to potential employers and open up more opportunities for job hopping.</p>
<p><a id="maintain-professional-relationships"></a></p>
<h3>Maintain Professional Relationships</h3>
<p>Even if you change jobs frequently, it's important to maintain positive relationships with your former employers and colleagues. They can provide valuable references in the future and might even offer you new opportunities. Networking is a key aspect of career development, and maintaining these professional relationships can be beneficial in the long run.</p>
<p><a id="consider-the-impact-on-your-long-term-career-goals"></a></p>
<h3>Consider the Impact on Your Long-Term Career Goals</h3>
<p>While job hopping can provide immediate benefits such as higher pay or a more desirable role, it's important to consider how it aligns with your long-term career goals. If a new job offers valuable experience or skills that align with your long-term objectives, it might be worth making the move. Always consider the long-term implications of job hopping on your career trajectory.</p>
<p><a id="take-advantage-of-remote-work-opportunities"></a></p>
<h3>Take Advantage of Remote Work Opportunities</h3>
<p>The rise of remote work has significantly expanded the job market. This means you can job hop without the geographical constraints that traditionally limited job opportunities. This can allow you to access opportunities in different cities, states, or even countries, broadening your career prospects.</p>
<p><a id="always-leave-on-good-terms-dont-burn-bridges"></a></p>
<h3>Always Leave on Good Terms, Don't Burn Bridges</h3>
<p>Regardless of your reasons for leaving a job, always strive to leave on good terms. This includes giving proper notice, completing all outstanding tasks, and offering to assist with the transition. Doing so protects your professional reputation, which is crucial when job hopping, and leaves a positive lasting impression with your former colleagues and managers. Those relationships can be valuable for networking, references, and potential future collaborations, so express gratitude for the experience and keep the lines of communication open.</p>
<p><a id="evaluate-the-companys-stability"></a></p>
<h3>Evaluate the Company's Stability</h3>
<p>Before making a decision to switch jobs, it's crucial to assess the stability of the prospective company. If the company exhibits signs of instability or has a high employee turnover rate, it might not be the best choice for your next move, even if the job offers a higher salary or better benefits. A stable work environment can provide a sense of security and allow for long-term growth and development.</p>
<p><a id="consider-the-impact-on-your-work-life-balance"></a></p>
<h3>Consider the Impact on Your Work-Life Balance</h3>
<p>Job hopping can sometimes disrupt your work-life balance, particularly if you're constantly adapting to new roles, teams, and work environments. When considering a new job, think about how it will affect your personal life, including your family, hobbies, and personal commitments. Ensure that the new job aligns with your work-life balance goals and won't negatively impact your personal life.</p>
<p><a id="take-time-to-reflect-on-each-job-change"></a></p>
<h3>Take Time to Reflect on Each Job Change</h3>
<p>After each job change, take some time to reflect on your experiences. Consider what you learned, what you liked and disliked, and how these experiences can inform your future career decisions. This reflection can help you understand your career preferences, strengths, and areas for improvement, enabling you to make more informed decisions when job hopping.</p>
<p><a id="be-prepared-for-potential-negative-perceptions"></a></p>
<h3>Be Prepared for Potential Negative Perceptions</h3>
<p>While job hopping is more accepted in the tech industry, it may still be viewed negatively by some people. Be prepared to address any potential negative perceptions during interviews. Explain why job hopping has been beneficial for your career, focusing on the diverse skills and experiences you've gained.</p>
<p><a id="dont-job-hop-just-for-the-sake-of-it"></a></p>
<h3>Don't Job Hop Just for the Sake of It</h3>
<p>While job hopping can offer many benefits, it's important not to do it without a clear purpose. Ensure that each job change aligns with your career goals and offers valuable experience or skills. Aimless job hopping can lead to a disjointed career path and may raise red flags for potential employers.</p>
<p><a id="consider-the-benefits-and-drawbacks"></a></p>
<h3>Consider the Benefits and Drawbacks</h3>
<p>Before deciding to job hop, weigh the benefits and drawbacks. While job hopping can offer higher salaries, diverse experiences, and faster career progression, it can also lead to instability, a lack of deep expertise in one area, and potential negative perceptions. Make sure that the benefits outweigh the drawbacks before making a move.</p>
<p><a id="keep-your-skills-up-to-date"></a></p>
<h3>Keep Your Skills Up to Date</h3>
<p>In the fast-paced tech industry, keeping your skills up to date is crucial. By continuously learning and adapting to new technologies and trends, you'll be more attractive to potential employers and better equipped to take on new roles. Consider professional development opportunities, online courses, and industry certifications to keep your skills fresh.</p>
<p><a id="network-effectively"></a></p>
<h3>Network Effectively</h3>
<p>Networking is key when job hopping. Maintain your professional relationships and make new connections in the industry. Attend industry events, join professional organizations, and leverage social media platforms like LinkedIn to expand your network. A strong professional network can open up new opportunities and make job hopping easier and more successful.</p>LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of characters2023-09-27T00:00:00+02:002023-09-27T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-27:/langchain-recursivecharactertextsplitter-split-by-tokens-instead-of-characters/<h1>LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of Characters</h1>
<p>The LangChain <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">RecursiveCharacterTextSplitter</a> is a tool that allows you to split text on predefined characters that are considered potential division points. By default, the size of the chunk is in characters but …</p><h1>LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of Characters</h1>
<p>The LangChain <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">RecursiveCharacterTextSplitter</a> is a tool that allows you to split text on predefined characters that are considered potential division points. By default, the chunk size is measured in characters, but by using the <code>from_tiktoken_encoder()</code> method you can easily split text to achieve a given chunk size in tokens instead. This is especially useful since LLMs have context limits expressed in tokens, not characters. This kind of split can be useful in various natural language processing tasks, such as language modeling or text classification.</p>
<p>To use the RecursiveCharacterTextSplitter, follow these steps:</p>
<ol>
<li>
<p>Import the required module: <code>from langchain.text_splitter import RecursiveCharacterTextSplitter</code></p>
</li>
<li>
<p>Set the desired chunk size (in tokens): <code>CHUNK_SIZE_TOKENS = 1_500</code></p>
</li>
<li>
<p>Instantiate the RecursiveCharacterTextSplitter using the <code>from_tiktoken_encoder</code> method and provide the chunk size and overlap values:</p>
</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">text_splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="o">.</span><span class="n">from_tiktoken_encoder</span><span class="p">(</span>
<span class="n">chunk_size</span><span class="o">=</span><span class="n">CHUNK_SIZE_TOKENS</span><span class="p">,</span>
<span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div>
<ol start="4">
<li>Once the text_splitter object is created, you can use the <code>create_documents</code> method to split your text into documents. Make sure to pass the text to be split as a parameter in a list format:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">docs</span> <span class="o">=</span> <span class="n">text_splitter</span><span class="o">.</span><span class="n">create_documents</span><span class="p">([</span><span class="n">text</span><span class="p">])</span>
</code></pre></div>
<p>For alternative solutions and further discussion, you can refer to the following GitHub issue: <a href="https://github.com/langchain-ai/langchain/issues/4678#issuecomment-1704305645">LangChain Issue #4678</a>.</p>From Fixed-Size to NLP Chunking - A Deep Dive into Text Chunking Techniques2023-09-11T00:00:00+02:002023-11-06T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-09-11:/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/<p>Discover text chunking - the secret sauce behind accurate search results and smarter language models! By understanding how to effectively chunk text, we can improve the way we index documents, handle user queries, and utilize search results. Ready to uncover the secrets of text chunking?</p><h2>Understanding Chunking</h2>
<p>Chunking is a process that aims to embed a piece of content with as little noise as possible while maintaining semantic relevance[^2]. This process is particularly useful in semantic search, where we index a corpus of documents, each containing valuable information on a specific topic.</p>
<p>An effective chunking strategy ensures that search results accurately capture the essence of a user's query. If our chunks are too small or too large, it may lead to imprecise search results or missed opportunities to surface relevant content. As a <strong>rule of thumb</strong>, if the <strong>chunk of text makes sense without the surrounding context to a human</strong>, it will likely make sense to the language model as well[^2]. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#factors-influencing-chunking-strategy">Factors Influencing Chunking Strategy</a><ul>
<li><a href="#size-of-the-texts-to-be-indexed">Size of the Texts to be Indexed</a></li>
<li><a href="#length-and-complexity-of-user-queries">Length and Complexity of User Queries</a></li>
<li><a href="#utilization-of-the-retrieved-results-in-the-application">Utilization of the Retrieved Results in the Application</a></li>
</ul>
</li>
<li><a href="#chunking-methods">Chunking Methods</a><ul>
<li><a href="#fixed-size-in-characters-overlapping-sliding-window">Fixed-size (in characters) Overlapping Sliding Window</a></li>
<li><a href="#fixed-size-in-tokens-overlapping-sliding-window">Fixed-size (in tokens) Overlapping Sliding Window</a></li>
<li><a href="#recursive-structure-aware-splitting">Recursive Structure Aware Splitting</a></li>
<li><a href="#structure-aware-splitting-by-sentence-paragraph-section-chapter">Structure Aware Splitting (by Sentence, Paragraph, Section, Chapter)</a></li>
<li><a href="#nlp-chunking-tracking-topic-changes">NLP Chunking: Tracking Topic Changes</a></li>
<li><a href="#content-aware-splitting-markdown-latex-html">Content-Aware Splitting (Markdown, LaTeX, HTML)</a></li>
<li><a href="#adding-extra-context-to-the-chunk-metadata-summaries">Adding Extra Context to the Chunk (metadata, summaries)</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="factors-influencing-chunking-strategy"></a></p>
<h2>Factors Influencing Chunking Strategy</h2>
<p>There are three main factors to consider when determining a chunking strategy for a specific use case and application:</p>
<ol>
<li>The size of the texts to be indexed and chunked</li>
<li>The length and complexity of user queries</li>
<li>The utilization of the retrieved results in the application</li>
</ol>
<p><a id="size-of-the-texts-to-be-indexed"></a></p>
<h3>Size of the Texts to be Indexed</h3>
<p>The chunking unit and size should be adjusted according to the nature of the text. The chunk should be long enough to contain the relevant semantic load. For instance, individual words may not convey a specific message or piece of information, while putting an entire encyclopedia in one chunk may result in a chunk that is "about everything."</p>
<p><a id="length-and-complexity-of-user-queries"></a></p>
<h3>Length and Complexity of User Queries</h3>
<ul>
<li><strong>Longer queries</strong> or those with greater complexity typically <strong>benefit from a smaller chunk length</strong>. This helps to narrow down the search space and improve the precision of the search results. Smaller chunks allow more focused matching against embeddings, reducing the impact of irrelevant parts within the query.</li>
<li><strong>Shorter and simpler queries</strong> might not require chunking at all, as they can be processed as a single unit. Chunking may introduce unnecessary overhead in these cases, potentially hampering search performance.</li>
</ul>
<p><a id="utilization-of-the-retrieved-results-in-the-application"></a></p>
<h3>Utilization of the Retrieved Results in the Application</h3>
<p>In cases where search results are only an intermediate step in the application's chain, chunk size can be critical to the seamless operation of the application. For example, if results from multiple search queries form the input context for an LLM prompt, small chunks make it easier to fit all inputs within the maximum allowed context size for a given LLM. Conversely, if the search result is presented directly to the user, larger chunks may be more appropriate.</p>
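<p>To make the context-budget point concrete, here is a minimal sketch (the function names and the ~4-characters-per-token heuristic are illustrative assumptions, not a production implementation): given retrieved chunks already sorted by relevance, greedily keep chunks until the prompt's token budget is exhausted.</p>

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for common English text."""
    return max(1, len(text) // 4)


def pack_context(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily keep top-ranked chunks (assumed sorted by relevance)
    until the token budget for the LLM prompt is exhausted."""
    packed, used = [], 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > max_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed
```

<p>With smaller chunks, the leftover budget after packing is smaller, so less retrieved material is wasted.</p>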
<p><a id="chunking-methods"></a></p>
<h2>Chunking Methods</h2>
<p>There are several methods for chunking text, each with its own advantages and disadvantages. The choice of method depends on the specific requirements of the use case and application.</p>
<p><a id="fixed-size-in-characters-overlapping-sliding-window"></a></p>
<h3>Fixed-size (in characters) Overlapping Sliding Window</h3>
<p>The Fixed-size overlapping sliding window method is a naive approach to text chunking: the text is divided into fixed-size chunks based on character count, which makes the method straightforward to implement. The overlap aids in preserving the integrity of sentences or thoughts, ensuring they are not cut in the middle; if one window truncates a thought, another window might contain the complete thought.</p>
<p>However, this method presents certain limitations. One significant drawback is the lack of precise control over the context size. Most language models operate on the basis of tokens rather than characters or words, making this method less efficient. The strict and fixed-size nature of the window might also result in severing words, sentences, or paragraphs in the middle, which could impede comprehension and disrupt the flow of information.</p>
<p>Furthermore, this method does not take semantics into account, providing no guarantee that the semantic unit of the text capturing a given idea or thought will be accurately encapsulated within a chunk. Consequently, one chunk may not be semantically distinct from another.</p>
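<p>A minimal sketch of this method (names are illustrative): slide a fixed-size character window over the text with a configurable overlap.</p>

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks; consecutive chunks
    share `overlap` characters, so a thought cut at one boundary may
    appear whole in a neighboring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

<p>Note how the sketch exhibits exactly the drawbacks discussed above: chunk boundaries fall at arbitrary character positions, regardless of words or semantics.</p>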
<h4>Use Cases</h4>
<p>The Fixed-size overlapping sliding window method can be beneficial in certain scenarios. It is especially useful in preliminary exploratory data analysis, where the goal is to obtain a general understanding of the text structure rather than a deep semantic analysis. Additionally, it could be employed in scenarios where the text data does not have a strong semantic structure, such as in certain types of raw data or logs.</p>
<p>However, for tasks that require semantic understanding and precise context, such as sentiment analysis, question-answering systems, or text summarization, more sophisticated text chunking methods would be more appropriate.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Counting characters makes implementation easy</li>
<li>Using overlap helps to avoid having sentences or thoughts cut in the middle - if one window cuts the thought, perhaps another will contain it in one piece.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>No precise control over the context size - models work on and measure text in tokens, not in characters or words</li>
<li>Having a strict, fixed-size window might lead to cutting words, sentences, or paragraphs in the middle.</li>
<li>Doesn't take semantics into account - no guarantee that the semantic unit of text capturing a given idea or thought will be accurately captured in one chunk, with another chunk dedicated to another idea</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Preliminary exploratory data analysis where a general understanding of the text is required</li>
<li>Scenarios where the text does not have a strong semantic structure, such as certain types of raw data or logs</li>
<li>Not recommended for tasks requiring semantic understanding and precise contexts like sentiment analysis, question-answering systems, or text summarization</li>
</ul>
<p><a id="fixed-size-in-tokens-overlapping-sliding-window"></a></p>
<h3>Fixed-size (in tokens) Overlapping Sliding Window</h3>
<p>The Fixed-size sliding window method in tokens is another approach to text chunking. Unlike the character-based method, this approach divides the text into chunks based on the count of tokens produced by the tokenizer, making it more aligned with the way language models operate.</p>
<p>In this method, the size of the context is more precisely controlled, as it works on tokens rather than characters. A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. Counting tokens makes cutting words in the middle somewhat less likely than counting characters, but the problem still persists: the method can still sever sentences or thoughts in the middle, disrupting the flow of information. Moreover, like the character-based method, this approach does not take semantics into account. There is no guarantee that a chunk accurately captures a unique thought or idea, so chunks may be semantically inconsistent.</p>
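<p>A sketch of the token-based window. The tokenizer is passed in as a parameter; the whitespace tokenizer used in the demo is only a stand-in - in practice you would plug in a real BPE tokenizer such as tiktoken, as LangChain's <code>from_tiktoken_encoder()</code> does.</p>

```python
from typing import Callable


def token_window_chunks(
    text: str,
    chunk_size: int,
    overlap: int,
    tokenize: Callable[[str], list[str]],
    detokenize: Callable[[list[str]], str],
) -> list[str]:
    """Slide a fixed-size window over the token sequence rather than
    over raw characters, so chunk sizes line up with LLM context limits."""
    tokens = tokenize(text)
    step = chunk_size - overlap
    return [detokenize(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


# Demo with a naive whitespace tokenizer (a stand-in for a real BPE tokenizer).
chunks = token_window_chunks(
    "one two three four five six",
    chunk_size=3,
    overlap=1,
    tokenize=str.split,
    detokenize=" ".join,
)
```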
<h4>Where to Use It</h4>
<p>The use cases are similar to those of the fixed-size window based on character count, with one difference: when the count is based on tokens, the method works better for tasks where we are limited by the LLM context size.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>More precise control over LLM context size as it operates on tokens, not characters.</li>
<li>Still relatively easy to implement</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Can still sever sentences or thoughts in the middle</li>
<li>Does not take semantics into account, hence no guarantee that a chunk accurately captures a unique thought or idea</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>For exploratory, initial work with LLMs</li>
<li>Not recommended for tasks requiring a deep understanding of the semantics and context of the text, like sentiment analysis or text summarization</li>
</ul>
<p><a id="recursive-structure-aware-splitting"></a></p>
<h3>Recursive Structure Aware Splitting</h3>
<p>Recursive Structure-Aware Splitting is a hybrid approach to text chunking, combining elements of the fixed-size sliding window method and the structure-aware splitting method. This method attempts to create chunks of approximately fixed sizes, either in characters or tokens, while also trying to preserve the original units of text such as words, sentences, or paragraphs.</p>
<p>In this method, the text is recursively split using various separators such as paragraph breaks ("\n\n"), new lines ("\n"), or spaces (" "), moving to the next level of granularity only when necessary. This allows the method to balance the need for a fixed chunk size with the desire to respect the natural linguistic boundaries of the text.</p>
<p>The major advantage of this method is its flexibility. It provides more precise control over context size compared to fixed-size methods, while also ensuring that semantic units of text are not unnecessarily severed.</p>
<p>However, this method also has its drawbacks. The complexity of implementation is higher due to the recursive nature of the splitting. There's also the risk of ending up with chunks of highly variable sizes, especially with texts of varying structural complexity.</p>
<blockquote>
<p>NOTE: <a href="https://www.langchain.com/">LangChain</a> provides an implementation of this method: <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">Recursively split</a></p>
</blockquote>
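<p>The recursive idea can be sketched in a few lines of plain Python. This is a deliberate simplification of what LangChain implements (it drops the separators from the output and does not merge small pieces back up toward the target size): try the coarsest separator first, and recurse with finer separators only on pieces that are still too long.</p>

```python
def recursive_split(text: str, max_len: int, seps=("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer
    separators only on pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = seps
    out = []
    for piece in (p for p in text.split(head) if p):
        out.extend(recursive_split(piece, max_len, tuple(rest)))
    return out
```

<p>Paragraphs that already fit are kept whole; only oversized pieces get broken down at the next level of granularity, which is exactly the balance this method aims for.</p>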
<h4>Where to Use It</h4>
<p>Recursive Structure Aware Splitting is particularly useful in tasks where both the granularity of tokens and the preservation of semantic integrity are crucial. This includes tasks such as text summarization, sentiment analysis, and document classification.</p>
<p>However, due to its complexity, it might not be the best fit for tasks that require quick and simple text chunking, or for tasks involving texts with inconsistent or unclear structural divisions.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Balances the need for fixed chunk sizes with the preservation of natural linguistic boundaries</li>
<li>Provides more precise control over the context size</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Higher complexity of implementation due to the recursive nature of the splitting</li>
<li>Risk of ending up with chunks of highly variable sizes</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Useful in tasks where both the granularity of tokens and the preservation of semantic integrity are crucial, such as text summarization, sentiment analysis, and document classification</li>
<li>Not recommended for tasks requiring quick and simple text chunking, or tasks involving texts with inconsistent or unclear structural divisions</li>
</ul>
<p><a id="structure-aware-splitting-by-sentence-paragraph-section-chapter"></a></p>
<h3>Structure Aware Splitting (by Sentence, Paragraph, Section, Chapter)</h3>
<p>Structure Aware Splitting is an advanced approach to text chunking, which takes into account the inherent structure of the text. Instead of using a fixed-size window, this method divides the text into chunks based on its natural divisions such as sentences, paragraphs, sections, or chapters.</p>
<p>This method is particularly beneficial as it respects the natural linguistic boundaries of the text, ensuring that words, sentences, and thoughts are not cut in the middle. This aids in preserving the semantic integrity of the information within each chunk.</p>
<p>However, this method does have certain limitations. Handling text of varying structural complexity can be challenging: some texts do not have clearly defined sections or chapters, e.g. text extracted from OCR output, unformatted speech-to-text transcripts, or text extracted from tables. Also, while it is more semantically aware than the fixed-size methods, it still doesn't guarantee perfect semantic consistency within chunks, especially for larger structural units like sections or chapters.</p>
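<p>A minimal structure-aware sketch using regular expressions to split by paragraph and then by sentence. The sentence pattern is naive; a production system should use a proper sentence segmenter (e.g. NLTK or spaCy).</p>

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Paragraphs are blocks separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence split after ., ! or ? followed by whitespace.
    Abbreviations like "e.g." will break this - use a real segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
```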
<h4>Where to Use It</h4>
<p>Structure Aware Splitting is highly effective for tasks that require a good understanding of the context and semantics of the text. It is particularly useful for text summarization, sentiment analysis, and document classification tasks.</p>
<p>However, it might not be the best fit for tasks involving texts that lack defined structural divisions, or for tasks that require a finer granularity, such as word-level Named Entity Recognition (NER).</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Respects natural linguistic boundaries, avoiding severing words, sentences, or thoughts</li>
<li>Preserves the semantic integrity of information within each chunk</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Challenging to handle text with varying structural complexity</li>
<li>Does not guarantee perfect semantic consistency within chunks, especially for larger structural units</li>
<li>No control over chunk size - chunks from a given document might vary significantly in size</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Effective for tasks requiring good understanding of context and semantics, such as text summarization, sentiment analysis, and document classification</li>
<li>Not recommended for tasks involving texts that lack defined structural divisions, or tasks needing finer granularity, like word-level NER</li>
</ul>
<p><a id="nlp-chunking-tracking-topic-changes"></a></p>
<h3>NLP Chunking: Tracking Topic Changes</h3>
<p>NLP Chunking with Topic Tracking is a sophisticated approach to text chunking. This method divides the text into chunks based on semantic understanding, specifically by detecting significant shifts in the topics of sentences. If the topic of a sentence significantly differs from the topic of the previous chunk, this sentence is considered the beginning of a new chunk.</p>
<p>This method has the distinct advantage of maintaining semantic consistency within each chunk. By tracking the changes in topics, this method ensures that each chunk is semantically distinct from the others, thereby capturing the inherent structure and meaning of the text.</p>
<p>However, this method is not without its challenges. It requires advanced NLP techniques to accurately detect topic shifts, which adds to the complexity of implementation. Additionally, the accuracy of chunking heavily depends on the effectiveness of the topic modeling and detection techniques used.</p>
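<p>A toy illustration of the idea (not a real topic model): here, word overlap between a sentence and the current chunk stands in for topic similarity, and the threshold is an arbitrary assumption. In practice you would compare sentence embeddings or topic-model outputs instead.</p>

```python
def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity between two sets of words."""
    return len(a & b) / len(a | b) if a | b else 1.0


def topic_chunks(sentences: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Start a new chunk when a sentence's word overlap with the
    current chunk falls below `threshold` - a crude stand-in for
    detecting a topic shift."""
    chunks: list[list[str]] = []
    chunk_words: set[str] = set()
    for sent in sentences:
        words = set(sent.lower().split())
        if chunks and jaccard(words, chunk_words) >= threshold:
            chunks[-1].append(sent)
            chunk_words |= words
        else:
            chunks.append([sent])
            chunk_words = words
    return chunks
```

<p>Swapping the Jaccard score for cosine similarity between sentence embeddings turns this toy into the semantic-chunking approach discussed above, at the cost of the heavier NLP machinery.</p>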
<h4>Where to Use It</h4>
<p>NLP Chunking with Topic Tracking is highly effective for tasks that require an understanding of the semantic context and topic continuity. It is particularly useful for text summarization, sentiment analysis, and document classification tasks.</p>
<p>This method might not be the best fit for tasks involving texts that have a high degree of topic overlap or for tasks that require simple text chunking without the need for deep semantic understanding.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Maintains semantic consistency within each chunk</li>
<li>Captures the inherent structure and meaning of the text by tracking topic changes</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires advanced NLP techniques, increasing the complexity of implementation</li>
<li>The accuracy of chunking heavily depends on the effectiveness of the topic modeling and detection techniques used</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Highly effective for tasks requiring semantic context and topic continuity, such as text summarization, sentiment analysis, and document classification</li>
<li>Not recommended for tasks involving texts with high degrees of topic overlap or tasks requiring simple text chunking without the need for deep semantic understanding</li>
</ul>
<p><a id="content-aware-splitting-markdown-latex-html"></a></p>
<h3>Content-Aware Splitting (Markdown, LaTeX, HTML)</h3>
<p>Content-Aware Splitting is a method of text chunking that focuses on the type and structure of the content, particularly in structured documents like those written in Markdown, LaTeX, or HTML. This method identifies and respects the inherent structure and divisions of the content, such as headings, code blocks, and tables, to create distinct chunks.</p>
<p>The primary advantage of this method is that it ensures different types of content are not mixed within a single chunk. For instance, a chunk containing a code block will not also contain a part of a table. This helps maintain the integrity and context of the content within each chunk.</p>
<p>However, this method also presents certain challenges. It requires understanding and parsing the specific syntax of the structured document format, which can increase the complexity of implementation. Moreover, it might not be suitable for documents that lack clear structural divisions or those written in plain text without any specific format.</p>
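<p>A minimal content-aware sketch for Markdown: split the document into sections at top- and second-level headings. This is simplified - a robust implementation must also track fenced code blocks, so that a <code>#</code> inside a code fence is not mistaken for a heading.</p>

```python
import re


def markdown_sections(md: str) -> list[str]:
    """Split a Markdown document into sections at #/## heading lines,
    keeping each heading together with the body that follows it."""
    parts = re.split(r"(?m)^(?=#{1,2} )", md)
    return [p.strip() for p in parts if p.strip()]
```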
<h4>Where to Use It</h4>
<p>Content Aware Splitting is especially useful when dealing with structured documents or content with clear formatting, such as technical documentation, academic papers, or web pages. It helps ensure that the chunks created are meaningful and contextually consistent.</p>
<p>However, this method might not be the best fit for unstructured or plain text documents, or for tasks that do not require a deep understanding of the content structure.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Ensures different types of content are not mixed within a single chunk</li>
<li>Respects and maintains the integrity and context of the content within each chunk</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires understanding and parsing the specific syntax of the structured document format</li>
<li>Might not be suitable for unstructured or plain text documents</li>
</ul>
<p><strong>Where to Use It:</strong></p>
<ul>
<li>Particularly useful for structured documents or content with clear formatting, such as technical documentation, academic papers, or web pages</li>
<li>Not recommended for unstructured or plain text documents, or tasks that do not require a deep understanding of the content structure</li>
</ul>
<p><a id="adding-extra-context-to-the-chunk-metadata-summaries"></a></p>
<h3>Adding Extra Context to the Chunk (metadata, summaries)</h3>
<p>Adding extra context to the chunks in the form of metadata or summaries can significantly enhance the value of each chunk and improve the overall understanding of the text[^3]. Here are two strategies:</p>
<h4>Adding Metadata to Each Chunk</h4>
<p>This strategy involves adding relevant metadata to each chunk. Metadata could include information such as the source of the text, the author, the date of publication, or even data about the content of the chunk itself, like its topic or keywords. This extra context can provide valuable insights and make the chunks more meaningful and easier to analyze.</p>
<blockquote>
<p>NOTE: For chunks that are vectorized using text embeddings, be aware that vector databases typically allow storing metadata alongside the embedding vectors.</p>
</blockquote>
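<p>A small sketch of this strategy (the field names are illustrative): wrap each chunk with document-level metadata plus its position - the same shape a vector database would store as the payload next to the embedding.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(chunks: list[str], source: str, author: str) -> list[Chunk]:
    """Wrap raw text chunks with document-level metadata plus a
    per-chunk index; the dict can later be used for filtering."""
    return [
        Chunk(text, {"source": source, "author": author, "chunk_index": i})
        for i, text in enumerate(chunks)
    ]
```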
<p><strong>Pros:</strong></p>
<ul>
<li>Provides additional information about each chunk</li>
<li>Enhances the value of each chunk, making them more meaningful and easier to analyze</li>
<li>Can help produce more effective embeddings by anchoring the chunk in its broader context</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires additional processing to generate and attach the metadata</li>
<li>The usefulness of the metadata depends on its relevance and accuracy</li>
</ul>
<p><strong>Where to Use It:</strong></p>
<ul>
<li>Especially useful in tasks that involve analyzing the origin, authorship, or content of the chunks, such as text classification, document clustering, or information retrieval</li>
<li>Can be used to filter the sources used to provide context to LLMs.</li>
</ul>
<p>You can get an intuition of what is possible from the llama_index documentation on metadata extraction and usage: <a href="https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor.html">Metadata Extraction Usage Pattern - LlamaIndex 🦙 0.9.30</a></p>
<h4>Passing on Chunk Summaries</h4>
<p>In this strategy, each chunk is summarized, and that summary is passed on to the next chunk. This method provides a 'running context' that can enhance the understanding of the text and maintain the continuity of information.</p>
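<p>A toy sketch of the running-context idea. The <code>summarize</code> function here is a naive first-sentence stand-in; in practice an LLM or a summarization model would produce the summary.</p>

```python
def summarize(text: str) -> str:
    """Naive stand-in for a real summarizer: take the first sentence."""
    return text.split(".")[0].strip() + "."


def chunks_with_running_context(chunks: list[str]) -> list[str]:
    """Prepend each chunk (after the first) with a summary of the
    previous chunk, preserving continuity of information."""
    out = []
    prev_summary = ""
    for chunk in chunks:
        out.append(f"[Context: {prev_summary}]\n{chunk}" if prev_summary else chunk)
        prev_summary = summarize(chunk)
    return out
```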
<p><strong>Pros:</strong></p>
<ul>
<li>Enhances the understanding of the text by maintaining a running context</li>
<li>Helps to ensure the continuity of information across chunks</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires advanced NLP techniques to generate accurate and meaningful summaries</li>
<li>The effectiveness of this method depends on the quality of the summaries</li>
</ul>
<p><strong>Where to Use It:</strong></p>
<ul>
<li>Particularly useful in tasks where understanding the continuity and context of the text is crucial, such as text summarization or reading comprehension tasks</li>
</ul>
<h4>Other Experimental Strategies for Adding Context to the Chunks</h4>
<ol>
<li>
<p><strong>Keyword Tagging:</strong> This method involves identifying and tagging the most important keywords or phrases in each chunk. These tags then serve as a quick reference or summary of the chunk's content. Advanced NLP techniques can be used to identify these keywords based on their relevance and frequency.</p>
</li>
<li>
<p><strong>Sentiment Analysis:</strong> For text that contains opinions or reviews, performing sentiment analysis on each chunk and attaching the sentiment score (positive, negative, neutral) as metadata can provide valuable context. This can be particularly useful in tasks such as customer feedback analysis or social media monitoring.</p>
</li>
<li>
<p><strong>Entity Recognition:</strong> Applying Named Entity Recognition (NER) techniques to each chunk can identify and label entities such as names of people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. This entity information can be added to each chunk, providing valuable context, especially in tasks like information extraction or knowledge graph construction.</p>
</li>
<li>
<p><strong>Topic Classification:</strong> Each chunk can be classified into one or more topics using machine learning or NLP techniques. This topic label can provide a quick understanding of what each chunk is about, adding valuable context, especially for tasks like document classification or recommendation.</p>
</li>
<li>
<p><strong>Chunk Linking:</strong> This method involves creating links between related chunks based on shared keywords, entities, or topics. These links can provide a 'map' of the content, showing how different chunks relate to each other. This can be particularly useful in tasks involving large and complex texts, where understanding the overall structure and relations between different parts is important.</p>
</li>
</ol>
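<p>As a toy illustration of chunk linking (strategy 5), the sketch below links chunks that share at least a couple of content words; a real system would link on shared entities or topics instead, and the stopword list here is a minimal placeholder.</p>

```python
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def keywords(text: str) -> set[str]:
    """Content words of a chunk: lowercase tokens minus stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def link_chunks(chunks: list[str], min_shared: int = 2) -> list[tuple[int, int]]:
    """Return index pairs of chunks sharing at least `min_shared`
    content words - a crude 'map' of related content."""
    kw = [keywords(c) for c in chunks]
    return [
        (i, j)
        for i, j in combinations(range(len(chunks)), 2)
        if len(kw[i] & kw[j]) >= min_shared
    ]
```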
<h2>Conclusions</h2>
<p>In the field of Natural Language Processing, text chunking emerges as a powerful technique that significantly enhances the performance of semantic search and language models. By breaking down text into manageable, contextually relevant chunks, we can ensure more accurate and meaningful search results.</p>
<p>The choice of chunking method, whether it's fixed-size, structure-aware, or NLP chunking, depends on the specific requirements of the use case and application. Each method has its own strengths and limitations, and understanding these is crucial to implementing an effective chunking strategy.</p>
<p>Moreover, adding extra context to the chunks, such as metadata or summaries, can further enhance the value of each chunk and improve the overall understanding of the text. Experimental strategies like keyword tagging, sentiment analysis, entity recognition, topic classification, and chunk linking offer promising avenues for further exploration.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em>
<a id="references"></a></p>
<h2>References</h2>
<ul>
<li>[^1] <a href="https://blog.abacus.ai/blog/2023/08/10/create-your-custom-chatgpt-pick-the-best-llm-that-works-for-you/">Create a CustomGPT And Supercharge your Company with AI - Pick the Best LLM - The Abacus.AI Blog</a></li>
<li>[^2] <a href="https://www.pinecone.io/learn/chunking-strategies/">Chunking Strategies for LLM Applications | Pinecone</a></li>
<li>[^3] <a href="https://actalyst.medium.com/optimize-llm-enterprise-applications-through-embeddings-and-chunking-strategy-1bbdb03bedae">Optimize LLM Enterprise Applications through Embeddings and Chunking Strategy. | by Actalyst | Aug, 2023 | Medium</a></li>
<li>[^4] <a href="https://vectara.com/grounded-generation-done-right-chunking/">Retrieval Augmented Generation (RAG) Done Right: Chunking - Vectara</a> (NLP chunking, compare chunking strategies) + <a href="https://github.com/vectara/example-notebooks/blob/main/notebooks/chunking-demo.ipynb">notebook</a></li>
</ul>
<p><a id="further-reading"></a></p>
<h2>Further Reading</h2>
<ul>
<li><a href="https://medium.com/aimonks/simple-guide-to-text-chunking-for-your-llm-applications-bddfe8ad7892">Simple guide to Text Chunking for Your LLM Applications | by NoCode AI | 𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨 | Medium</a></li>
<li><a href="https://arxiv.org/abs/2307.03172">[2307.03172] Lost in the Middle: How Language Models Use Long Contexts</a></li>
<li><a href="https://community.openai.com/t/the-length-of-the-embedding-contents/111471">The length of the embedding contents - API - OpenAI Developer Forum</a></li>
<li><a href="https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1">Building RAG-based LLM Applications for Production (Part 1)</a></li>
<li>expanding context, hierarchical search, ...: <a href="https://reframe.is/wiki/Effects-of-Chunk-Sizes-on-Retrieval-Augmented-Generation-RAG-Applications-8b728c36d005434dba39ad19be9b82cc/">Effects of Chunk Sizes on Retrieval Augmented Generation (RAG) Applications</a></li>
<li><a href="https://dl.acm.org/doi/10.1007/s10579-013-9250-3">A novel method for performance evaluation of text chunking | Language Resources and Evaluation</a></li>
<li><a href="https://www.mattambrogi.com/posts/chunk-size-matters/">Matt Ambrogi</a> "Chunk Size Matters"</li>
<li><a href="https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5">Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex | by Ravi Theja | Oct, 2023 | LlamaIndex Blog</a></li>
<li>short (4min 25s) overview of chunking methods from Weaviate: <a href="https://www.youtube.com/watch?v=h5id4erwD4s">Chunking Methods to use Custom Data with LLMs</a></li>
<li><a href="https://www.youtube.com/watch?v=8OJC21T2SL4">The 5 Levels Of Text Splitting For Retrieval - YouTube</a> (Fixed Size Chunking, Recursive Chunking, Document Based Chunking, <strong>Semantic Chunking</strong>, Agentic Chunking - a chunking strategy that explores using an LLM to decide how much and which text should go into a chunk based on context) + <a href="https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb">notebook</a></li>
<li>Visualization of chunking - <a href="https://chunkviz.up.railway.app/">ChunkViz</a></li>
</ul>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-11-06 - added reference: Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex</li>
<li>2023-11-13 - added video from Weaviate</li>
</ul>
<p>X::<a href="https://www.safjan.com/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/">From Fixed-Size to NLP Chunking - A Deep Dive into Text Chunking Techniques</a></p>Criticism of the Lean Startup2023-09-04T00:00:00+02:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-09-04:/criticism-of-the-lean-startup/<p>X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a>
X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a></p>
<p>The Lean Startup method is still considered a valuable and relevant approach to launching and managing startups. However, it's important to recognize that the business and entrepreneurial landscape is dynamic, and the applicability of …</p><p>X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a>
X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a></p>
<p>The Lean Startup method is still considered a valuable and relevant approach to launching and managing startups. However, it's important to recognize that the business and entrepreneurial landscape is dynamic, and the applicability of any methodology can evolve over time.</p>
<p>The Lean Startup method, popularized by Eric Ries, emphasizes a systematic and iterative approach to building and scaling a startup by validating assumptions, minimizing waste, and staying agile. Many principles of the Lean Startup, such as customer-centricity, rapid experimentation, and continuous learning, remain highly relevant in today's business environment.</p>
<p>However, there are some criticisms and challenges associated with the Lean Startup method, including:</p>
<ol>
<li>
<p><strong>Oversimplification</strong>: Critics argue that the Lean Startup method can sometimes oversimplify the complexity of building a successful business. While it encourages rapid experimentation, it may not address all the intricacies and industry-specific nuances that startups may encounter.</p>
</li>
<li>
<p><strong>Overemphasis on MVP (Minimum Viable Product)</strong>: Some argue that an overemphasis on building MVPs can lead to premature scaling or neglecting long-term vision and product quality. In some industries, especially those requiring substantial upfront investment or regulatory compliance, an MVP might not be appropriate.</p>
</li>
<li>
<p><strong>Bias Toward Tech Startups</strong>: The Lean Startup method was initially designed with tech startups in mind and may not be as applicable to businesses in other industries, such as healthcare, biotech, or manufacturing, which have longer development cycles and higher regulatory barriers.</p>
</li>
<li>
<p><strong>Market Saturation</strong>: In some markets, especially in technology hubs like Silicon Valley, there's a concern that the Lean Startup method has led to an oversaturation of similar ideas and startups, making it more challenging for any single company to stand out.</p>
</li>
<li>
<p><strong>Evolving Landscape</strong>: As technology and business landscapes evolve, new methodologies and approaches may emerge that complement or surpass the Lean Startup method. For example, concepts like <a href="https://www.safjan.com/design-thinking/">Design Thinking</a>, <a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a> and <a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a> have gained traction in recent years.</p>
</li>
</ol>
<p>To assess the current validity and relevance of the Lean Startup method, it's essential to consider the specific context, industry, and maturity of your startup. While the core principles of customer-centricity, iteration, and learning remain valuable, startups should also be open to adapting and combining methodologies based on their unique circumstances and challenges. Additionally, staying updated with the latest trends and methodologies in entrepreneurship is crucial to making informed decisions.</p>Design Thinking2023-09-04T00:00:00+02:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-09-04:/design-thinking/<p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a></p>
<p>X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a>
X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a></p>
<p>Design thinking is a human-centered and problem-solving approach to innovation and product development that has gained significant traction in the business world in recent years. It places a …</p><p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a></p>
<p>X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a>
X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a></p>
<p>Design thinking is a human-centered and problem-solving approach to innovation and product development that has gained significant traction in the business world in recent years. It places a strong emphasis on empathy, creativity, and iterative processes to tackle complex problems and create user-centric solutions. Here's a comprehensive exploration of design thinking in the context of business and product development:</p>
<p><strong>Introduction to Design Thinking:</strong> Design thinking is a methodology that originated in the world of design but has since transcended its origins to become a widely adopted approach in various industries, including technology, healthcare, finance, and more. At its core, design thinking is about understanding and addressing the needs of users or customers by fostering a deep sense of empathy, engaging in creative problem-solving, and iterating on solutions to continuously improve them.</p>
<p><strong>Key Principles of Design Thinking:</strong></p>
<ol>
<li>
<p><strong>Empathy:</strong> Design thinking starts with empathizing with the end-users or customers to gain a deep understanding of their needs, desires, and pain points. This empathetic approach helps teams uncover insights that might not be apparent through traditional data analysis.</p>
</li>
<li>
<p><strong>Define:</strong> Once user needs are understood, the next step is to define the problem clearly and succinctly. This step involves synthesizing the information gathered during the empathy phase to create a user-centered problem statement.</p>
</li>
<li>
<p><strong>Ideate:</strong> In this phase, teams brainstorm and generate a wide range of potential solutions without judgment. It's a creative and often collaborative process that encourages thinking outside the box.</p>
</li>
<li>
<p><strong>Prototype:</strong> Prototyping involves creating low-fidelity representations of the proposed solutions. These prototypes can be anything from simple sketches to interactive mock-ups, depending on the context. The goal is to quickly visualize and test ideas.</p>
</li>
<li>
<p><strong>Test:</strong> The testing phase involves gathering feedback from users by exposing them to the prototypes. This feedback loop allows teams to refine and improve their solutions based on real-world insights.</p>
</li>
</ol>
<p><strong>Benefits of Design Thinking in Business/Product Development:</strong></p>
<ol>
<li>
<p><strong>User-Centric Innovation:</strong> Design thinking places the user at the center of the development process, leading to products and services that genuinely meet user needs and preferences.</p>
</li>
<li>
<p><strong>Enhanced Creativity:</strong> By encouraging ideation without constraints in the early stages, design thinking fosters creative thinking, which can lead to breakthrough solutions.</p>
</li>
<li>
<p><strong>Reduced Risk:</strong> Iterative testing and prototyping help identify and address issues early in the development process, reducing the risk of costly mistakes later on.</p>
</li>
<li>
<p><strong>Improved Collaboration:</strong> Design thinking often involves cross-functional teams collaborating to solve problems, breaking down silos and fostering a culture of cooperation.</p>
</li>
<li>
<p><strong>Adaptability:</strong> The iterative nature of design thinking allows businesses to adapt to changing circumstances and emerging trends more effectively.</p>
</li>
</ol>
<p><strong>Real-World Examples:</strong></p>
<p>Numerous successful companies have embraced design thinking to drive innovation and improve their products and services. For instance:</p>
<ul>
<li>
<p><strong>Apple:</strong> Apple is renowned for its commitment to user-centric design. Products like the iPhone and MacBook exemplify how design thinking has been instrumental in creating highly intuitive and visually appealing devices.</p>
</li>
<li>
<p><strong>IBM:</strong> IBM's design thinking transformation has led to the creation of IBM Design Studios, which apply design thinking principles to a wide range of projects, from software development to organizational strategy.</p>
</li>
<li>
<p><strong>Airbnb:</strong> Airbnb uses design thinking to create memorable experiences for its users. The platform continuously iterates on its website and app to enhance user satisfaction.</p>
</li>
</ul>
<p><strong>Conclusion:</strong></p>
<p>In today's fast-paced and ever-changing business landscape, design thinking offers a structured yet flexible approach to innovation and problem-solving. By prioritizing empathy, creativity, and iterative development, organizations can create products and services that resonate with users, drive growth, and stay adaptable in an increasingly competitive marketplace. As design thinking continues to evolve, it remains a valuable methodology for businesses seeking to stay customer-focused and innovative.</p>Problems with Langchain and how to minimize their impact2023-09-01T00:00:00+02:002023-10-19T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-01:/problems-with-Langchain-and-how-to-minimize-their-impact/<p>Beyond the Hype - LangChain's Hidden Flaws and How to Master AI Development.</p><h2>Introduction</h2>
<p><a href="https://docs.langchain.com/docs/">LangChain</a>, a popular framework for building applications with <a href="https://en.wikipedia.org/wiki/Large_language_model">large language models</a> (LLMs), has been touted as a game-changer in the world of AI-driven development. However, as more users dive into the library and its capabilities, some have found that it falls short of expectations. In this section, we'll discuss ten issues with LangChain that have left users underwhelmed and questioning its value proposition.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#problems">Problems</a><ul>
<li><a href="#1-overly-complex-and-unnecessary-abstractions">1. Overly complex and unnecessary abstractions</a></li>
<li><a href="#2-easy-breakable-and-unreliable">2. Easily breakable and unreliable</a></li>
<li><a href="#3-poor-documentation">3. Poor documentation</a></li>
<li><a href="#4-a-high-level-of-abstraction-hinders-customization">4. A high level of abstraction hinders customization</a></li>
<li><a href="#5-inefficient-token-usage">5. Inefficient token usage</a></li>
<li><a href="#6-difficult-integration-with-existing-tools">6. Difficult integration with existing tools</a></li>
<li><a href="#7-limited-value-proposition">7. Limited value proposition</a></li>
<li><a href="#8-inconsistent-behavior-and-hidden-details">8. Inconsistent behavior and hidden details</a></li>
<li><a href="#9-better-alternatives-available">9. Better alternatives available</a></li>
<li><a href="#10-primarily-optimized-for-demos">10. Primarily optimized for demos</a></li>
</ul>
</li>
<li><a href="#takeaways---how-to-use-the-langchain-right-way">Takeaways - How to Use the LangChain Right Way?</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="problems"></a></p>
<h2>Problems</h2>
<p><a id="1-overly-complex-and-unnecessary-abstractions"></a></p>
<h3>1. Overly complex and unnecessary abstractions</h3>
<p>LangChain has been criticized for having too many layers of abstraction, making it difficult to understand and modify the underlying code. These layers can lead to confusion, especially for those who are new to LLMs or LangChain itself. The complexity can also make it challenging to adapt the library to specific use cases or integrate it with existing tools and scripts. In some cases, users have found that they can achieve their goals more easily by using simpler, more straightforward code.</p>
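<p>For comparison, a framework-free call to a chat-completion HTTP API can be just a few lines. The sketch below uses only the Python standard library; the endpoint and payload follow the OpenAI chat-completions format, and the model name and key placeholder are illustrative:</p>

```python
# A framework-free sketch: building a chat-completion HTTP request directly
# with the standard library. Endpoint and payload follow the OpenAI chat
# format; the model name and key placeholder below are illustrative.
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "gpt-3.5-turbo") -> urllib.request.Request:
    """Build (but do not send) a chat-completion request for one user prompt."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # replace with a real key
        },
        method="POST",
    )

# To actually call the API (network access and a valid key required):
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     answer = json.load(resp)["choices"][0]["message"]["content"]
```

<p>With the request construction this explicit, there are no hidden defaults: every parameter sent to the model is visible in your own code.</p>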
<p><a id="2-easy-breakable-and-unreliable"></a></p>
<h3>2. Easily breakable and unreliable</h3>
<p>Some users have found LangChain to be unreliable and difficult to fix due to its complex structure. The framework's fragility can lead to unexpected issues in production systems, making it challenging to maintain and scale applications built with LangChain. Users have reported that the deeper and more complex their application becomes, the more LangChain seems to become a risk to its maintainability.</p>
<p><a id="3-poor-documentation"></a></p>
<h3>3. Poor documentation</h3>
<p>LangChain's documentation has been described as confusing and lacking in key details, making it challenging for users to fully understand the library's capabilities and limitations. The documentation often omits explanations of default parameters and important details, leaving users to piece together information from various sources. This lack of clarity can hinder users' ability to effectively leverage LangChain in their projects.</p>
<p><a id="4-a-high-level-of-abstraction-hinders-customization"></a></p>
<h3>4. A high level of abstraction hinders customization</h3>
<p>Users have reported that LangChain's high level of abstraction makes it difficult to modify and adapt the library for specific use cases. This can be particularly problematic when users want to make small changes to the default behavior of LangChain or integrate it with other tools and scripts. In these cases, users may find it easier to bypass LangChain altogether and build their own solutions from scratch.</p>
<p><a id="5-inefficient-token-usage"></a></p>
<h3>5. Inefficient token usage</h3>
<p>LangChain has been criticized for inefficient token usage in its API calls, which can result in higher costs. This can be particularly problematic for users who are trying to minimize their expenses while working with LLMs. Some users have found that they can achieve better results with fewer tokens by using custom Python code or other alternative libraries.</p>
<p><a id="6-difficult-integration-with-existing-tools"></a></p>
<h3>6. Difficult integration with existing tools</h3>
<p>Users have reported difficulties integrating LangChain with their existing Python tools and scripts. This can be especially challenging for those who have complex analytics or other advanced functionality built into their applications. The high level of abstraction in LangChain can make it difficult to interface with these existing tools, forcing users to build workarounds or abandon LangChain in favor of more compatible solutions.</p>
<p><a id="7-limited-value-proposition"></a></p>
<h3>7. Limited value proposition</h3>
<p>Some users feel that LangChain does not provide enough value compared to the effort required to implement and maintain it. They argue that the library's primary use case is to quickly create demos or prototypes, rather than building production-ready applications. In these cases, users may find it more efficient to build their own solutions or explore alternative libraries that offer a better balance of ease of use and functionality.</p>
<p><a id="8-inconsistent-behavior-and-hidden-details"></a></p>
<h3>8. Inconsistent behavior and hidden details</h3>
<p>LangChain has been criticized for hiding important details and having inconsistent behavior, which can lead to unexpected issues in production systems. Users have reported that LangChain's default settings and behaviors are often undocumented or poorly explained, making it difficult to predict how the library will behave in different scenarios. This lack of transparency can lead to frustration and wasted time troubleshooting issues that could have been avoided with better documentation.</p>
<p><a id="9-better-alternatives-available"></a></p>
<h3>9. Better alternatives available</h3>
<p>Users have mentioned other libraries, such as <a href="https://github.com/microsoft/semantic-kernel">Semantic Kernel</a>, <a href="https://github.com/jerryjliu/llama_index">LlamaIndex</a>, <a href="https://haystack.deepset.ai/">Deepset Haystack</a>, or <a href="https://github.com/TransformerOptimus/SuperAGI">SuperAGI</a>, as more suitable alternatives to LangChain. These alternatives often provide clearer documentation, more flexible customization options, and better integration with existing tools and scripts. In some cases, users have found that they can achieve their goals more easily and efficiently by using these alternative libraries instead of LangChain. See <a href="https://github.com/kyrolabs/awesome-langchain#other-llm-frameworks">awesome-langchain</a> for a list of LLM frameworks.</p>
<p><a id="10-primarily-optimized-for-demos"></a></p>
<h3>10. Primarily optimized for demos</h3>
<p>LangChain has been described as being primarily optimized for quickly creating demos, rather than for building production-ready applications. <a href="https://blog.streamlit.io/langchain-streamlit/">Partnership</a> with <a href="https://streamlit.io/generative-ai?ref=blog.streamlit.io">Streamlit</a> should ease demo creation even more. While this can be useful for those who want to quickly experiment with LLMs or showcase their ideas, it can be limiting for users who want to build more robust, scalable applications. In these cases, users may find that LangChain's focus on demos and prototypes hinders their ability to build high-quality, production-ready applications.</p>
<p><a id="takeaways---how-to-use-the-langchain-right-way"></a></p>
<h2>Takeaways - How to Use the LangChain Right Way?</h2>
<p>Based on the community comments and experiences shared, here are some pieces of advice on how to create apps using LangChain that will be reliable, easy to maintain and debug:</p>
<ol>
<li>
<p><strong>Use LangChain for prototyping and experimentation</strong>: LangChain can be useful for quickly creating prototypes and validating ideas. However, for more complex and production-level applications, you might want to consider implementing the functionality you need yourself.</p>
</li>
<li>
<p><strong>Understand the underlying concepts</strong>: Before using LangChain, make sure to understand the core concepts of LLMs, prompts, and how the different components of the framework interact. This will help you make informed decisions about which parts of LangChain to use and which to replace with custom implementations.</p>
</li>
<li>
<p><strong>Focus on the value of the ecosystem</strong>: LangChain provides integrations with various tools, indexes, and prompt templates. Leverage these resources to build your application, but be aware of the limitations and potential issues that might arise from using the default settings and abstractions.</p>
</li>
<li>
<p><strong>Be prepared to write custom code</strong>: LangChain might not cover all use cases or provide the level of control and customization you need for your application. Be prepared to write custom code to better suit your specific requirements and use case.</p>
</li>
<li>
<p><strong>Keep an eye on alternative tools and libraries</strong>: As the field of LLMs is rapidly evolving, new tools and libraries are being developed that might better suit your needs. Stay informed about the latest developments and consider using alternative libraries like <a href="https://haystack.deepset.ai/">Deepset Haystack</a>, <a href="https://github.com/stanfordnlp/dspy">DSPy</a>, or Microsoft tools like <a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/">semantic-kernel</a> and <a href="https://github.com/microsoft/autogen">AutoGen</a> if they better align with your project requirements. The <a href="https://github.com/kyrolabs/awesome-langchain#other-llm-frameworks">list</a> is huge and growing!</p>
</li>
<li>
<p><strong>Learn from LangChain's source code</strong>: If you find that LangChain's abstractions and documentation are not sufficient for your needs, you can learn from the source code itself. Use the provided prompts and implementation details as inspiration and adapt them to your own project.</p>
</li>
<li>
<p><strong>Consider local LLM models</strong>: While LangChain primarily focuses on using OpenAI's models, you might want to explore using local LLM models like Llama, Galpaca, Vicuna, or Koala. These models can offer benefits in terms of cost, privacy, and offline capabilities. However, be aware that they might not be as powerful or accurate as GPT-3.5 Turbo.</p>
</li>
<li>
<p><strong>Integrate with existing tools and scripts</strong>: If you need to interface with existing Python tools or scripts, make sure to understand how LangChain interacts with them and how you can best integrate them into your application.</p>
</li>
<li>
<p><strong>Test and measure the performance of your application</strong>: When using LangChain, ensure that you thoroughly test your application and measure its performance against different prompts and configurations. This will help you identify potential issues and areas for improvement.</p>
</li>
<li>
<p><strong>Keep an eye on the costs</strong>: Be mindful of the API costs associated with using LangChain and consider optimizing your application to reduce the number of API calls and tokens used.</p>
</li>
</ol>
<p>My favourite item from this list is #6 - learning from the tools and techniques LangChain implements by reading its source code.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Before adopting LangChain, it's vital to acknowledge its limitations and challenges. Although LangChain has garnered significant attention and investment, users have pinpointed various drawbacks that could impede its effectiveness in more intricate, production-ready applications. Developers should understand these issues to make well-informed decisions about LangChain's suitability for their projects.</p>
<p>In the ever-evolving landscape of LLM-driven development, assessing the available tools and libraries is crucial to determining which aligns best with your specific needs and requirements. The ideal solution might not yet exist, which may require adapting or customizing existing tools, or even building your own, to realize your vision for AI-driven applications.</p>
<p><strong>edits:</strong></p>
<ul>
<li>2023-10-19: Added AutoGen and semantic-kernel, removed GPTi,</li>
<li>2023-10-19: Added link to list of alternative frameworks</li>
</ul>Jaro-Winkler Similarity2023-08-29T00:00:00+02:002023-08-29T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-29:/jaro-winkler-similarity/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#jaro-winkler-similarity">Jaro-Winkler Similarity</a></li>
<li><a href="#python-example">Python Example:</a></li>
<li><a href="#valuable-properties-of-jaro-winkler-similarity">Valuable Properties of Jaro-Winkler Similarity:</a></li>
<li><a href="#recommendations-for-usage">Recommendations for Usage:</a></li>
<li><a href="#cases-to-consider-alternatives">Cases to Consider Alternatives:</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="jaro-winkler-similarity"></a></p>
<h2>Jaro-Winkler Similarity</h2>
<p>Jaro-Winkler similarity is designed to compare two strings, giving more weight to the common prefix of the strings. The formula for Jaro-Winkler similarity is …</p><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#jaro-winkler-similarity">Jaro-Winkler Similarity</a></li>
<li><a href="#python-example">Python Example:</a></li>
<li><a href="#valuable-properties-of-jaro-winkler-similarity">Valuable Properties of Jaro-Winkler Similarity:</a></li>
<li><a href="#recommendations-for-usage">Recommendations for Usage:</a></li>
<li><a href="#cases-to-consider-alternatives">Cases to Consider Alternatives:</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="jaro-winkler-similarity"></a></p>
<h2>Jaro-Winkler Similarity</h2>
<p>Jaro-Winkler similarity is designed to compare two strings, giving more weight to the common prefix of the strings. The formula for Jaro-Winkler similarity is:</p>
<div class="math">$$
JW(s1, s2) = J(s1, s2) + L \cdot p \cdot (1 - J(s1, s2))
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(J(s1, s2)\)</span> is the Jaro similarity between strings (s1) and (s2).</li>
<li><span class="math">\(L\)</span> is the length of the common prefix between the strings (capped at 4 characters in Winkler's original definition).</li>
<li><span class="math">\(p\)</span> is a constant scaling factor (typically 0.1) that increases the similarity for strings that share a common prefix.</li>
</ul>
<p>The Jaro similarity <span class="math">\(J(s1, s2)\)</span> is calculated as:</p>
<div class="math">$$
J(s1, s2) = \begin{cases} 0 &amp; \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{\text{len}(s1)} + \frac{m}{\text{len}(s2)} + \frac{m - t}{m}\right) &amp; \text{otherwise} \end{cases}
$$</div>
<p>
Where:</p>
<ul>
<li><span class="math">\(m\)</span> is the number of matching characters</li>
<li><span class="math">\(t\)</span> is the number of transpositions, i.e. half the number of matching characters that appear in a different order in the two strings</li>
</ul>
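<p>As a worked example, take the classic pair <em>MARTHA</em> / <em>MARHTA</em>: there are <span class="math">\(m = 6\)</span> matching characters, of which <em>TH</em>/<em>HT</em> are out of order, giving one transposition (<span class="math">\(t = 1\)</span>). The Jaro similarity is therefore <span class="math">\(J = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) \approx 0.944\)</span>. The strings share the prefix <em>MAR</em>, so with <span class="math">\(L = 3\)</span> and <span class="math">\(p = 0.1\)</span>, <span class="math">\(JW \approx 0.944 + 3 \cdot 0.1 \cdot (1 - 0.944) \approx 0.961\)</span>.</p>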
<p><a id="python-example"></a></p>
<h3>Python Example</h3>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">jaro_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">):</span>
<span class="n">len_s1</span><span class="p">,</span> <span class="n">len_s2</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span>
<span class="n">match_distance</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">len_s1</span><span class="p">,</span> <span class="n">len_s2</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">common_chars_s1</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">common_chars_s2</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">s1</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">i</span> <span class="o">-</span> <span class="n">match_distance</span><span class="p">)</span>
<span class="n">end</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">match_distance</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">len_s2</span><span class="p">)</span>
<span class="k">if</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">s2</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]:</span>
<span class="n">common_chars_s1</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
            <span class="n">common_chars_s2</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">s2</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">][</span><span class="n">s2</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">char</span><span class="p">)])</span>
    <span class="n">m</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">common_chars_s1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">m</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="mf">0.0</span>
    <span class="n">transpositions</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span> <span class="k">for</span> <span class="n">c1</span><span class="p">,</span> <span class="n">c2</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">common_chars_s1</span><span class="p">,</span> <span class="n">common_chars_s2</span><span class="p">))</span> <span class="o">//</span> <span class="mi">2</span>
    <span class="n">jaro_similarity</span> <span class="o">=</span> <span class="p">(</span><span class="n">m</span> <span class="o">/</span> <span class="n">len_s1</span> <span class="o">+</span> <span class="n">m</span> <span class="o">/</span> <span class="n">len_s2</span> <span class="o">+</span> <span class="p">(</span><span class="n">m</span> <span class="o">-</span> <span class="n">transpositions</span><span class="p">)</span> <span class="o">/</span> <span class="n">m</span><span class="p">)</span> <span class="o">/</span> <span class="mi">3</span>
    <span class="k">return</span> <span class="n">jaro_similarity</span>
<span class="k">def</span> <span class="nf">jaro_winkler_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
    <span class="n">jaro_sim</span> <span class="o">=</span> <span class="n">jaro_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">)</span>
    <span class="n">common_prefix_len</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">c1</span><span class="p">,</span> <span class="n">c2</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">)):</span>
        <span class="k">if</span> <span class="n">c1</span> <span class="o">==</span> <span class="n">c2</span><span class="p">:</span>
            <span class="n">common_prefix_len</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">break</span>
    <span class="n">jaro_winkler_sim</span> <span class="o">=</span> <span class="n">jaro_sim</span> <span class="o">+</span> <span class="p">(</span><span class="n">common_prefix_len</span> <span class="o">*</span> <span class="n">p</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">jaro_sim</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">jaro_winkler_sim</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">string1</span> <span class="o">=</span> <span class="s2">"apple"</span>
<span class="n">string2</span> <span class="o">=</span> <span class="s2">"applet"</span>
<span class="n">jw_similarity</span> <span class="o">=</span> <span class="n">jaro_winkler_similarity</span><span class="p">(</span><span class="n">string1</span><span class="p">,</span> <span class="n">string2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Jaro-Winkler Similarity:"</span><span class="p">,</span> <span class="n">jw_similarity</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Jaro-Winkler Similarity: 0.9722222222222223
</code></pre></div>
<p>The Jaro-Winkler similarity metric has several properties that make it well suited to specific use cases; no single similarity metric, however, is best for every scenario. Below are its most valuable properties, recommendations for when to use it, and cases where another metric may be a better fit.
<a id="valuable-properties-of-jaro-winkler-similarity"></a></p>
<h3>Valuable Properties of Jaro-Winkler Similarity</h3>
<ol>
<li>
<p><strong>String Comparison with Common Prefix:</strong> The Jaro-Winkler metric gives higher weight to common prefixes, making it effective for comparing strings that often have a prefix or abbreviation. This is particularly useful for names and addresses.</p>
</li>
<li>
<p><strong>Adjustable Scaling Factor:</strong> The Jaro-Winkler metric allows for tuning the impact of the common prefix on the similarity score using the scaling factor <span class="math">\(p\)</span>. This allows you to emphasize or de-emphasize the common prefix based on your needs.</p>
</li>
<li>
<p><strong>Simple to Understand and Implement:</strong> The calculation of Jaro-Winkler similarity involves straightforward string matching and prefix length consideration, making it relatively easy to implement and understand.</p>
</li>
</ol>
<p><a id="recommendations-for-usage"></a></p>
<h3>Recommendations for Usage</h3>
<ol>
<li>
<p><strong>Names and Addresses:</strong> Jaro-Winkler similarity is highly recommended when comparing names, addresses, and other strings with common prefixes or abbreviations. It's often used in record linkage, deduplication, and fuzzy matching tasks in databases.</p>
</li>
<li>
<p><strong>Fuzzy String Matching:</strong> When dealing with noisy or misspelled data, the Jaro-Winkler metric can be effective in finding approximate matches. It's suitable for scenarios where small typographical errors or variations are common.</p>
</li>
<li>
<p><strong>Short Texts:</strong> Jaro-Winkler is well-suited for comparing short texts like product names, usernames, and titles, where the common prefix is an important aspect of similarity.</p>
</li>
</ol>
<p><a id="cases-to-consider-alternatives"></a></p>
<h3>Cases to Consider Alternatives</h3>
<ol>
<li>
<p><strong>Long Texts:</strong> For comparing long texts or documents, <strong>cosine similarity</strong> or <strong>Jaccard similarity</strong> of term frequencies might be more appropriate, as they consider the distribution of terms across the entire text.</p>
</li>
<li>
<p><strong>Semantic Similarity:</strong> If you're interested in capturing semantic meaning rather than character-level similarity, <strong>word embeddings</strong>-based metrics like cosine similarity between vector representations might be more suitable.</p>
</li>
<li>
<p><strong>Numerical Data:</strong> For comparing numerical data, other similarity metrics such as <strong>Euclidean distance</strong>, <strong>Manhattan distance</strong>, or <strong>Pearson correlation coefficient</strong> might be more meaningful.</p>
</li>
<li>
<p><strong>Customized Weights:</strong> If you have specific domain knowledge about feature importance, you might opt for a customized similarity metric that incorporates these weights effectively.</p>
</li>
<li>
<p><strong>Language-Specific Features:</strong> If the text includes language-specific features, phonetic differences, or linguistic nuances, other specialized metrics like <strong>Soundex</strong> or <strong>Levenshtein distance</strong> might be considered.</p>
</li>
</ol>
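<p>To make the "long texts" alternative concrete, here is a minimal sketch of cosine similarity over term-frequency vectors, using only the Python standard library. Tokenization by lowercased whitespace splitting is a simplifying assumption; real pipelines typically normalize punctuation and may weight terms with TF-IDF:</p>

```python
from collections import Counter
import math

def cosine_term_similarity(text1, text2):
    # Build term-frequency vectors from whitespace tokens (illustrative sketch)
    tf1 = Counter(text1.lower().split())
    tf2 = Counter(text2.lower().split())
    # Dot product over the shared vocabulary
    dot = sum(tf1[term] * tf2[term] for term in set(tf1) & set(tf2))
    # Product of the Euclidean norms of both vectors
    norm = math.sqrt(sum(v * v for v in tf1.values())) * math.sqrt(sum(v * v for v in tf2.values()))
    return dot / norm if norm else 0.0
```

<p>Unlike Jaro-Winkler, this score depends on shared vocabulary rather than character positions, so it scales naturally to documents of any length.</p>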
<h2>Examples</h2>
<p>Here are some concrete pairs of strings that demonstrate the properties of the Jaro-Winkler similarity metric (<span class="math">\(p=0.2\)</span> unless stated otherwise):</p>
<p><strong>Common Prefix Emphasis:</strong></p>
<ul>
<li>String 1: "Michael"</li>
<li>String 2: "Michelle"</li>
<li>Jaro-Winkler similarity: 0.963</li>
</ul>
<p>Explanation: The common prefix "Mich" contributes significantly to the similarity score in Jaro-Winkler, resulting in a high similarity even though the rest of the strings differ.</p>
<p><strong>Case Sensitivity and Scaling Factor:</strong></p>
<ul>
<li>String 1: "McDonald's"</li>
<li>String 2: "Mcdonells"</li>
<li>Jaro-Winkler similarity: 0.853</li>
</ul>
<p>Explanation: Because of the case difference ("D" vs. "d"), the shared prefix is only "Mc", which limits the prefix bonus; the metric is case-sensitive, so normalizing case beforehand usually raises the score. The scaling factor then controls how strongly this prefix affects the result.</p>
<p><strong>No Common Prefix:</strong></p>
<ul>
<li>String 1: "hello"</li>
<li>String 2: "world"</li>
<li>Jaro-Winkler similarity: 0.433</li>
</ul>
<p>Explanation: Without a common prefix, the Jaro-Winkler similarity is low, even if the strings share some characters.</p>
<p><strong>Short vs. Long Strings:</strong></p>
<ul>
<li>String 1: "AI"</li>
<li>String 2: "Artificial Intelligence"</li>
<li>Jaro-Winkler similarity: 0.623</li>
</ul>
<p>Explanation: Only the one-character common prefix "A" ties the two strings together, and the large length difference keeps the overall similarity moderate.</p>
<p><strong>Typographical Errors:</strong></p>
<ul>
<li>String 1: "telephone"</li>
<li>String 2: "telephne"</li>
<li>Jaro-Winkler similarity: 0.967</li>
</ul>
<p>Explanation: Despite the missing "o," the common prefix "teleph" contributes to a high Jaro-Winkler similarity score.</p>
<p><strong>Short and Noisy Data:</strong></p>
<ul>
<li>String 1: "abacus"</li>
<li>String 2: "abaxus"</li>
<li>Jaro-Winkler similarity: 0.956</li>
</ul>
<p>Explanation: The common prefix "aba" keeps the score high, while the single mismatched character ("c" vs. "x") is penalized only lightly.</p>
<p><strong>Significance of Scaling Factor:</strong></p>
<ul>
<li>String 1: "Thompson"</li>
<li>String 2: "Thomson"</li>
<li>Jaro-Winkler similarity with <span class="math">\(p=0.1\)</span>: 0.975</li>
<li>Jaro-Winkler similarity with <span class="math">\(p=0.2\)</span>: 0.992</li>
</ul>
<p>Explanation: The scaling factor <span class="math">\(p\)</span> affects the similarity score. A higher <span class="math">\(p\)</span> gives more emphasis to the common prefix, leading to a higher similarity.</p>
<p>These examples illustrate how the Jaro-Winkler similarity metric behaves based on different characteristics of input strings, such as common prefixes, case sensitivity, typos, length, and the scaling factor <span class="math">\(p\)</span>.</p>
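<p>The trends above can be sanity-checked with a compact, self-contained implementation. This sketch uses the standard two-pass matching procedure and caps the prefix bonus at four characters (a common convention that the simplified code earlier in this post omits), so some scores may differ slightly from the figures quoted in the examples:</p>

```python
def jaro(s1, s2):
    # Jaro similarity via the standard two-pass matching procedure
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    # Pass 1: find matching characters within the allowed window
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Pass 2: count transpositions among the matched characters
    chars1 = [c for c, f in zip(s1, match1) if f]
    chars2 = [c for c, f in zip(s2, match2) if f]
    transpositions = sum(a != b for a, b in zip(chars1, chars2)) // 2
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    # Boost the Jaro score by the common-prefix length, capped at 4 characters
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)
```

<p>With this sketch, jaro_winkler("Michael", "Michelle") scores well above jaro_winkler("hello", "world"), raising <span class="math">\(p\)</span> from 0.1 to 0.2 increases the "Thompson"/"Thomson" score, and the <span class="math">\(p=0.1\)</span> value for that pair reproduces the 0.975 quoted above.</p>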
<h2>Summary</h2>
<p>Jaro-Winkler similarity is highly valuable when dealing with short strings, names, and addresses, especially when common prefixes play a significant role. However, for longer texts, semantic similarity, numerical data, and specialized linguistic considerations, other metrics might be more appropriate. Always consider the specific characteristics of your data and the goals of your analysis when choosing a similarity metric.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bearer Token Authentication for API2023-08-24T00:00:00+02:002023-08-24T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-24:/bearer-token-authentication-for-api/<h2>Bearer Token Authentication</h2>
<p>Bearer authentication is a method of API authentication that involves including a "bearer token" in the request header. This token is typically a long string of characters, often encoded in a specific format like JSON Web Token (JWT) or …</p><h2>Bearer Token Authentication</h2>
<p>Bearer authentication is a method of API authentication that involves including a "bearer token" in the request header. This token is typically a long string of characters, often encoded in a specific format like JSON Web Token (JWT) or OAuth token. Bearer authentication is commonly used to secure APIs by allowing only authorized users or applications to access protected resources.</p>
<p>Here's how the process generally works:</p>
<ol>
<li>
<p><strong>Authentication</strong>: The user or application requests access to a protected resource by sending a request to the API server.</p>
</li>
<li>
<p><strong>Token Generation</strong>: Upon successful authentication, the server generates a bearer token, which serves as proof of the user's or application's identity and permissions.</p>
</li>
<li>
<p><strong>Token Inclusion</strong>: The generated bearer token is then included in the "Authorization" header of subsequent requests to the API. The header typically looks like this:</p>
</li>
</ol>
<p><code>Authorization: Bearer &lt;token&gt;</code></p>
<p>Here, <code>&lt;token&gt;</code> represents the actual bearer token.</p>
<ol start="4">
<li>
<p><strong>Authorization</strong>: The API server receives the request and extracts the bearer token from the header. It then validates the token to determine if the user or application is authorized to access the requested resource.</p>
</li>
<li>
<p><strong>Access Control</strong>: If the bearer token is valid and the user or application has the necessary permissions, the API server grants access to the requested resource. If the token is invalid or expired, the server denies access.</p>
</li>
</ol>
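<p>As a minimal sketch, the header from step 3 can be attached using only Python's standard library. The URL and token below are placeholders, not a real endpoint or credential:</p>

```python
import urllib.request

def build_authenticated_request(url, token):
    # Attach the bearer token in the Authorization header of the request
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    return request

# Placeholder endpoint and token; a real client obtains the token from the
# authentication step, then sends the request with urllib.request.urlopen(request)
request = build_authenticated_request("https://api.example.com/resource", "example-token")
```

<p>Sending this request over HTTPS transmits the header exactly as shown in step 3, with the token encrypted in transit.</p>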
<p>Bearer authentication is often preferred due to its simplicity and ease of implementation. It allows the server to validate the token without needing to store any session information, making it suitable for stateless architectures like RESTful APIs. However, securing bearer tokens is crucial since anyone in possession of a valid token can access the associated resources. This is why HTTPS and token encryption are recommended to protect the token during transmission.</p>
<p><strong>NOTE</strong>: Bearer tokens should be handled carefully. They can be exposed if not properly secured, and their use should be combined with other security measures, such as <strong>rate limiting</strong>, <strong>token expiration</strong>, and regular <strong>token rotation</strong>, to enhance the overall security of an API.</p>
<h2>Token Encryption</h2>
<p>Token encryption plays a crucial role in securing bearer tokens used for API authentication. Encrypting bearer tokens ensures that the token's content remains confidential and tamper-proof while it's being transmitted or stored. Here's an overview of how token encryption works:</p>
<ol>
<li>
<p><strong>Token Content</strong>: Bearer tokens often contain important information such as user identity, permissions, and expiration time. This information should be protected from unauthorized access.</p>
</li>
<li>
<p><strong>Choose Encryption Algorithm</strong>: A strong encryption algorithm is selected for securing the token. Common choices include AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman).</p>
</li>
<li>
<p><strong>Generate Encryption Keys</strong>: Encryption requires keys: a public key for encryption and a private key for decryption (in the case of asymmetric encryption like RSA) or a shared key (in the case of symmetric encryption like AES). These keys must be kept secret.</p>
</li>
<li>
<p><strong>Encryption Process</strong>:</p>
<ul>
<li>
<p><strong>Asymmetric Encryption (e.g., RSA)</strong>: If using asymmetric encryption, the sender uses the recipient's public key to encrypt the token. Only the recipient possessing the corresponding private key can decrypt and access the original token.</p>
</li>
<li>
<p><strong>Symmetric Encryption (e.g., AES)</strong>: In symmetric encryption, both the sender and receiver share the same secret key. The sender uses this key to encrypt the token, and the recipient uses the same key to decrypt it.</p>
</li>
</ul>
</li>
<li>
<p><strong>Transmission</strong>: The encrypted token can now be safely transmitted over the network. Even if intercepted by malicious actors, the encrypted content should be meaningless without the decryption key.</p>
</li>
<li>
<p><strong>Decryption Process</strong>:</p>
<ul>
<li>
<p><strong>Asymmetric Encryption (e.g., RSA)</strong>: The recipient uses their private key to decrypt the token, revealing its original content.</p>
</li>
<li>
<p><strong>Symmetric Encryption (e.g., AES)</strong>: The recipient uses the shared secret key to decrypt the token and access its original content.</p>
</li>
</ul>
</li>
</ol>
<p>Encryption adds an additional layer of security to bearer tokens. Even if an attacker gains access to the encrypted token, they won't be able to decipher its contents without the appropriate decryption key.</p>
<p>It's important to note a few considerations:</p>
<ul>
<li>
<p><strong>Key Management</strong>: The security of encrypted tokens depends heavily on proper key management. Keys should be stored securely and rotated periodically.</p>
</li>
<li>
<p><strong>Algorithm and Key Length</strong>: The choice of encryption algorithm and key length impacts the security of the encrypted token. Strong algorithms with sufficient key lengths should be used.</p>
</li>
<li>
<p><strong>HTTPS</strong>: While encryption protects the token in transit, using HTTPS (TLS/SSL) for communication further ensures the confidentiality and integrity of the entire data exchange, including the token.</p>
</li>
<li>
<p><strong>Token Validation</strong>: Even when using encrypted tokens, the receiving server must still validate the decrypted token to ensure its authenticity, integrity, and authorization.</p>
</li>
</ul>
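<p>A closely related and widely used pattern is signing rather than encrypting: a JWT-style token protected by HMAC-SHA256 guarantees integrity and authenticity, though not confidentiality. The sketch below uses only the standard library to illustrate the validation step; a production system should use a maintained JWT library rather than this hand-rolled version:</p>

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    # URL-safe base64 without padding, as used in JWTs
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict, secret: bytes) -> str:
    # Encode header and payload, then sign both with HMAC-SHA256
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signature = _b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{signature}"

def verify_token(token: str, secret: bytes):
    # Recompute the signature and compare; return the payload, or None if invalid
    try:
        header, body, signature = token.split(".")
    except ValueError:
        return None
    expected = _b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        return None
    padded = body + "=" * (-len(body) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

<p>Note the use of <code>hmac.compare_digest</code> instead of <code>==</code>, which avoids leaking timing information when comparing signatures.</p>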
<p>Combining token encryption with other security practices, such as secure token storage and token expiration, provides a comprehensive approach to securing bearer tokens and API authentication.</p>Understanding Retrieval-Augmented Generation (RAG) empowering LLMs2023-08-24T00:00:00+02:002023-10-23T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-24:/understanding-retrieval-augmented-generation-rag-empowering-llms/<p>Understand an innovative artificial intelligence framework that empowers large language models (LLMs) by anchoring them to external knowledge sources with accurate, current information.</p><h2>TLDR</h2>
<p>Retrieval augmented generation refers to the method of enhancing a user's input to a large language model (LLM) such as ChatGPT by incorporating extra information obtained from an external source. This additional data can then be utilized by the LLM to enrich the response it produces.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction-understanding-retrieval-augmented-generation-rag">Introduction: Understanding Retrieval-Augmented Generation (RAG)</a></li>
<li><a href="#the-need-for-rag-in-large-language-models">The Need for RAG in Large Language Models</a></li>
<li><a href="#the-open-book-approach-of-rag">The 'Open Book' Approach of RAG</a></li>
<li><a href="#personalized-and-verifiable-responses-with-rag">Personalized and Verifiable Responses with RAG</a></li>
<li><a href="#challenges-and-future-directions">Challenges and Future Directions</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction-understanding-retrieval-augmented-generation-rag"></a></p>
<h2>Introduction: Understanding Retrieval-Augmented Generation (RAG)</h2>
<p>Retrieval-Augmented Generation, commonly referred to as RAG, and sometimes called Grounded Generation (GG), represents an ingenious integration of a pretrained dense passage retriever (DPR) and <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> models.</p>
<blockquote>
<p>Transformer architecture (used in GPT models) is a member of sequence-to-sequence (Seq2Seq) architectures. Seq2Seq models are designed to handle tasks that involve transforming an input sequence into an output sequence, such as machine translation, text summarization, and dialogue generation.</p>
</blockquote>
<p>The process involves retrieving documents using DPR and subsequently transmitting them to a seq2seq model. Through a process of marginalization, these models then produce desired outputs. The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks. <strong>This innovative artificial intelligence framework serves as a means to empower large language models (LLMs) by anchoring them to external knowledge sources.</strong> Consequently, this strategic approach ensures the availability of accurate, current information, thereby granting users valuable insights into the generative mechanisms of these models. For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.</p>
<p><img alt="Data processing in RAG" src="https://learn.microsoft.com/en-us/azure/machine-learning/media/concept-retrieval-augmented-generation/retrieval-augmented-generation-walkthrough.png?view=azureml-api-2#lightbox"></p>
<p>Figure 1. Data processing, storage and referencing in RAG method. Source: <a href="https://learn.microsoft.com/en-us/azure/machine-learning/concept-retrieval-augmented-generation?view=azureml-api-2">Microsoft</a></p>
<p><a id="the-need-for-rag-in-large-language-models"></a></p>
<h2>The Need for RAG in Large Language Models</h2>
<p>Large language models, while powerful, can sometimes be inconsistent in their responses. They may provide accurate answers to certain questions but struggle with others, often regurgitating random facts from their training data. This inconsistency stems from the fact that LLMs understand the statistical relationships between words but not their actual meanings.</p>
<p>To address this issue, researchers have developed the RAG <strong>framework, which improves the quality of LLM-generated responses by grounding the model in external sources of knowledge.</strong> This approach not only ensures access to the most current and reliable facts but also allows users to verify the model's claims for accuracy and trustworthiness.</p>
<p><a id="the-open-book-approach-of-rag"></a></p>
<h2>The 'Open Book' Approach of RAG</h2>
<p>RAG operates in <strong>two main phases: retrieval and content generation</strong>. During the retrieval phase, algorithms search for and retrieve relevant snippets of information based on the user's prompt or question. These facts can come from various sources, such as indexed documents on the internet or a closed-domain enterprise setting for added security and reliability.</p>
<p>In the generative phase, the LLM uses the retrieved information and its internal representation of training data to synthesize a tailored answer for the user.</p>
<blockquote>
<p>This approach is akin to an "open book" exam, where the model can browse through content in a book rather than relying solely on its memory.</p>
</blockquote>
<p><img alt="RAG Operation" src="/images/retrieval_augmented_generation/RAG.png">
Figure 2. RAG operation. Information preparation and storage. Augmenting prompt with external information.</p>
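<p>The two phases can be sketched in a few lines. The keyword-overlap retriever below is a deliberately simple stand-in for a dense retriever such as DPR, and the prompt format is an illustrative assumption:</p>

```python
def retrieve(question, documents, top_k=2):
    # Toy keyword-overlap retriever standing in for a dense retriever (illustrative only)
    q_terms = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def build_augmented_prompt(question, documents, top_k=2):
    # Retrieval phase: fetch relevant snippets; generation phase would pass this prompt to the LLM
    context = "\n".join(retrieve(question, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

<p>In a production pipeline the retriever would query a vector index and the augmented prompt would go to the LLM, but the shape of the flow is the same: retrieve, assemble context, generate.</p>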
<p><a id="personalized-and-verifiable-responses-with-rag"></a></p>
<h2>Personalized and Verifiable Responses with RAG</h2>
<p>RAG allows LLM-powered chatbots to provide more personalized answers without the need for human-written scripts. By reducing the need to continuously train the model on new data, RAG can lower the computational and financial costs of running LLM-powered chatbots in an enterprise setting.</p>
<p>Moreover, RAG enables LLMs to generate more specific, diverse, and factual language compared to traditional parametric-only seq2seq models. This feature is particularly useful for businesses that require up-to-date information and verifiable responses.</p>
<p><a id="challenges-and-future-directions"></a></p>
<h2>Challenges and Future Directions</h2>
<p>Despite its advantages, RAG is not without its challenges. For instance, <strong>LLMs may struggle to recognize when they don't know the answer</strong> to a question, leading to incorrect or misleading information. To address this issue, researchers are working on fine-tuning LLMs to recognize unanswerable questions and probe for more detail until they can provide a definitive answer.</p>
<p>Furthermore, there is ongoing research to improve both the retrieval and generation aspects of RAG. This includes <strong>finding and fetching the most relevant information possible and structuring that information</strong> to elicit the richest responses from the LLM.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Retrieval-Augmented Generation offers a promising solution to the limitations of large language models by grounding them in external knowledge sources. By adopting RAG, businesses can achieve customized solutions, maintain data relevance, and optimize costs while harnessing the reasoning capabilities of LLMs. As research continues to advance in this area, we can expect even more powerful and efficient language models in the future.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><a id="references"></a></p>
<h2>References</h2>
<ul>
<li>Original paper <a href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a> by Patrick Lewis et al. (available as <a href="https://paperswithcode.com/method/rag">paper with code</a>)</li>
<li>Example notebooks on Amazon SageMaker:<ul>
<li><a href="https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/jumpstart-foundation-models/question_answering_retrieval_augmented_generation/question_answering_jumpstart_knn.html">Retrieval-Augmented Generation: Question Answering based on Custom Dataset</a></li>
<li><a href="https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/jumpstart-foundation-models/question_answering_retrieval_augmented_generation/question_answering_langchain_jumpstart.html">Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced LangChain Library</a></li>
</ul>
</li>
<li>Python library with RAG implementation: <a href="https://github.com/llmware-ai/llmware">GitHub - llmware-ai/llmware: Providing enterprise-grade LLM-based development framework, tools and fine-tuned models.</a></li>
<li>Analytics: <a href="https://www.vectorview.ai/">Vectorview</a></li>
<li>Deep-dive into specific use-case of RAG with scaling in mind: <a href="https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1">Building RAG-based LLM Applications for Production (Part 1)</a></li>
<li>Good section on possible improvements to RAG: <a href="https://llmstack.ai/blog/retrieval-augmented-generation">Retrieval Augmented Generation (RAG): What, Why and How? | LLMStack</a></li>
<li>General intro to RAG: <a href="https://scriv.ai/guides/retrieval-augmented-generation-overview/">How do domain-specific chatbots work? An Overview of Retrieval Augmented Generation (RAG) | Scriv</a></li>
<li>Optimization, async, using summaries: <a href="https://madhukarkumar.medium.com/secrets-to-optimizing-rag-llm-apps-for-better-accuracy-performance-and-lower-cost-da1014127c0a">Secrets to Optimizing RAG LLM Apps for Better Performance, Accuracy and Lower Costs! | by Madhukar Kumar | madhukarkumar | Sep, 2023 | Medium</a></li>
<li>Check the GitHub for the RAG-related projects: <a href="https://github.com/topics/retrieval-augmented-generation?l=python">retrieval-augmented-generation · GitHub Topics</a></li>
<li><a href="https://www.reddit.com/r/LocalLLaMA/comments/16cbimi/yet_another_rag_system_implementation_details_and/">Yet another RAG system - implementation details and lessons learned : r/LocalLLaMA</a></li>
<li><a href="https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/">Building and Evaluating Advanced RAG Applications - DeepLearning.AI</a> - recent course from deeplearning.ai (Andrew Ng). Instructors: Jerry Liu and Anupam Datta.<ul>
<li>In this course, we’ll explore:<ul>
<li>Two advanced retrieval methods: sentence-window retrieval and auto-merging retrieval, which perform better than the baseline RAG pipeline.</li>
<li>Evaluation and experiment tracking: A way to evaluate and iteratively improve your RAG pipeline’s performance.</li>
<li>The RAG triad: Context Relevance, Groundedness, and Answer Relevance, which are methods to evaluate the relevance and truthfulness of your LLM’s response.</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>X::<a href="https://www.safjan.com/techniques-to-boost-rag-performance-in-production/">Techniques to Boost RAG Performance in Production</a></p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-10-23 - added link to LLMStack</li>
<li>2023-11-06 - added TLDR section</li>
<li>2023-11-06 - added ToC</li>
</ul>Create Self-Hosted Python Package Repository - General Guide2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/Create Self-Hosted Python Package Repository/<p>X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>Creating a self-hosted Python package repository allows you to host and manage your own Python packages, making them accessible to your team or the public …</p><p>X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>Creating a self-hosted Python package repository allows you to host and manage your own Python packages, making them accessible to your team or the public without relying on external services like PyPI. Here's a general guide on how to set up a self-hosted Python package repository.</p>
<h2>General Guide</h2>
<h3>Choose a Repository Manager</h3>
<p>You need a repository manager to host and manage your Python packages. Two popular options are:</p>
<ul>
<li><strong>Devpi</strong>: A powerful and customizable Python package repository server.</li>
<li><strong>Artifactory</strong>: A general-purpose repository manager that can host various types of packages, including Python.</li>
</ul>
<h3>Set Up a Server</h3>
<p>You will need a server to host your package repository. This could be a dedicated server, a cloud instance (AWS, GCP, Azure), or even a local machine if the repository is for internal use.</p>
<h3>Install and Configure the Repository Manager</h3>
<h4>Devpi</h4>
<ul>
<li>Install Devpi using pip: <code>pip install devpi-server devpi-web</code></li>
<li>Configure Devpi: Follow the instructions in the <a href="https://devpi.net/docs/devpi/devpi/stable/+doc/quickstart-server.html">official documentation</a>.</li>
</ul>
<h4>Artifactory</h4>
<p>Download and install Artifactory: Follow the instructions on the <a href="https://www.jfrog.com/confluence/display/JFROG/Installing+Artifactory">Artifactory website</a>.</p>
<h3>Create a Virtual Environment (optional but recommended)</h3>
<p>Set up a Python virtual environment on your server to keep your package repository isolated from the system Python.</p>
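<p>As a quick sketch (the directory name is an example), creating and activating such an environment looks like this:</p>

```shell
# create an isolated environment for the repository tooling
# (the path is an example; pick any location you like)
python3 -m venv "$HOME/pkg-repo-venv"

# activate it before installing devpi-server, twine, etc.
. "$HOME/pkg-repo-venv/bin/activate"
```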
<h3>Upload Packages</h3>
<p>Once your repository manager is set up, use tools like <code>twine</code> to upload your Python packages. Make sure to specify your self-hosted repository URL.</p>
<h3>Accessing Packages</h3>
<p>To use packages from your self-hosted repository, users can modify their <code>pip.conf</code> or <code>.pypirc</code> configuration file to include your repository's URL.</p>
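<p>As an illustration, a minimal user-level <code>pip.conf</code> pointing at a self-hosted Devpi index might look like the following (the URL, user, and index names are examples; Devpi serves its simple index under <code>+simple/</code>):</p>

```shell
# write a user-level pip.conf that prefers the self-hosted index
# (back up any existing file first; all names below are examples)
mkdir -p "$HOME/.config/pip"
cat > "$HOME/.config/pip/pip.conf" <<'EOF'
[global]
index-url = http://localhost:3141/myuser/myindex/+simple/
extra-index-url = https://pypi.org/simple/
EOF
```

<p>The <code>extra-index-url</code> keeps the public PyPI available as a fallback for packages you do not host yourself.</p>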
<h3>Security and Access Control</h3>
<p>Configure user authentication and access control to restrict who can upload and access packages in your repository.</p>
<h3>Maintenance and Backup</h3>
<ul>
<li>Regularly back up your package repository data to prevent data loss.</li>
<li>Keep your repository manager and server updated with security patches.</li>
</ul>
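<p>A backup can be as simple as archiving the repository manager's data directory. For Devpi, the server data lives under <code>~/.devpi</code> by default (a sketch; adjust the path if you started the server with <code>--serverdir</code>):</p>

```shell
# archive Devpi's data directory into a dated tarball
SRC="$HOME/.devpi"
mkdir -p "$SRC"   # for illustration only; on a real server this already exists
tar czf "devpi-backup-$(date +%F).tar.gz" -C "$HOME" .devpi
```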
<h3>Documentation</h3>
<p>Provide clear documentation to your team on how to access, upload, and manage packages in your self-hosted repository.</p>
<h2>Artifactory vs. Devpi - pros & cons and setup instructions</h2>
<p>Let's explore two popular options for creating a self-hosted Python package repository: Devpi (free and open source) and Artifactory (which offers a free Community Edition), along with their pros, cons, use cases, and a setup tutorial for each.</p>
<h3>Option 1: Devpi</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Designed specifically for Python package management.</li>
<li>Provides features like caching, replication, and access control.</li>
<li>Supports easy package versioning and management.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Limited support for non-Python packages.</li>
<li>Web interface might not be as polished as other solutions.</li>
</ul>
<p><strong>Use Cases:</strong></p>
<ul>
<li>Small to medium-sized teams working exclusively with Python.</li>
<li>Projects where ease of setup and simple usage is preferred.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li><strong>Install Devpi</strong>:</li>
</ol>
<div class="highlight"><pre><span></span><code>pip install devpi-server devpi-web
</code></pre></div>
<ol>
<li><strong>Create and Configure Devpi Server</strong>:</li>
<li>Initialize a new Devpi server:</li>
</ol>
<div class="highlight"><pre><span></span><code>devpi-server --init
</code></pre></div>
<ul>
<li>Start the Devpi server:</li>
</ul>
<div class="highlight"><pre><span></span><code>devpi-server
</code></pre></div>
<ol>
<li><strong>Create Users and Indexes</strong>:</li>
<li>Create a user:</li>
</ol>
<div class="highlight"><pre><span></span><code>devpi use http://localhost:3141
devpi user -c <username>
</code></pre></div>
<ul>
<li>Create an index:</li>
</ul>
<div class="highlight"><pre><span></span><code>devpi index -c <indexname>
</code></pre></div>
<ol>
<li><strong>Upload and Use Packages</strong>:</li>
<li>Upload a package:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">devpi</span><span class="w"> </span><span class="n">upload</span>
</code></pre></div>
<ul>
<li>Install a package from your Devpi index:</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="o">-</span><span class="nx">i</span><span class="w"> </span><span class="nx">http</span><span class="p">:</span><span class="c1">//localhost:3141/<username>/<indexname>/simple/ <package></span>
</code></pre></div>
<h3>Option 2: Artifactory</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Versatile repository manager supporting multiple package types.</li>
<li>Robust access control and security features.</li>
<li>Highly configurable and scalable.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>More complex setup compared to Devpi.</li>
<li>Heavier resource requirements.</li>
</ul>
<p><strong>Use Cases:</strong></p>
<ul>
<li>Large organizations with diverse technology stacks.</li>
<li>Projects needing advanced access control and security features.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li><strong>Install Artifactory</strong>:</li>
<li>
<p>Follow the installation guide for <a href="https://www.jfrog.com/confluence/display/JFROG/Installing+Artifactory">Artifactory Community Edition</a>.</p>
</li>
<li>
<p><strong>Configure Artifactory</strong>:</p>
</li>
<li>
<p>Access Artifactory's web interface and set up your repository.</p>
</li>
<li>
<p><strong>Create a Virtual Repository</strong>:</p>
</li>
<li>
<p>Create a new virtual repository and include a "PyPI" remote repository as a source.</p>
</li>
<li>
<p><strong>Upload Packages</strong>:</p>
</li>
<li>Use <code>twine</code> to upload your Python packages to your virtual repository:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">twine</span><span class="w"> </span><span class="n">upload</span><span class="w"> </span><span class="o">--</span><span class="n">repository</span><span class="o">-</span><span class="n">url</span><span class="w"> </span><span class="o"><</span><span class="n">Artifactory_URL</span><span class="o">>/<</span><span class="n">repository_name</span><span class="o">></span><span class="w"> </span><span class="n">dist</span><span class="o">/*</span>
</code></pre></div>
<ol>
<li><strong>Access and Use Packages</strong>:</li>
<li>Configure your pip to use your Artifactory repository as an index:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="nx">pip</span><span class="w"> </span><span class="nx">config</span><span class="w"> </span><span class="nx">set</span><span class="w"> </span><span class="nx">global</span><span class="p">.</span><span class="nx">index</span><span class="o">-</span><span class="nx">url</span><span class="w"> </span><span class="p"><</span><span class="nx">Artifactory_URL</span><span class="p">></span><span class="o">/</span><span class="p"><</span><span class="nx">repository_name</span><span class="p">></span><span class="o">/</span><span class="nx">simple</span><span class="o">/</span>
</code></pre></div>
<ul>
<li>Install packages as usual using pip.</li>
</ul>
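<p>As a sketch of the client-side setup (the URL and repository name are examples), the same <code>pip config</code> mechanism can keep the public PyPI as a fallback for packages not hosted in Artifactory:</p>

```shell
# prefer the private Artifactory index, but fall back to public PyPI
# (both URLs are examples; substitute your own Artifactory address)
pip config set global.index-url "https://artifactory.example.com/artifactory/api/pypi/pypi-local/simple"
pip config set global.extra-index-url "https://pypi.org/simple"
```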
<h2>Closing thoughts</h2>
<ul>
<li>Setting up a self-hosted Python package repository requires careful consideration of your team's needs and technical expertise. Choose the option that best aligns with your requirements and resources.</li>
<li>Remember, setting up and maintaining a self-hosted package repository requires technical expertise and ongoing maintenance. If you're not experienced with server management and administration, consider starting with a simpler approach or seeking help from someone with relevant experience.</li>
</ul>Cookiecutter alternatives2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/cookiecutter-alternatives/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#alternative-tools-to-cookiecutter-for-scaffolding-projects">Alternative Tools to Cookiecutter for Scaffolding Projects</a><ul>
<li><a href="#1-yeoman">1. <strong>Yeoman</strong></a></li>
<li><a href="#2-hygen">2. <strong>Hygen</strong></a></li>
<li><a href="#3-plop">3. <strong>Plop</strong></a></li>
<li><a href="#4-hyde">4. <strong>Hyde</strong></a></li>
<li><a href="#5-slush">5. <strong>Slush</strong></a></li>
<li><a href="#6-blueprint">6. <strong>Blueprint</strong></a></li>
<li><a href="#7-sao">7. <strong>Sao</strong></a></li>
<li><a href="#8-plopdown">8. <strong>Plopdown</strong></a></li>
<li><a href="#9-jolt">9. <strong>Jolt</strong></a></li>
<li><a href="#10-boilr">10. <strong>Boilr</strong></a></li>
</ul>
</li>
<li><a href="#recommendations-for-various-use-cases">Recommendations for various use-cases</a><ul>
<li><a href="#use-case-1-rapid-prototyping-and-small-projects---plop">Use Case 1: Rapid Prototyping and Small Projects - Plop</a></li>
<li><a href="#use-case-2-large-scale-projects-with-opinionated-conventions---yeoman">Use Case 2: Large-Scale Projects with Opinionated Conventions - Yeoman</a></li>
<li><a href="#use-case-3-advanced-file-processing-and-task-automation---slush">Use Case 3: Advanced File Processing and Task Automation - Slush</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="alternative-tools-to-cookiecutter-for-scaffolding-projects"></a></p>
<h2>Alternative Tools to Cookiecutter for Scaffolding Projects</h2>
<p>Scaffolding tools are essential for accelerating the process of project setup and code generation by providing predefined templates and structures. One of the popular tools for this purpose is Cookiecutter, which allows developers to create projects from project templates. However, the software development ecosystem is diverse, and there are several alternative tools to Cookiecutter, each with unique features and characteristics that differentiate them from one another.</p>
<p>In this article, we will explore ten alternative tools to Cookiecutter and highlight their standout features and best-suited use cases.</p>
<p><a id="1-yeoman"></a></p>
<h3>1. <strong>Yeoman</strong></h3>
<p>Yeoman is a robust scaffolding tool with a vast collection of generators for creating projects across various languages and frameworks, and it benefits from strong community support and an extensive library of community-contributed generators.</p>
<p>Unlike Cookiecutter, Yeoman focuses on a more opinionated approach, meaning it enforces best practices and conventions for specific frameworks, which can speed up development. Additionally, it supports interactive user prompts, making project setup more user-friendly.</p>
<p>Yeoman's wide range of generators and its integration with popular build tools like Grunt and Gulp make it suitable for large-scale projects and complex workflows.</p>
<p><strong>Best Use Case:</strong> Yeoman is best suited for developers who want a structured and opinionated approach to project generation, especially in scenarios where adherence to specific conventions is crucial.</p>
<p><a id="2-hygen"></a></p>
<h3>2. <strong>Hygen</strong></h3>
<p>Hygen is a fast and flexible code generator that allows developers to create custom templates for their projects. It offers a template language with conditional logic and supports both built-in and custom helpers.</p>
<p>Hygen focuses on simplicity and allows developers to create templates using their preferred language, making it highly customizable. Unlike Cookiecutter, which relies on Jinja2 templates, Hygen's customizability extends to both the template language and directory structure.</p>
<p>The ability to generate code snippets and templates quickly and effortlessly makes Hygen ideal for scenarios where rapid prototyping and iterative development are essential.</p>
<p><strong>Best Use Case:</strong> Hygen is best suited for developers who need a lightweight, customizable, and language-agnostic solution for scaffolding projects.</p>
<p><a id="3-plop"></a></p>
<h3>3. <strong>Plop</strong></h3>
<p>Plop is a simple yet powerful micro-generator tool that focuses on creating small and reusable templates. It allows developers to define custom generators with ease, making it a popular choice for smaller projects.</p>
<p>Plop stands out from Cookiecutter due to its minimalistic approach and single-purpose philosophy. Instead of managing complex project structures, Plop concentrates on code generation for specific components or modules.</p>
<p>Plop's ability to create small, self-contained generators with custom logic and prompts is ideal for developers who require lightweight scaffolding tools for repetitive tasks.</p>
<p><strong>Best Use Case:</strong> Plop is best suited for developers who work on component-based architectures and require a quick and straightforward way to generate components, modules, or boilerplate code.</p>
<p><a id="4-hyde"></a></p>
<h3>4. <strong>Hyde</strong></h3>
<p>Hyde is a lightweight scaffolding tool that allows developers to create projects using a simple YAML configuration file. It offers a minimalist approach to project generation, making it easy to use and understand.</p>
<p>Unlike Cookiecutter, which relies on templates and prompts, Hyde uses a declarative configuration file to define project structures. This simplicity enables developers to get started quickly without the need for a dedicated template engine.</p>
<p>Hyde's unique feature lies in its simplicity, making it an excellent choice for small to medium-sized projects and developers who prefer a configuration-driven approach.</p>
<p><strong>Best Use Case:</strong> Hyde is best suited for developers who want a straightforward and lightweight solution for setting up projects without the complexity of template languages.</p>
<p><a id="5-slush"></a></p>
<h3>5. <strong>Slush</strong></h3>
<p>Slush is a streaming scaffolding tool built on top of Gulp.js, providing a pipeline-based approach to project generation. It allows developers to compose complex generators using Gulp plugins, offering powerful customization capabilities.</p>
<p>Unlike Cookiecutter, which operates on static templates, Slush leverages Gulp's streaming capabilities to process files, enabling developers to manipulate and modify the project structure during generation.</p>
<p>Slush's streaming nature and its compatibility with Gulp plugins make it stand out for projects that require advanced file processing and task automation during scaffolding.</p>
<p><strong>Best Use Case:</strong> Slush is best suited for developers who are already familiar with Gulp and need to integrate project generation with complex build processes.</p>
<p><a id="6-blueprint"></a></p>
<h3>6. <strong>Blueprint</strong></h3>
<p>Blueprint is a modern scaffolding tool designed for simplicity and flexibility. It allows developers to create project templates using Handlebars templates, YAML configuration, or JavaScript code, providing multiple options for customizing templates.</p>
<p>Unlike Cookiecutter, which mainly relies on Jinja2 templates, Blueprint gives developers the freedom to choose their preferred template language. It also offers a straightforward CLI interface for generating projects.</p>
<p>Blueprint's versatility and support for various template creation methods make it suitable for developers who have existing Handlebars or JavaScript templates they wish to reuse for project generation.</p>
<p><strong>Best Use Case:</strong> Blueprint is best suited for developers who want a lightweight and flexible scaffolding tool with support for multiple template languages.</p>
<p><a id="7-sao"></a></p>
<h3>7. <strong>Sao</strong></h3>
<p>Sao is a pluggable and customizable scaffolding tool that provides a simple JSON-based template definition. It enables developers to create their own template plugins and extend existing ones seamlessly.</p>
<p>Sao's plugin system and JSON-based templates offer a high level of customization, allowing developers to tailor project generation to their specific requirements without being tied to a specific template language.</p>
<p>The ability to create custom plugins and extend existing templates makes Sao a powerful choice for developers who value modularity and plugin support.</p>
<p><strong>Best Use Case:</strong> Sao is best suited for developers who need a versatile and extensible scaffolding tool with the option to create and share their custom plugins.</p>
<p><a id="8-plopdown"></a></p>
<h3>8. <strong>Plopdown</strong></h3>
<p>Plopdown is a powerful scaffolding tool that generates templates using a JSON configuration file. It offers advanced features like dynamic prompts, glob pattern matching, and custom logic for template generation.</p>
<p>Plopdown's support for dynamic prompts and glob patterns sets it apart from Cookiecutter. It allows developers to generate project files based on complex conditions and patterns, making it suitable for projects with dynamic requirements.</p>
<p>Plopdown's ability to handle dynamic inputs and patterns makes it ideal for projects that require a high degree of customization and flexibility during the scaffolding process.</p>
<p><strong>Best Use Case:</strong> Plopdown is best suited for developers who need a flexible and powerful scaffolding tool capable of handling dynamic inputs and complex project structures.</p>
<p><a id="9-jolt"></a></p>
<h3>9. <strong>Jolt</strong></h3>
<p>Jolt is a lightweight and straightforward scaffolding tool that allows developers to create templates using a concise YAML syntax. It emphasizes minimal configuration and aims to reduce boilerplate code.</p>
<p>Unlike Cookiecutter, which may require extensive configuration, Jolt's YAML syntax simplifies the template creation process, making it a fast and efficient choice for smaller projects.</p>
<p>Jolt's simplicity and focus on reducing boilerplate code make it stand out for quick prototyping and smaller projects with straightforward requirements.</p>
<p><strong>Best Use Case:</strong> Jolt is best suited for developers who prefer a lightweight and minimalistic scaffolding tool for rapid project setup.</p>
<p><a id="10-boilr"></a></p>
<h3>10. <strong>Boilr</strong></h3>
<p>Boilr is a command-line scaffolding tool that utilizes a template registry, allowing developers to share and discover templates easily. It provides a curated list of templates for various languages and frameworks.</p>
<p>Unlike Cookiecutter, Boilr's template registry simplifies the process of finding and using project templates, making it an excellent choice for developers who want a seamless experience with pre-built templates.</p>
<p>Boilr's extensive template registry and its command-line interface make it stand out for its accessibility and ease of use.</p>
<p><strong>Best Use Case:</strong> Boilr is best suited for developers who prefer a command-line tool with access to a wide variety of pre-built templates for different project types.</p>
<p><a id="recommendations-for-various-use-cases"></a></p>
<h2>Recommendations for various use-cases</h2>
<p>Here are three distinct use cases with specific requirements, along with a recommended tool for each:</p>
<p><a id="use-case-1-rapid-prototyping-and-small-projects---plop"></a></p>
<h3>Use Case 1: Rapid Prototyping and Small Projects - Plop</h3>
<p><strong>Requirements:</strong></p>
<ul>
<li>Lightweight and easy-to-use tool.</li>
<li>Minimal configuration and setup.</li>
<li>Ability to quickly generate boilerplate code and components.</li>
</ul>
<p><strong>Recommended Tool: Plop</strong></p>
<p><strong>Reasoning:</strong> Plop is an excellent choice for rapid prototyping and small projects due to its simplicity and focus on generating small, reusable templates. Its straightforward plopfile-based configuration (generators are defined in a small JavaScript file) allows developers to get started quickly without the overhead of extensive setup. Plop's ability to create self-contained generators with custom logic and prompts makes it perfect for generating boilerplate code and components in a fast and efficient manner.</p>
<p><a id="use-case-2-large-scale-projects-with-opinionated-conventions---yeoman"></a></p>
<h3>Use Case 2: Large-Scale Projects with Opinionated Conventions - Yeoman</h3>
<p><strong>Requirements:</strong></p>
<ul>
<li>Strong community support and a wide range of templates.</li>
<li>Ability to enforce best practices and conventions for specific frameworks.</li>
<li>Interactive user prompts for customizable project setups.</li>
</ul>
<p><strong>Recommended Tool: Yeoman</strong></p>
<p><strong>Reasoning:</strong> Yeoman is a powerful scaffolding tool with an extensive library of community-contributed generators, making it suitable for large-scale projects. It enforces opinionated conventions, which is beneficial for maintaining consistency and best practices across the codebase. Yeoman's interactive user prompts make project setup user-friendly, allowing developers to customize the generated code according to their specific requirements.</p>
<p><a id="use-case-3-advanced-file-processing-and-task-automation---slush"></a></p>
<h3>Use Case 3: Advanced File Processing and Task Automation - Slush</h3>
<p><strong>Requirements:</strong></p>
<ul>
<li>Integration with build tools for advanced file processing.</li>
<li>Flexibility to manipulate and modify project structure during generation.</li>
<li>Support for custom plugins and extensibility.</li>
</ul>
<p><strong>Recommended Tool: Slush</strong></p>
<p><strong>Reasoning:</strong> Slush is an ideal choice for projects that require advanced file processing and task automation during scaffolding. Built on top of Gulp.js, Slush leverages Gulp's streaming capabilities to process files, allowing developers to manipulate and modify the project structure during generation. Its pipeline-based approach and compatibility with Gulp plugins provide high flexibility and customization possibilities. Developers who are already familiar with Gulp will find Slush seamless to integrate into their existing build processes.</p>
<p>These recommended tools cater to different use cases, ensuring that developers can find the most suitable scaffolding tool based on their project requirements and preferences.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>While Cookiecutter is a popular choice for scaffolding projects, developers have several alternative tools to consider, each with its own unique features and characteristics. Depending on the project requirements, preferences, and familiarity with specific tools, developers can choose the one that best fits their needs. Whether it's Yeoman's opinionated approach, Plop's focus on micro-generators, or Sao's pluggable architecture, there is a suitable alternative for every scenario. Experimenting with these tools can significantly enhance the development workflow and productivity.</p>Lesser-known Python Package Repository Managers2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/lesser-known-python-package-repository-managers/<p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>The <a href="https://jfrog.com/artifactory/">Artifactory</a> (paid) and <a href="https://devpi.net/docs/devpi/devpi/stable/%2Bd/index.html">Devpi</a> (free, open source) are most widely used python package repository managers, but there are some other interesting projects …</p><p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>The <a href="https://jfrog.com/artifactory/">Artifactory</a> (paid) and <a href="https://devpi.net/docs/devpi/devpi/stable/%2Bd/index.html">Devpi</a> (free, open source) are most widely used python package repository managers, but there are some other interesting projects. Here are a few lesser-known Python package repository managers along with links to their source code or home websites.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#warehouse">Warehouse</a></li>
<li><a href="#pypiserver">pypiserver</a></li>
<li><a href="#bandersnatch">Bandersnatch</a></li>
<li><a href="#eggbasket">EggBasket</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="warehouse"></a></p>
<h2>Warehouse</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pypa/warehouse.svg?logo=github"></p>
<p>Warehouse is the codebase that powers the official Python Package Index (PyPI). While not exactly lesser-known, it is worth mentioning as the reference implementation of a full-featured package index.</p>
<ul>
<li>Source Code: <a href="https://github.com/pypa/warehouse">https://github.com/pypa/warehouse</a></li>
</ul>
<p><a id="pypiserver"></a></p>
<h2>pypiserver</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pypiserver/pypiserver.svg?logo=github"></p>
<p>pypiserver is a minimal Python package server that's easy to set up and use for hosting private packages.</p>
<ul>
<li>Source Code: <a href="https://github.com/pypiserver/pypiserver">https://github.com/pypiserver/pypiserver</a></li>
</ul>
<p><a id="bandersnatch"></a></p>
<h2>Bandersnatch</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pypa/bandersnatch.svg?logo=github"></p>
<p>A PyPI mirror client that can be used to create a complete copy of the Python Package Index (PyPI) locally or in a private network.</p>
<ul>
<li>Home: <a href="https://pypi.org/project/bandersnatch/">https://pypi.org/project/bandersnatch/</a></li>
<li>Source Code: <a href="https://github.com/pypa/bandersnatch">https://github.com/pypa/bandersnatch</a></li>
</ul>
<p><a id="eggbasket"></a></p>
<h2>EggBasket</h2>
<p>EggBasket is a lightweight, easily-configurable Python package server designed for simplicity and ease of use.</p>
<ul>
<li>Home: <a href="https://pypi.org/project/eggbasket/">https://pypi.org/project/eggbasket/</a></li>
</ul>
<p>Please note that the popularity and maintenance status of these repositories may vary, so it's a good idea to review the documentation and GitHub repositories to ensure they meet your requirements before setting up a self-hosted package repository.</p>Split glued or joined words2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/split-glued-or-joined-words/<h2>wordninja package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/keredson/wordninja.svg?logo=github"></p>
<p>install <a href="https://github.com/keredson/wordninja">wordninja</a> package: <code>pip install wordnija</code></p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">wordninja</span>
<span class="o">>>></span> <span class="n">wordninja</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'bettergood'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'better'</span><span class="p">,</span> <span class="s1">'good'</span><span class="p">]</span>
</code></pre></div>
<h2>wordsegment package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/grantjenks/python-wordsegment.svg?logo=github"></p>
<p>install the <a href="https://github.com/grantjenks/python-wordsegment">wordsegment</a> package: <code>pip install wordsegment</code>.</p>
<p>use it programmatically:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">wordsegment</span> <span class="kn">import</span> <span class="n">load</span><span class="p">,</span> <span class="n">segment</span>
<span class="o">>>></span> <span class="n">load</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">segment</span><span class="p">(</span><span class="s1">'thisisatest'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'this'</span><span class="p">,</span> <span class="s1">'is'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">]</span>
</code></pre></div>
<p>or from CLI</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">echo …</span></code></pre></div><h2>wordninja package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/keredson/wordninja.svg?logo=github"></p>
<p>install <a href="https://github.com/keredson/wordninja">wordninja</a> package: <code>pip install wordnija</code></p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">wordninja</span>
<span class="o">>>></span> <span class="n">wordninja</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'bettergood'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'better'</span><span class="p">,</span> <span class="s1">'good'</span><span class="p">]</span>
</code></pre></div>
<h2>wordsegment package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/grantjenks/python-wordsegment.svg?logo=github"></p>
<p>install the <a href="https://github.com/grantjenks/python-wordsegment">wordsegment</a> package: <code>pip install wordsegment</code>.</p>
<p>use it programmatically:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">wordsegment</span> <span class="kn">import</span> <span class="n">load</span><span class="p">,</span> <span class="n">segment</span>
<span class="o">>>></span> <span class="n">load</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">segment</span><span class="p">(</span><span class="s1">'thisisatest'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'this'</span><span class="p">,</span> <span class="s1">'is'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">]</span>
</code></pre></div>
<p>or from CLI</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">echo</span><span class="w"> </span>thisisatest<span class="w"> </span><span class="p">|</span><span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>wordsegment
this<span class="w"> </span>is<span class="w"> </span>a<span class="w"> </span><span class="nb">test</span>
</code></pre></div>
<p>Solutions from: <a href="https://stackoverflow.com/a/58010290">string - How can I split multiple joined words? - Stack Overflow</a></p>Storing Private Python Packages with Local NAS and Lightweight Servers2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/storing-private-python-packages-with-local-nas-and-lightweight-servers/<p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a></p>
<p>There are simple ways to store private Python packages on a local NAS (Network Attached Storage) without setting up a full-fledged package repository manager like Devpi or Artifactory …</p><p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a></p>
<p>There are simple ways to store private Python packages on a local NAS (Network Attached Storage) without setting up a full-fledged package repository manager like Devpi or Artifactory. Here are a couple of straightforward alternatives:</p>
<h3>Option 1: Local File System Repository</h3>
<p>This approach involves creating a directory on your NAS to store your Python packages. You can use the <code>pip</code> command's <code>--find-links</code> option to specify the location of your custom package directory.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Very simple setup and usage.</li>
<li>Well-suited for small teams or personal projects.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Lack of advanced features like access control, versioning, and replication.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li>
<p><strong>Create a Packages Directory on NAS</strong>: Create a directory on your NAS where you will store your Python packages.</p>
</li>
<li>
<p><strong>Upload Packages to NAS</strong>: Copy or move your Python packages into the NAS directory.</p>
</li>
<li>
<p><strong>Install Packages from NAS</strong>: Install packages from your NAS using the <code>pip</code> command with the <code>--find-links</code> option:</p>
<p><code>pip install --find-links=file:///path/to/nas/packages/ &lt;package&gt;</code></p>
</li>
</ol>
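<p>For illustration only, the install step above can also be scripted from Python; the NAS path below is a placeholder, and <code>--no-index</code> is an optional flag that prevents pip from falling back to public PyPI:</p>

```python
import subprocess
import sys

def nas_install_command(package, nas_dir="/path/to/nas/packages"):
    """Build a pip command that installs from a plain directory on a NAS."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                      # optional: never fall back to PyPI
        f"--find-links=file://{nas_dir}/",
        package,
    ]

# Run it once the NAS path points at a real directory of wheels/sdists:
# subprocess.check_call(nas_install_command("mypackage"))
```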
<h3>Option 2: Local PyPI Server</h3>
<p>You can set up a lightweight local PyPI server like <code>pypiserver</code> to serve your private Python packages.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Simple setup with basic package management features.</li>
<li>Suitable for small teams and projects.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>May lack advanced features like access control and versioning compared to full repository managers.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li>
<p><strong>Install <code>pypiserver</code></strong>: Install <code>pypiserver</code> using pip:</p>
<p><code>pip install pypiserver</code></p>
</li>
<li>
<p><strong>Create a Packages Directory</strong>: Create a directory to store your Python packages.</p>
</li>
<li>
<p><strong>Start <code>pypiserver</code></strong>: Start <code>pypiserver</code> with the command:</p>
<p><code>pypi-server -p 8080 /path/to/packages/</code></p>
</li>
<li>
<p><strong>Upload and Install Packages</strong>: Copy your Python packages to the packages directory, then install them using the <code>pip</code> command with the local PyPI server URL:</p>
<p><code>pip install --index-url=http://localhost:8080/simple/ &lt;package&gt;</code></p>
</li>
</ol>
<p>These simpler approaches provide a way to store private Python packages on a local NAS without the overhead of setting up a comprehensive repository manager. Choose the option that best fits your needs and resources. Keep in mind that while these methods are simpler, they lack some advanced features and may not be as scalable or secure as full repository managers.</p>Prompt Discovery in the Context of Large Language Models (LLMs) and Prompt Engineering2023-08-08T00:00:00+02:002023-08-08T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-08:/prompt_discivery-large-language-models-llms-prompt-engineering/<p>Prompt discovery in the context of large language models refers to the systematic process of identifying and optimizing prompts to elicit desired responses from the model. It involves formulating prompts in a way that effectively guides the model's generation towards accurate, relevant …</p><p>Prompt discovery in the context of large language models refers to the systematic process of identifying and optimizing prompts to elicit desired responses from the model. It involves formulating prompts in a way that effectively guides the model's generation towards accurate, relevant, and high-quality outputs. Prompt engineering is a critical component of this process, as it encompasses the design and refinement of prompts to achieve specific tasks or goals.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#technical-aspects-of-prompt-discovery">Technical Aspects of Prompt Discovery</a></li>
<li><a href="#activities-and-challenges-in-prompt-discovery">Activities and Challenges in Prompt Discovery</a></li>
<li><a href="#types-of-tools-and-technologies-for-prompt-discovery">Types of Tools and Technologies for Prompt Discovery</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="technical-aspects-of-prompt-discovery"></a></p>
<h2>Technical Aspects of Prompt Discovery</h2>
<ol>
<li>
<p><strong>Prompt Formulation and Structure</strong>: This involves crafting prompts using appropriate syntax, keywords, and context to provide clear instructions to the model. Experimentation with different sentence structures, question formats, and contextual cues can impact the model's understanding and response.</p>
</li>
<li>
<p><strong>Semantic Representation</strong>: Developing prompts that capture the desired semantic meaning and intent is crucial. This may involve exploring semantic role labeling, syntactic analysis, and dependency parsing to create prompts that effectively guide the model's reasoning.</p>
</li>
<li>
<p><strong>Prompt Permutations</strong>: Generating a diverse set of prompt variations can help in identifying which phrasings or formulations yield the best results. This could involve systematically modifying sentence structure, word order, or incorporating synonyms and paraphrases.</p>
</li>
<li>
<p><strong>Prompt Length and Complexity</strong>: Analyzing the impact of prompt length and complexity on model performance. Longer prompts may provide more context but risk confusing the model, while shorter prompts might lack necessary context.</p>
</li>
<li>
<p><strong>Multi-step Prompts</strong>: Crafting prompts that involve multi-step instructions or conditional logic to guide the model through a series of steps to reach a desired conclusion.</p>
</li>
<li>
<p><strong>Prompt Contextualization</strong>: Incorporating relevant context or domain-specific information within prompts to enhance the model's knowledge and improve response quality.</p>
</li>
<li>
<p><strong>Prompt Targeting</strong>: Experimenting with prompts that explicitly mention the desired answer or output, guiding the model toward a specific response.</p>
</li>
</ol>
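<p>The permutation idea above can be sketched with plain string templates. The templates and style suffixes below are illustrative assumptions, not part of any specific tool:</p>

```python
from itertools import product

# Hypothetical prompt templates and style suffixes, used only for illustration.
templates = [
    "Summarize the following text: {text}",
    "Provide a brief summary of: {text}",
    "{text}\n\nTL;DR:",
]
styles = ["", " Answer in one sentence.", " Answer in bullet points."]

def prompt_permutations(text):
    """Generate prompt variants by crossing templates with style suffixes."""
    return [t.format(text=text) + s for t, s in product(templates, styles)]

variants = prompt_permutations("Large language models are ...")
print(len(variants))  # 3 templates x 3 styles = 9 variants
```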
<p><a id="activities-and-challenges-in-prompt-discovery"></a></p>
<h2>Activities and Challenges in Prompt Discovery</h2>
<ol>
<li>
<p><strong>Prompt Effectiveness Evaluation</strong>: Developing methodologies to quantitatively and qualitatively assess the effectiveness of different prompts in eliciting accurate and relevant responses.</p>
</li>
<li>
<p><strong>Prompt Generalization</strong>: Investigating how well a well-optimized prompt can generalize across different models, architectures, and datasets.</p>
</li>
<li>
<p><strong>Prompt Adaptation</strong>: Identifying techniques to adapt prompts for various domains, languages, or tasks, considering nuances in language and context.</p>
</li>
<li>
<p><strong>Adversarial Prompt Design</strong>: Exploring methods to create prompts that challenge the model's limitations and encourage robustness against adversarial inputs.</p>
</li>
<li>
<p><strong>Active Learning for Prompt Refinement</strong>: Developing algorithms that iteratively learn and refine prompts based on model performance, aiming to reduce human intervention in the prompt engineering process.</p>
</li>
<li>
<p><strong>Prompt Diversity Exploration</strong>: Analyzing the impact of diverse prompts on model behavior, uncovering potential biases, and ensuring fairness in responses.</p>
</li>
</ol>
<p><a id="types-of-tools-and-technologies-for-prompt-discovery"></a></p>
<h2>Types of Tools and Technologies for Prompt Discovery</h2>
<ol>
<li>
<p><strong>Prompt Generation Assistants</strong>: AI-driven tools that provide prompt suggestions, permutations, and optimizations based on user-defined criteria and objectives.</p>
</li>
<li>
<p><strong>Prompt Evaluation Metrics</strong>: Novel metrics that quantitatively measure the quality, relevance, and correctness of model responses based on different prompts.</p>
</li>
<li>
<p><strong>Semantic Prompt Analysis</strong>: Advanced natural language understanding tools capable of dissecting prompt semantics, identifying key components, and suggesting improvements.</p>
</li>
<li>
<p><strong>Prompt Optimization Algorithms</strong>: Algorithms that leverage reinforcement learning, genetic algorithms, or neural architecture search to automatically discover effective prompts.</p>
</li>
<li>
<p><strong>Prompt-Aware Model Architectures</strong>: Model architectures explicitly designed to leverage and incorporate prompt information effectively during the generation process.</p>
</li>
<li>
<p><strong>Contextualization Modules</strong>: Modules that enhance prompts with contextual information, leveraging external knowledge sources or domain-specific databases.</p>
</li>
<li>
<p><strong>Bias and Fairness Detection Tools</strong>: Tools that analyze prompts for potential bias and fairness issues, ensuring the generated responses align with ethical and unbiased standards.</p>
</li>
<li>
<p><strong>Interactive Prompt Refinement Interfaces</strong>: Interfaces allowing users to interactively refine and experiment with prompts, providing real-time feedback on model responses.</p>
</li>
</ol>
<p>As the field of prompt engineering and large language models evolves, these tools and techniques will likely become more sophisticated, enabling more efficient and effective prompt discovery processes. A few tools were available at the time of writing (Aug 2023):</p>
<ul>
<li>
<p><a href="https://github.com/ianarawjo/ChainForge">ianarawjo/ChainForge</a> - An open-source visual programming environment for LLM experimentation and prompt evaluation.
<img alt="github stars shield" src="https://img.shields.io/github/stars/ianarawjo/ChainForge.svg?logo=github"></p>
</li>
<li>
<p><a href="https://github.com/logspace-ai/langflow">logspace-ai/langflow</a> - Langflow is a UI for LangChain, designed with react-flow to provide an effortless way to experiment and prototype flows.
<img alt="github stars shield" src="https://img.shields.io/github/stars/logspace-ai/langflow.svg?logo=github"></p>
</li>
<li>
<p><a href="https://github.com/FlowiseAI/Flowise">FlowiseAI/Flowise</a> - Drag &amp; drop UI to build your customized LLM flow
<img alt="github stars shield" src="https://img.shields.io/github/stars/FlowiseAI/Flowise.svg?logo=github"></p>
</li>
</ul>Azure OpenAI Langchain configuration2023-08-02T00:00:00+02:002023-10-23T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-02:/azure-openai-langchain-configuration/<p>This note contains a recipe for how to configure LangChain to use Azure OpenAI.</p>
<p>NOTE: requires the <code>python-dotenv</code> Python package to be installed</p>
<h2>create <code>.env</code> with configuration and secrets</h2>
<div class="highlight"><pre><span></span><code>OPENAI_API_TYPE="azure"
OPENAI_API_KEY="***"
OPENAI_API_BASE="***"
OPENAI_API_VERSION="***"
</code></pre></div>
<h2>initialize langchain</h2>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span><span class="p">,</span><span class="n">find_dotenv</span>
<span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import …</span></code></pre></div><p>This note contains a recipe for how to configure LangChain to use Azure OpenAI.</p>
<p>NOTE: requires the <code>python-dotenv</code> Python package to be installed</p>
<h2>create <code>.env</code> with configuration and secrets</h2>
<div class="highlight"><pre><span></span><code>OPENAI_API_TYPE="azure"
OPENAI_API_KEY="***"
OPENAI_API_BASE="***"
OPENAI_API_VERSION="***"
</code></pre></div>
<h2>initialize langchain</h2>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span><span class="p">,</span><span class="n">find_dotenv</span>
<span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import</span> <span class="n">AzureOpenAI</span>
<span class="n">load_dotenv</span><span class="p">(</span><span class="n">find_dotenv</span><span class="p">())</span>
<span class="n">deployment_name</span> <span class="o">=</span> <span class="s2">"text-davinci-003"</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s2">"text-davinci-003"</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">AzureOpenAI</span><span class="p">(</span><span class="n">deployment_name</span><span class="o">=</span><span class="n">deployment_name</span><span class="p">,</span> <span class="n">model_name</span><span class="o">=</span><span class="n">model_name</span><span class="p">)</span>
<span class="c1"># check if it works</span>
<span class="nb">print</span><span class="p">(</span><span class="n">llm</span><span class="p">(</span><span class="s2">"What is the capital of France?"</span><span class="p">))</span>
</code></pre></div>
<p>NOTE: the purpose of <code>find_dotenv</code> is to locate the <code>.env</code> file in your project directory or its parent directories. It starts the search from the directory of the file that calls it (or from the current working directory when <code>usecwd=True</code>) and moves up the directory tree until it finds the <code>.env</code> file; if none is found, it returns an empty string. This function is beneficial because it ensures your code can locate the <code>.env</code> file regardless of the directory from which your script is executed.</p>
<p>Rank fusion is a fundamental technique used in various domains, including data science and search engine optimization, to combine multiple ranked lists into a single, more reliable ranking. This process aims to exploit the strengths of individual ranking algorithms and mitigate …</p><h2>Introduction</h2>
<p>Rank fusion is a fundamental technique used in various domains, including data science and search engine optimization, to combine multiple ranked lists into a single, more reliable ranking. This process aims to exploit the strengths of individual ranking algorithms and mitigate their weaknesses, leading to improved overall performance. In this blog post, we will explore a range of rank fusion algorithms, starting from simple yet effective methods to advanced techniques employed by tech giants to achieve state-of-the-art results.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#algorithms">Algorithms</a><ul>
<li><a href="#borda-algorithm">Borda Algorithm</a></li>
<li><a href="#combining-probability-mass-function-cpmf">Combining Probability Mass Function (CPMF)</a></li>
<li><a href="#rank-biased-precision-rbp">Rank-Biased Precision (RBP)</a></li>
<li><a href="#lambdamart">LambdaMART</a></li>
<li><a href="#neural-rank-fusion">Neural Rank Fusion</a></li>
<li><a href="#reciprocal-rank-fusion">Reciprocal rank fusion</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="algorithms"></a></p>
<h2>Algorithms</h2>
<p><a id="borda-algorithm"></a></p>
<h3>Borda Algorithm</h3>
<p>The Borda algorithm is one of the simplest rank fusion techniques. It assigns scores to items based on their positions in the individual rankings and then combines these scores to obtain a fused ranking. In the context of search engine results, each document receives points based on its position in the ranked lists. The points are then summed up to form the final rank.</p>
<p>Consider <span class="math">\(n\)</span> ranked lists <span class="math">\(\{R_1, R_2, \ldots, R_n\}\)</span> with <span class="math">\(m\)</span> items. The Borda algorithm assigns points to each item <span class="math">\(i\)</span> in the following way:</p>
<div class="math">$$
\text{Borda Score}(i) = \sum_{j=1}^{n} (m - \text{rank}_j(i))
$$</div>
<p>
Where <span class="math">\(\text{rank}_j(i)\)</span> denotes the position of item <span class="math">\(i\)</span> in the <span class="math">\(j\)</span>th ranked list.</p>
<p>The Borda algorithm is easy to implement, but the quality of its fused ranking can degrade for large datasets or when the individual rankings are significantly diverse.</p>
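<p>As a rough sketch (not from the original post), the Borda score above can be computed directly from a list of rankings:</p>

```python
def borda_scores(rankings):
    """Fuse ranked lists with the Borda count.

    Each ranking is a list of items ordered from best (rank 1) to worst.
    An item at rank r in a list of m items contributes m - r points.
    """
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0) + (m - rank)
    # Sort items by descending total score to obtain the fused ranking.
    return sorted(scores, key=scores.get, reverse=True)

fused = borda_scores([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
print(fused)  # ['a', 'b', 'c'] - 'a' accumulates the most points (5)
```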
<p><a id="combining-probability-mass-function-cpmf"></a></p>
<h3>Combining Probability Mass Function (CPMF)</h3>
<p>CPMF is a probabilistic rank fusion method that incorporates the probability of an item being at a certain rank in individual lists. It assumes that the rankings are probabilistic and uses the Probability Mass Function (PMF) to calculate the fused ranking. CPMF outperforms Borda for diverse and noisy datasets.</p>
<p>Let <span class="math">\(p_{ij}\)</span> be the probability that item <span class="math">\(i\)</span> appears at rank <span class="math">\(j\)</span> in the <span class="math">\(n\)</span> lists. The CPMF score for item <span class="math">\(i\)</span> is given by:</p>
<div class="math">$$
\text{CPMF Score}(i) = \sum_{j=1}^{m} p_{ij}
$$</div>
<p>The probabilities <span class="math">\(p_{ij}\)</span> can be estimated using techniques like the <a href="https://hturner.github.io/PlackettLuce/articles/Overview.html">Plackett-Luce model</a> or the Thurstone-Mosteller model.</p>
<p><a id="rank-biased-precision-rbp"></a></p>
<h3>Rank-Biased Precision (RBP)</h3>
<p>RBP is a rank fusion method widely used in information retrieval systems. It incorporates a user-defined persistence parameter <span class="math">\(p\)</span> to reflect the probability that a user will examine the search results up to a certain rank. This parameter allows the search engine to optimize rankings based on user behavior.</p>
<p>For a given ranked list <span class="math">\(R_j\)</span>, the RBP score is calculated as follows:</p>
<div class="math">$$
\text{RBP Score}(R_j) = (1 - p) \sum_{k=1}^{m} p^{k-1} \text{rel}(R_j[k])
$$</div>
<p>
Where <span class="math">\(\text{rel}(R_j[k])\)</span> is an indicator function representing the relevance of the item at rank <span class="math">\(k\)</span> in list <span class="math">\(R_j\)</span>.</p>
<p>RBP provides more flexibility in tuning the importance of different ranks based on user preferences.</p>
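<p>To make the formula concrete, here is a minimal sketch of the RBP score for a single ranked list, assuming binary 0/1 relevance judgments (an illustrative simplification):</p>

```python
def rbp_score(relevances, p=0.8):
    """Rank-Biased Precision for one ranked list.

    relevances: 0/1 relevance judgments ordered by rank (rank 1 first).
    p: persistence - probability the user continues to the next result.
    """
    # (1 - p) * sum over ranks k of p^(k-1) * rel(k), with k starting at 1.
    return (1 - p) * sum(rel * p ** k for k, rel in enumerate(relevances))

# A list whose relevant items sit near the top scores higher.
print(rbp_score([1, 1, 0, 0]))                           # 0.36
print(rbp_score([1, 1, 0, 0]) > rbp_score([0, 0, 1, 1])) # True
```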
<p><a id="lambdamart"></a></p>
<h3>LambdaMART</h3>
<p>LambdaMART is an advanced algorithm used by tech giants like Microsoft and Yahoo for learning-to-rank tasks. It is based on the gradient boosting framework and employs LambdaRank gradients, which scale RankNet's pairwise cost by the change a pair swap would cause in a ranking metric such as NDCG.</p>
<p>The LambdaMART algorithm involves constructing a set of weak rankers (usually decision trees) that are iteratively refined to minimize the LambdaRank objective, which directly measures the pairwise disagreement between ranks.</p>
<div class="math">$$
\text{LambdaRank Objective} = \sum_{i=1}^{m} \sum_{j=1}^{m} \text{DCG gain}(i, j) \cdot \text{Lambda}(i, j)
$$</div>
<p>Where <span class="math">\(\text{DCG gain}(i, j)\)</span> is the gain of swapping items at ranks <span class="math">\(i\)</span> and <span class="math">\(j\)</span> in the ranking, and <span class="math">\(\text{Lambda}(i, j)\)</span> is a weight function that depends on the gradients of the individual models.</p>
<p>LambdaMART's ability to optimize for ranking measures directly contributes to its superior performance in learning-to-rank scenarios.</p>
<p><a id="neural-rank-fusion"></a></p>
<h3>Neural Rank Fusion</h3>
<p>With the rise of deep learning, neural rank fusion methods have gained popularity due to their ability to learn complex patterns from data. Neural rank fusion models typically employ techniques like siamese networks or transformer-based architectures to process individual rankings and generate a fused ranking.</p>
<p>In a siamese network-based approach, the individual rankings are fed into two parallel networks with shared weights. The networks learn to map the rankings into a common embedding space, where the fused ranking is generated based on similarity scores.</p>
<p>On the other hand, transformer-based rank fusion models utilize attention mechanisms to process and combine individual rankings effectively.</p>
<p>Neural rank fusion methods often outperform traditional algorithms when sufficient training data is available, but they may require substantial computational resources.</p>
<p><a id="reciprocal-rank-fusion"></a></p>
<h3>Reciprocal rank fusion</h3>
<p>The <a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Reciprocal Rank Fusion (RRF)</a> is an advanced algorithmic technique designed to amalgamate multiple result sets, each having distinct relevance indicators, into a unified result set. One of the key advantages of RRF is its ability to deliver high-quality results without the necessity for any tuning. Moreover, it does not mandate the relevance indicators to be interconnected or similar in nature.</p>
<p>Diving deeper into the algorithm, RRF is based on the concept of reciprocal rank. The reciprocal rank of a document is the multiplicative inverse of its rank. In the context of information retrieval, the rank of a document is its position in a list of documents sorted by relevance. The reciprocal rank is used to give higher weight to documents that appear earlier in the list.</p>
<p>The RRF algorithm combines the reciprocal ranks of the same document from different result sets to compute a combined score. The combined score is then used to rank the documents in the final result set. The formula used in the RRF algorithm is as follows:</p>
<div class="math">$$
\text{RRF Score} = \frac{1}{k + rank}
$$</div>
<p>Where <span class="math">\(k\)</span> is a constant (usually set to 60), and <span class="math">\(rank\)</span> is the rank of the document in a particular result set. The RRF score is calculated for each document in each result set, and the scores are then summed up to get the final score for each document.</p>
<p>The properties of the RRF algorithm include its simplicity, effectiveness, and robustness. It is simple because it only requires the ranks of the documents and does not need any tuning. It is effective because it can combine result sets with different relevance indicators and still produce high-quality results. It is robust because it is not sensitive to the choice of <span class="math">\(k\)</span> and can handle a large number of result sets.
<a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Rank fusion serves as a potent tool in the arsenal of data scientists and search engine experts, enhancing the efficacy of ranking performance. The spectrum of rank fusion algorithms ranges from the <strong>straightforward Borda algorithm</strong> to the more complex Neural Rank Fusion, each tailored to meet specific scenarios and data attributes. While the <strong>Borda</strong> algorithm is <strong>appreciated</strong> for its <strong>simplicity</strong> and <strong>ease of implementation</strong>, more advanced techniques like <strong>LambdaMART</strong> and <strong>Neural Rank Fusion</strong> are capable of delivering <strong>cutting-edge results for large-scale applications</strong>.</p>
<p>Incorporating the Reciprocal Rank Fusion (RRF) into this discussion, it stands out for its ability to <strong>combine multiple result sets with varying relevance indicators</strong> <strong>without the need for tuning</strong>. This makes it a robust and effective choice for many applications.</p>
<p><strong>Edits</strong>:</p>
<ul>
<li>2023-10-09 - Added the "Reciprocal rank fusion" section, rewrote the conclusion
<a id="references"></a></li>
</ul>
<h2>References</h2>
<ol>
<li>Wikipedia article: <a href="https://en.wikipedia.org/wiki/Borda_count">Borda algorithm</a></li>
<li><a href="https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/">Burges, Christopher. "From RankNet to LambdaRank to LambdaMART: An overview." Learning 11.23-581 (2010): 81.</a></li>
<li><a href="https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtois08.pdf">Rank-Biased Precision for Measurement of Retrieval Effectiveness</a></li>
<li><a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Reciprocal rank fusion (RRF)</a></li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Implementing Reciprocal Rank Fusion (RRF) in Python2023-07-28T00:00:00+02:002023-10-09T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-28:/implementing-rank-fusion-in-python/<p>In the world of Information Retrieval, ranking is one of the most crucial aspects. It prioritizes the matching information according to its relevancy. In many cases, having a single ranking model may not satisfy the diverse needs of users. This is where …</p><p>In the world of Information Retrieval, ranking is one of the most crucial aspects. It prioritizes the matching information according to its relevancy. In many cases, having a single ranking model may not satisfy the diverse needs of users. This is where the idea of Rank Fusion comes in; combining various ranking models to enhance the retrieval performance.
Let's learn how to implement a simple rank fusion approach in Python.</p>
<h2>Understanding the RRF Ranking Process</h2>
<p>The Reciprocal Rank Fusion (RRF) operates by collecting search outcomes from various strategies, assigning each document in the results a reciprocal rank score, and subsequently merging these scores to generate a new ranking. The underlying principle is that documents that consistently appear in top positions across diverse search strategies are likely more pertinent and should thus receive a higher rank in the consolidated result.</p>
<p>Here's a simplified breakdown of the RRF process:</p>
<ol>
<li>
<p>Collect ranked search outcomes from multiple simultaneous queries.</p>
</li>
<li>
<p>Assign reciprocal rank scores to each result in the ranked lists. The RRF process generates a new search score for each match in each result set. For each document in the search results, the algorithm assigns a reciprocal rank score based on its position in the list. This score is computed as 1/(rank + k), where 'rank' is the document's position in the list, and 'k' is a constant. Empirical observation suggests that 'k' performs best when set to a small value, such as 60. Note that this 'k' value is a constant in the RRF algorithm and is entirely distinct from the 'k' that regulates the number of nearest neighbors.</p>
</li>
<li>
<p>Combine scores. The algorithm adds up the reciprocal rank scores acquired from each search strategy for each document, thereby generating a combined score for each document.</p>
</li>
<li>
<p>The algorithm ranks documents based on the combined scores and arranges them accordingly. The resulting list constitutes the fused ranking.</p>
</li>
</ol>
<p>To depict the Reciprocal Rank Fusion (RRF) process, we can use a flowchart.
<img alt="Reciprocal Rank Fusion (RRF) process flow chart" src="/images/Reciprocal_Rank_Fusion/Reciprocal_Rank_Fusion.png"></p>
<p><strong>Figure 1:</strong> Reciprocal Rank Fusion (RRF) Process Flowchart. The diagram illustrates the steps involved in the RRF ranking process.</p>
<h2>Implementing Reciprocal Rank Fusion</h2>
<p>The <a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Reciprocal Rank Fusion (RRF)</a> is an advanced algorithmic technique designed to amalgamate multiple result sets, each having distinct relevance indicators, into a unified result set. One of the key advantages of RRF is its ability to deliver high-quality results without the necessity for any tuning. Moreover, it does not mandate the relevance indicators to be interconnected or similar in nature.</p>
<p>RRF uses the following formula to determine the score for ranking each document:</p>
<div class="highlight"><pre><span></span><code><span class="n">score</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">queries</span><span class="p">:</span>
<span class="k">if</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">result</span><span class="p">(</span><span class="n">q</span><span class="p">):</span>
<span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span> <span class="n">k</span> <span class="o">+</span> <span class="n">rank</span><span class="p">(</span> <span class="n">result</span><span class="p">(</span><span class="n">q</span><span class="p">),</span> <span class="n">d</span> <span class="p">)</span> <span class="p">)</span>
<span class="k">return</span> <span class="n">score</span>
<span class="c1"># where</span>
<span class="c1"># k is a ranking constant</span>
<span class="c1"># q is a query in the set of queries</span>
<span class="c1"># d is a document in the result set of q</span>
<span class="c1"># result(q) is the result set of q</span>
<span class="c1"># rank( result(q), d ) is d's rank within the result(q) starting from 1</span>
</code></pre></div>
<p>(code from <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html">Elasticsearch documentation</a>)</p>
<p>The same computation can be expressed more compactly, and often slightly faster, as a list comprehension:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">reciprocal_rank_fusion</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">result_func</span><span class="p">,</span> <span class="n">rank_func</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">sum</span><span class="p">([</span><span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="n">k</span> <span class="o">+</span> <span class="n">rank_func</span><span class="p">(</span><span class="n">result_func</span><span class="p">(</span><span class="n">q</span><span class="p">),</span> <span class="n">d</span><span class="p">))</span> <span class="k">if</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">result_func</span><span class="p">(</span><span class="n">q</span><span class="p">)</span> <span class="k">else</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">queries</span><span class="p">])</span>
</code></pre></div>
<p>This function takes as arguments:</p>
<ul>
<li>A collection of queries</li>
<li>A document <code>d</code></li>
<li>A ranking constant <code>k</code></li>
<li>A function <code>result_func</code> that represents the <code>result(q)</code> operation from the pseudocode above.</li>
<li>A function <code>rank_func</code> that represents the <code>rank(result(q), d)</code> operation from the pseudocode above.</li>
</ul>
<blockquote>
<p><strong>NOTE:</strong> a list comprehension is used to perform the operations that the for-loop did, allowing Python to compute the result in a more optimized way. However, this isn't truly "vectorized" computing as you would find in libraries like NumPy or in languages like R, where computations are performed concurrently rather than sequentially.</p>
</blockquote>
<p>The <code>result_func</code> function takes a query <code>q</code>, and returns a list of documents that are the results of the query. For simplicity, let's assume that each query corresponds to a list of documents in a dictionary called <code>database</code>.</p>
<p>The <code>rank_func</code> function takes a list of documents (results of a query) and a specific document <code>d</code>, and returns the rank of <code>d</code> in the list.</p>
<div class="highlight"><pre><span></span><code><span class="n">database</span> <span class="o">=</span> <span class="p">{</span> <span class="c1"># assuming your queries and results are stored in a dictionary</span>
<span class="s1">'query1'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'doc1'</span><span class="p">,</span> <span class="s1">'doc2'</span><span class="p">,</span> <span class="s1">'doc3'</span><span class="p">],</span>
<span class="s1">'query2'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'doc3'</span><span class="p">,</span> <span class="s1">'doc1'</span><span class="p">,</span> <span class="s1">'doc2'</span><span class="p">],</span>
<span class="c1"># more queries and their document results...</span>
<span class="p">}</span>
<span class="k">def</span> <span class="nf">result_func</span><span class="p">(</span><span class="n">q</span><span class="p">):</span>
<span class="k">return</span> <span class="n">database</span><span class="p">[</span><span class="n">q</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">rank_func</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
<span class="k">return</span> <span class="n">results</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># adding 1 because ranks start from 1</span>
</code></pre></div>
<p>Then, the <code>reciprocal_rank_fusion</code> function can be called like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">k</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">d</span> <span class="o">=</span> <span class="s1">'doc1'</span>
<span class="n">queries</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'query1'</span><span class="p">,</span> <span class="s1">'query2'</span><span class="p">]</span> <span class="c1"># fill this with your actual query keys</span>
<span class="nb">print</span><span class="p">(</span><span class="n">reciprocal_rank_fusion</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">result_func</span><span class="p">,</span> <span class="n">rank_func</span><span class="p">))</span>
</code></pre></div>
<p>This assumes that queries and their corresponding results are uniquely stored in a dictionary, and that your document ranks are determined by their order in the list of results. With the sample <code>database</code> above, the call prints ≈0.3095 (that is, 1/6 + 1/7), since <code>doc1</code> ranks first in <code>query1</code> and second in <code>query2</code>.</p>
<p>Please modify the functions <code>result_func</code>, <code>rank_func</code>, and <code>database</code> to fit your specific application details and data.</p>
<h2>Conclusion</h2>
<p>The concept of Rank Fusion, particularly the Reciprocal Rank Fusion (RRF) method, offers a promising approach to amalgamate multiple result sets into a unified one. This article has demonstrated how to implement a simple RRF in Python.</p>
<p>While the example provided in this article is simplified, it provides a solid foundation for understanding the RRF process and how to implement it in Python. Depending on the specific application and data, the functions and database structure may need to be modified. However, the core concept and approach remain the same.</p>
<p>The RRF method is a powerful tool in the field of Information Retrieval, providing a robust and efficient way to combine multiple ranking models to enhance retrieval performance. By understanding and implementing this method, one can significantly improve the quality and relevance of search results, thereby enhancing user satisfaction and system effectiveness.</p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-11-06: changed title to: Implementing Reciprocal Rank Fusion and Borda Count in Python</li>
<li>2023-11-06: added RRF description</li>
<li>2023-11-06: added optimized implementation</li>
</ul>
<p>X::<a href="https://www.safjan.com/Rank-fusion-algorithms-from-simple-to-advanced/">Rank Fusion Algorithms - From Simple to Advanced</a></p>gitignore-style exclusion for restic2023-07-27T00:00:00+02:002023-07-27T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-27:/gitignore-style-exclusion-for-restic/<p>X::<a href="https://www.safjan.com/verify-backups-restic-example/">Don't Just Create Backups, Verify Them - How Restic Can Help?</a></p>
<p>Restic is a popular backup tool that supports the use of <code>.gitignore</code>-style exclusion patterns to ignore certain files and directories during the backup process. This feature is useful when you …</p><p>X::<a href="https://www.safjan.com/verify-backups-restic-example/">Don't Just Create Backups, Verify Them - How Restic Can Help?</a></p>
<p>Restic is a popular backup tool that supports the use of <code>.gitignore</code>-style exclusion patterns to ignore certain files and directories during the backup process. This feature is useful when you want to exclude specific files or directories from being backed up, such as temporary files, caches, or build artifacts.</p>
<p>To use exclusion patterns with Restic, you can create a file called <code>.resticignore</code> in the root of your repository (where you run Restic). The name is just a convention - Restic does not look for this file automatically; you pass it explicitly on the command line. The file should contain the patterns for the files and directories you want to exclude, just like you would do with a <code>.gitignore</code> file.</p>
<p>Here's how you can use exclusion patterns in Restic:</p>
<ol>
<li>
<p>Create a <code>.resticignore</code> file:
Inside your project's root directory (or the directory you're backing up), create a file named <code>.resticignore</code>. You can use any text editor to create this file.</p>
</li>
<li>
<p>Add patterns to ignore:
In the <code>.resticignore</code> file, list the files and directories you want to ignore during the backup. Each pattern should be on a separate line. You can use the same syntax as you would in a <code>.gitignore</code> file.</p>
</li>
</ol>
<p>For example, a simple <code>.resticignore</code> file might look like this:</p>
<div class="highlight"><pre><span></span><code>*.log
temp/
cache/
build/
</code></pre></div>
<p>The above example would ignore all files with the <code>.log</code> extension and the <code>temp</code>, <code>cache</code>, and <code>build</code> directories.</p>
<ol>
<li>Run Restic backup with the <code>--exclude-file</code> option:
When running Restic to perform the backup, specify the <code>.resticignore</code> file using the <code>--exclude-file</code> option. This tells Restic to use the patterns in that file to exclude certain files and directories.</li>
</ol>
<p>Here's an example command:</p>
<p><code>restic backup /path/to/your/data --exclude-file /path/to/.resticignore</code></p>
<p>Replace <code>/path/to/your/data</code> with the actual path of the data you want to back up and <code>/path/to/.resticignore</code> with the path to your <code>.resticignore</code> file.</p>
<p>By using the <code>.resticignore</code> file, you can customize what gets backed up and what gets excluded. This can be particularly useful to avoid backing up large or unnecessary files, reducing storage space and backup time.</p>Location of Python Virtual Environments - Choosing Between Project-Folder and Centralized Folder2023-07-27T00:00:00+02:002023-07-27T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-27:/location-of-python-virtual-environments-choosing-between-project-folder-and-central-folder/<h2>Project-folder Virtual Environments</h2>
<blockquote>
<p>In this approach, you create a virtual environment <strong>within the project directory</strong> itself. This means that each project has its isolated Python environment, and you manage dependencies specific to that project.</p>
</blockquote>
<p>With this approach you have clarity where the …</p><h2>Project-folder Virtual Environments</h2>
<blockquote>
<p>In this approach, you create a virtual environment <strong>within the project directory</strong> itself. This means that each project has its isolated Python environment, and you manage dependencies specific to that project.</p>
</blockquote>
<p>With this approach it is clear where the associated virtual environment resides, which is helpful when doing cleanup or backup.</p>
<h2>Centralized Location for Virtual Environments</h2>
<blockquote>
<p>In this approach, you create a <strong>centralized directory</strong> where <strong>all virtual environments reside</strong>. This directory can be outside your projects, e.g., <code>~/.virtualenvs</code> or any other location you prefer.</p>
</blockquote>
<p>With this approach the project directory contains mainly code, while the replicable content - the virtual environment files - lives outside the project. This makes it easy to, for example, back up the whole project directory without having to exclude the virtual environment, which is typically not worth backing up.</p>
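<p>As a small illustration of the centralized layout, the standard-library <code>venv</code> module can create an environment anywhere. The <code>~/.virtualenvs</code> path and project name below are assumptions, not a convention enforced by Python:</p>

```python
import venv
from pathlib import Path

def create_central_env(project: str,
                       central: Path = Path.home() / ".virtualenvs") -> Path:
    """Create a virtual environment for `project` under a central folder."""
    env_dir = central / project
    # with_pip=True would additionally bootstrap pip into the new environment
    venv.create(env_dir, with_pip=False)
    return env_dir

# e.g. create_central_env("myproject")  -> ~/.virtualenvs/myproject
```

The project folder itself then stays free of environment files; tools such as <code>virtualenvwrapper</code> automate the same layout.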
<p>X: <a href="https://www.safjan.com/python-create-virtualenv-methods/">Creating Virtual Environments in Python</a></p>Cookiecutters for the python package with poetry2023-07-26T00:00:00+02:002023-07-26T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-26:/cookiecutter-for-the-python-package-with-poetry/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#cookiecutter-and-poetry-for-python-project-scaffolding">Cookiecutter and Poetry for Python Project Scaffolding</a></li>
<li><a href="#benefits-of-using-cookiecutter-for-project-scaffolding">Benefits of Using Cookiecutter for Project Scaffolding</a></li>
<li><a href="#advantages-of-using-poetry-for-dependency-management">Advantages of Using Poetry for Dependency Management</a></li>
<li><a href="#cookiecutters">Cookiecutters</a></li>
<li><a href="#cjolowiczcookiecutter-hypermodern-python">cjolowicz/cookiecutter-hypermodern-python</a></li>
<li><a href="#fpgmaascookiecutter-poetry">fpgmaas/cookiecutter-poetry</a></li>
<li><a href="#radix-aipoetry-cookiecutter">radix-ai/poetry-cookiecutter</a></li>
<li><a href="#albertorioscookiecutter-poetry-pypackage">albertorios/cookiecutter-poetry-pypackage</a></li>
<li><a href="#timhughescookiecutter-poetry">timhughes/cookiecutter-poetry</a></li>
<li><a href="#johanvergeercookiecutter-poetry">johanvergeer/cookiecutter-poetry</a></li>
<li><a href="#elbakramercookiecutter-poetry">elbakramer/cookiecutter-poetry</a></li>
<li><a href="#wboxx1cookiecutter-pypackage-poetry">wboxx1/cookiecutter-pypackage-poetry</a></li>
<li><a href="#cookiecutter-wrapper">cookiecutter wrapper</a></li>
<li><a href="#tools-and-services-often-used-in-python-project-cookiecutters">Tools …</a></li></ul><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#cookiecutter-and-poetry-for-python-project-scaffolding">Cookiecutter and Poetry for Python Project Scaffolding</a></li>
<li><a href="#benefits-of-using-cookiecutter-for-project-scaffolding">Benefits of Using Cookiecutter for Project Scaffolding</a></li>
<li><a href="#advantages-of-using-poetry-for-dependency-management">Advantages of Using Poetry for Dependency Management</a></li>
<li><a href="#cookiecutters">Cookiecutters</a></li>
<li><a href="#cjolowiczcookiecutter-hypermodern-python">cjolowicz/cookiecutter-hypermodern-python</a></li>
<li><a href="#fpgmaascookiecutter-poetry">fpgmaas/cookiecutter-poetry</a></li>
<li><a href="#radix-aipoetry-cookiecutter">radix-ai/poetry-cookiecutter</a></li>
<li><a href="#albertorioscookiecutter-poetry-pypackage">albertorios/cookiecutter-poetry-pypackage</a></li>
<li><a href="#timhughescookiecutter-poetry">timhughes/cookiecutter-poetry</a></li>
<li><a href="#johanvergeercookiecutter-poetry">johanvergeer/cookiecutter-poetry</a></li>
<li><a href="#elbakramercookiecutter-poetry">elbakramer/cookiecutter-poetry</a></li>
<li><a href="#wboxx1cookiecutter-pypackage-poetry">wboxx1/cookiecutter-pypackage-poetry</a></li>
<li><a href="#cookiecutter-wrapper">cookiecutter wrapper</a></li>
<li><a href="#tools-and-services-often-used-in-python-project-cookiecutters">Tools and services often used in python project cookiecutters</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p><a id="cookiecutter-and-poetry-for-python-project-scaffolding"></a></p>
<h3>Cookiecutter and Poetry for Python Project Scaffolding</h3>
<p>In the world of Python development, efficient project setup and management are essential for streamlined and successful software development. Two powerful tools that aid in this process are <strong>Cookiecutter</strong> and <strong>Poetry</strong>.</p>
<p><strong>Cookiecutter</strong> is a command-line utility that enables developers to create project templates, or "cookiecutters," which serve as scaffolds for new projects. These cookiecutters are pre-configured templates that include project structures, file layouts, and even code snippets to kickstart the development process. With its simplicity and flexibility, Cookiecutter allows developers to easily generate consistent and well-organized projects without reinventing the wheel each time.</p>
<p>On the other hand, <strong>Poetry</strong> is a modern package manager and build tool for Python projects. It simplifies dependency management, packaging, and publishing while ensuring reproducible builds and version control. Poetry provides a user-friendly interface for managing project dependencies and virtual environments, making it a valuable asset for Python developers looking for an efficient way to manage their project's requirements.</p>
<p><a id="benefits-of-using-cookiecutter-for-project-scaffolding"></a></p>
<h3>Benefits of Using Cookiecutter for Project Scaffolding</h3>
<p>Using Cookiecutter for project scaffolding offers several key advantages:</p>
<ol>
<li>
<p><strong>Consistency</strong>: Cookiecutter promotes consistency across projects by providing a standardized and repeatable starting point. This consistency ensures that developers adhere to best practices and maintain a clean project structure throughout the development process.</p>
</li>
<li>
<p><strong>Time Savings</strong>: With Cookiecutter, developers can avoid the repetitive and time-consuming task of setting up a new project from scratch. By using pre-defined templates, the initial project setup becomes quick and hassle-free, allowing developers to focus on writing code and implementing features.</p>
</li>
<li>
<p><strong>Community-Driven Templates</strong>: The open-source nature of Cookiecutter means that developers can access a vast repository of community-contributed templates. This diverse collection covers various project types and frameworks, making it easy to find a suitable starting point for almost any Python project.</p>
</li>
<li>
<p><strong>Flexibility and Customization</strong>: While offering pre-configured templates, Cookiecutter also allows developers to customize their project scaffolds. This flexibility ensures that developers can tailor the project structure to fit their specific needs and project requirements.</p>
</li>
</ol>
<p><a id="advantages-of-using-poetry-for-dependency-management"></a></p>
<h3>Advantages of Using Poetry for Dependency Management</h3>
<p>Poetry's features complement the benefits of Cookiecutter, making it an ideal companion for Python project development:</p>
<ol>
<li>
<p><strong>Dependency Management Made Easy</strong>: Poetry simplifies the management of project dependencies, handling both direct dependencies and their dependencies, providing a single-source-of-truth for the project's requirements.</p>
</li>
<li>
<p><strong>Virtual Environments</strong>: Poetry creates isolated virtual environments for projects, ensuring that each project has its own set of dependencies, avoiding version conflicts and promoting project stability.</p>
</li>
<li>
<p><strong>Publication and Distribution</strong>: Poetry streamlines the process of publishing packages to the Python Package Index (PyPI), simplifying the distribution of Python packages and making them accessible to a wider audience.</p>
</li>
<li>
<p><strong>Version Control and Reproducibility</strong>: Poetry's <code>pyproject.toml</code> file allows for clear specification of package versions, ensuring reproducible builds and making it easier to manage version updates.</p>
</li>
</ol>
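<p>For illustration, a minimal <code>pyproject.toml</code> along these lines shows how Poetry records metadata and version constraints in one place (the package name, author, and versions below are made-up):</p>

```toml
[tool.poetry]
name = "example-package"          # made-up project name
version = "0.1.0"
description = "Demo of Poetry dependency pinning"
authors = ["Jane Doe <jane@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"                   # caret constraint: >=3.9, <4.0
requests = "^2.31"                # >=2.31, <3.0

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```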
<p><a id="cookiecutters"></a></p>
<h2>Cookiecutters</h2>
<p><a id="cjolowiczcookiecutter-hypermodern-python"></a></p>
<h3>cjolowicz/cookiecutter-hypermodern-python</h3>
<p><a href="https://github.com/cjolowicz/cookiecutter-hypermodern-python">https://github.com/cjolowicz/cookiecutter-hypermodern-python</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/cjolowicz/cookiecutter-hypermodern-python.svg?logo=github"></p>
<p><a href="https://cookiecutter-hypermodern-python.readthedocs.io/en/2021.3.14/guide.html">User Guide — Hypermodern Python Cookiecutter documentation</a></p>
<ul>
<li>Packaging and dependency management with <a href="https://python-poetry.org/">Poetry</a></li>
<li>Test automation with <a href="https://nox.thea.codes/">Nox</a></li>
<li>Linting with <a href="https://pre-commit.com/">pre-commit</a> and <a href="http://flake8.pycqa.org/">Flake8</a></li>
<li>Continuous integration with <a href="https://github.com/features/actions">GitHub Actions</a></li>
<li>Documentation with <a href="http://www.sphinx-doc.org/">Sphinx</a> and <a href="https://readthedocs.org/">Read the Docs</a></li>
<li>Automated uploads to <a href="https://pypi.org/">PyPI</a> and <a href="https://test.pypi.org/">TestPyPI</a></li>
<li>Automated release notes with <a href="https://github.com/release-drafter/release-drafter">Release Drafter</a></li>
<li>Automated dependency updates with <a href="https://dependabot.com/">Dependabot</a></li>
<li>Code formatting with <a href="https://github.com/psf/black">Black</a> and <a href="https://prettier.io/">Prettier</a></li>
<li>Testing with <a href="https://docs.pytest.org/en/latest/">pytest</a></li>
<li>Code coverage with <a href="https://coverage.readthedocs.io/">Coverage.py</a></li>
<li>Coverage reporting with <a href="https://codecov.io/">Codecov</a></li>
<li>Command-line interface with <a href="https://click.palletsprojects.com/">Click</a></li>
<li>Static type-checking with <a href="http://mypy-lang.org/">mypy</a></li>
<li>Runtime type-checking with <a href="https://github.com/agronholm/typeguard">Typeguard</a></li>
<li>Security audit with <a href="https://github.com/PyCQA/bandit">Bandit</a> and <a href="https://github.com/pyupio/safety">Safety</a></li>
<li>Check documentation examples with <a href="https://github.com/Erotemic/xdoctest">xdoctest</a></li>
<li>Generate API documentation with <a href="https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html">autodoc</a> and <a href="https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html">napoleon</a></li>
<li>Generate command-line reference with <a href="https://sphinx-click.readthedocs.io/">sphinx-click</a></li>
<li>Manage project labels with <a href="https://github.com/marketplace/actions/github-labeler">GitHub Labeler</a></li>
</ul>
<p><a id="fpgmaascookiecutter-poetry"></a></p>
<h3>fpgmaas/cookiecutter-poetry</h3>
<p><a href="https://github.com/fpgmaas/cookiecutter-poetry">https://github.com/fpgmaas/cookiecutter-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/fpgmaas/cookiecutter-poetry.svg?logo=github"></p>
<ul>
<li><a href="https://python-poetry.org/">Poetry</a> for dependency management</li>
<li>CI/CD with <a href="https://github.com/features/actions">GitHub Actions</a></li>
<li>Pre-commit hooks with <a href="https://pre-commit.com/">pre-commit</a></li>
<li>Code quality with <a href="https://pypi.org/project/black/">black</a>, <a href="https://github.com/charliermarsh/ruff">ruff</a>, <a href="https://mypy.readthedocs.io/en/stable/">mypy</a>, and <a href="https://github.com/fpgmaas/deptry/">deptry</a></li>
<li>Publishing to <a href="https://pypi.org/">Pypi</a> or <a href="https://jfrog.com/artifactory">Artifactory</a> by creating a new release on GitHub</li>
<li>Testing and coverage with <a href="https://docs.pytest.org/en/7.1.x/">pytest</a> and <a href="https://about.codecov.io/">codecov</a></li>
<li>Documentation with <a href="https://www.mkdocs.org/">MkDocs</a></li>
<li>Compatibility testing for multiple versions of Python with <a href="https://tox.wiki/en/latest/">Tox</a></li>
<li>Containerization with <a href="https://www.docker.com/">Docker</a></li>
</ul>
<p><a id="radix-aipoetry-cookiecutter"></a></p>
<h3>radix-ai/poetry-cookiecutter</h3>
<p><a href="https://github.com/radix-ai/poetry-cookiecutter">https://github.com/radix-ai/poetry-cookiecutter</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/radix-ai/poetry-cookiecutter.svg?logo=github"></p>
<p><a id="albertorioscookiecutter-poetry-pypackage"></a></p>
<h3>albertorios/cookiecutter-poetry-pypackage</h3>
<p><a href="https://github.com/albertorios/cookiecutter-poetry-pypackage">https://github.com/albertorios/cookiecutter-poetry-pypackage</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/albertorios/cookiecutter-poetry-pypackage.svg?logo=github"></p>
<ul>
<li>Develop, build, and release Python packages via <a href="https://python-poetry.org/">Poetry</a></li>
<li>Test against multiple Python versions via <a href="https://tox.readthedocs.io/en/latest/">Tox</a></li>
<li>Bump semantic version via <a href="https://github.com/c4urself/bump2version">bump2version</a></li>
<li>Optional command-line interface via <a href="https://click.palletsprojects.com/">Click</a></li>
<li>Repeatable build environments via <a href="https://www.docker.com/">Docker</a></li>
</ul>
<p><a id="timhughescookiecutter-poetry"></a></p>
<h3>timhughes/cookiecutter-poetry</h3>
<p><a href="https://github.com/timhughes/cookiecutter-poetry">https://github.com/timhughes/cookiecutter-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/timhughes/cookiecutter-poetry.svg?logo=github"></p>
<p>Cookiecutter template configured with the following:</p>
<ul>
<li>poetry</li>
<li>pytest</li>
<li>black</li>
<li>bandit</li>
<li>pyinstaller</li>
<li>jupyterlab</li>
<li>click</li>
</ul>
<p><a id="johanvergeercookiecutter-poetry"></a></p>
<h3>johanvergeer/cookiecutter-poetry</h3>
<p><a href="https://github.com/johanvergeer/cookiecutter-poetry">https://github.com/johanvergeer/cookiecutter-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/johanvergeer/cookiecutter-poetry.svg?logo=github"></p>
<ul>
<li>Testing setup with <code>pytest</code></li>
<li><a href="https://github.com/features/actions">GitHub Actions</a>: Ready for GitHub actions</li>
<li><a href="http://sphinx-doc.org/">Sphinx</a> docs: Documentation ready for generation with, for example, <a href="https://readthedocs.io/">ReadTheDocs</a></li>
<li>Auto-release to <a href="https://pypi.python.org/pypi">PyPI</a> when you push a new tag to master (optional)</li>
<li>Command-line interface using Click (optional)</li>
<li>GitHub Issue templates for bug reports and feature requests</li>
</ul>
<p><a id="elbakramercookiecutter-poetry"></a></p>
<h3>elbakramer/cookiecutter-poetry</h3>
<p><a href="https://github.com/elbakramer/cookiecutter-poetry">https://github.com/elbakramer/cookiecutter-poetry</a>
(fork from johanvergeer/cookiecutter-poetry)</p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/elbakramer/cookiecutter-poetry.svg?logo=github"></p>
<ul>
<li>Package and dependency management using <a href="https://python-poetry.org/">Poetry</a></li>
<li>Has the option to stick with setuptools (setup.py)</li>
<li><a href="https://github.com/features/actions">GitHub Actions</a>: Ready for GitHub Actions</li>
<li>Build and test on push or pull request for continuous integration (CI) (+badge)</li>
<li>Build documentation on push, publish the built documentation to Github Pages (+badge)</li>
<li>Draft release on push, this draft can be published manually or even automatically when a new tag is pushed</li>
<li>Build and release Python package to <a href="https://pypi.org/">PyPI</a> when a new release tag is published on GitHub</li>
<li>Many <a href="https://pre-commit.com/">pre-commit</a> hook-based formatting, linting, testing tools</li>
<li>Upgrade syntax for newer Python with <a href="https://github.com/asottile/pyupgrade">pyupgrade</a></li>
<li>Formatting with <a href="https://github.com/psf/black">black</a></li>
<li>Import sorting with <a href="https://github.com/PyCQA/isort">isort</a></li>
<li>Linting with <a href="https://github.com/PyCQA/flake8">flake8</a> and <a href="https://github.com/PyCQA/pylint/">pylint</a></li>
<li>Static typing with <a href="https://github.com/python/mypy">mypy</a></li>
<li>Testing with <a href="https://docs.pytest.org/en/stable/contents.html">pytest</a></li>
<li>Git hooks that run all the above with <a href="https://pre-commit.com/">pre-commit</a></li>
<li>Other integrations with external sites/services</li>
<li><a href="http://sphinx-doc.org/">Sphinx</a> docs serving with <a href="https://readthedocs.io/">ReadTheDocs</a> (+badge)</li>
<li>Coverage report with <a href="https://about.codecov.io/">Codecov</a> (+badge)</li>
<li>Monitoring dependency version updates with <a href="https://requires.io/">Requires.io</a> or <a href="https://pyup.io/">PyUp</a> (+badge)</li>
<li>Version bumping using <a href="https://github.com/c4urself/bump2version">bump2version</a></li>
<li>Dynamic versioning using <a href="https://github.com/mtkennerly/dunamai">dunamai</a></li>
<li>Command-line interface using <a href="https://click.palletsprojects.com/en/7.x/">Click</a></li>
</ul>
<p><a id="wboxx1cookiecutter-pypackage-poetry"></a></p>
<h3>wboxx1/cookiecutter-pypackage-poetry</h3>
<p><a href="https://github.com/wboxx1/cookiecutter-pypackage-poetry">https://github.com/wboxx1/cookiecutter-pypackage-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/wboxx1/cookiecutter-pypackage-poetry.svg?logo=github"></p>
<ul>
<li>Testing setup with <code>unittest</code> and <code>python setup.py test</code> or <code>pytest</code></li>
<li><a href="http://travis-ci.org/">Travis-CI</a>: Ready for Travis Continuous Integration testing</li>
<li><a href="http://appveyor.com/">Appveyor</a>: Ready for Appveyor Continuous Integration testing</li>
<li><a href="http://testrun.org/tox/">Tox</a> testing: Setup to easily test for Python 2.7, 3.4, 3.5, 3.6, 3.7</li>
<li><a href="http://sphinx-doc.org/">Sphinx</a> docs: Documentation ready for generation with, for example, <a href="https://readthedocs.io/">ReadTheDocs</a></li>
<li><a href="https://github.com/c4urself/bump2version">Bump2version</a>: Pre-configured version bumping with a single command</li>
<li>Auto-release to <a href="https://pypi.python.org/pypi">PyPI</a> when you push a new tag to master (optional)</li>
<li>Command-line interface using Click (optional)</li>
</ul>
<p><a id="cookiecutter-wrapper"></a></p>
<h2>cookiecutter wrapper</h2>
<p><a href="https://pypi.org/project/cookiecutter-poetry/">https://pypi.org/project/cookiecutter-poetry/</a></p>
<p><a id="tools-and-services-often-used-in-python-project-cookiecutters"></a></p>
<h2>Tools and services often used in python project cookiecutters</h2>
<ul>
<li><a href="https://cookiecutter.readthedocs.io/">Cookiecutter</a>: Command-line utility for creating project templates.</li>
<li><a href="https://python-poetry.org/">Poetry</a>: Package manager and build tool for Python projects.</li>
<li><a href="https://pre-commit.com/">Pre-commit</a>: Framework for managing and maintaining multi-language pre-commit hooks.</li>
<li><a href="https://github.com/psf/black">Black</a>: Opinionated code formatter for Python.</li>
<li><a href="https://tox.readthedocs.io/">Tox</a>: Generic virtualenv management and test command line tool.</li>
<li><a href="https://nox.thea.codes/">Nox</a>: Flexible test automation tool.</li>
<li><a href="https://github.com/charliermarsh/ruff">Ruff</a>: Fast linter and code-quality tool for Python projects.</li>
<li><a href="https://github.com/features/actions">GitHub Actions</a>: Continuous integration and continuous deployment service by GitHub.</li>
<li><a href="https://about.codecov.io/">Codecov</a>: Code coverage reporting tool.</li>
<li><a href="https://github.com/c4urself/bump2version">Bump2version</a>: Version-bumping utility for software projects.</li>
<li><a href="https://www.docker.com/">Docker</a>: Platform for building, shipping, and running applications in containers.</li>
<li><a href="http://www.sphinx-doc.org/">Sphinx</a>: Documentation generator for Python projects.</li>
<li><a href="https://readthedocs.org/">Read the Docs</a>: Hosting service for software documentation.</li>
<li><a href="https://github.com/release-drafter/release-drafter">Release Drafter</a>: Automated release notes generation tool.</li>
<li><a href="https://dependabot.com/">Dependabot</a>: Automated dependency updates tool.</li>
<li><a href="https://prettier.io/">Prettier</a>: Opinionated code formatter for JavaScript, TypeScript, CSS, Markdown, and other languages.</li>
<li><a href="https://docs.pytest.org/en/latest/">pytest</a>: Framework for writing and running Python tests.</li>
<li><a href="https://coverage.readthedocs.io/">Coverage.py</a>: Code coverage measurement tool for Python.</li>
<li><a href="https://github.com/agronholm/typeguard">Typeguard</a>: Runtime type checking for Python functions.</li>
<li><a href="https://github.com/PyCQA/bandit">Bandit</a>: Security linter for Python code.</li>
<li><a href="https://github.com/pyupio/safety">Safety</a>: Security dependency checker for Python packages.</li>
<li><a href="https://github.com/Erotemic/xdoctest">xdoctest</a>: Tool for running code examples in docstrings.</li>
<li><a href="https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html">autodoc</a>: Sphinx extension for automatic documentation generation from docstrings.</li>
<li><a href="https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html">napoleon</a>: Sphinx extension for NumPy and Google style docstrings.</li>
<li><a href="https://sphinx-click.readthedocs.io/">sphinx-click</a>: Sphinx extension for Click-based command-line interfaces.</li>
<li><a href="https://github.com/marketplace/actions/github-labeler">GitHub Labeler</a>: GitHub Action for managing project labels.</li>
</ul>Simplifying Data Download from Google Drive in Google Colab Using gdown2023-07-24T00:00:00+02:002023-07-24T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-24:/download-data-google-drive-colab-gdown/<h2>Introduction</h2>
<p>In this blog post, we will explore a straightforward method to download data from Google Drive into your Google Colab notebook using the 'gdown' command. Google Colab is a popular platform for running Python code, especially for machine learning and data …</p><h2>Introduction</h2>
<p>In this blog post, we will explore a straightforward method to download data from Google Drive into your Google Colab notebook using the 'gdown' command. Google Colab is a popular platform for running Python code, especially for machine learning and data analysis tasks. By leveraging 'gdown,' a handy Python library, you can seamlessly access your files stored on Google Drive without any hassle. Let's dive right into the process!</p>
<h2>Steps</h2>
<h3>Step 1: Import gdown and Authenticate Google Drive</h3>
<p>To begin, ensure you have 'gdown' installed in your Colab environment. If it isn't already available, you can install it with the following code snippet:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">gdown</span>
</code></pre></div>
<h3>Step 2: Obtain the File's Shareable Link</h3>
<p>To download data from your Google Drive, you must first ensure the file or folder is publicly accessible. To do this, right-click on the file or folder in your Google Drive, select "Get Shareable Link," and set the sharing settings to "Anyone with the link."</p>
<h3>Step 3: Retrieve the ID from the Shareable Link</h3>
<p>Upon obtaining the shareable link, extract the file's ID from the link. The ID is typically found after "<a href="https://drive.google.com/file/d/">https://drive.google.com/file/d/</a>". For instance, if your link is "<a href="https://drive.google.com/file/d/ABC12345XYZ/view">https://drive.google.com/file/d/ABC12345XYZ/view</a>," then "ABC12345XYZ" is the file's ID.</p>
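<p>The ID can also be extracted programmatically. A minimal sketch (the regex and the helper name are illustrative, not part of gdown):</p>

```python
import re

def extract_drive_id(share_link: str) -> str:
    """Extract the file ID from a Google Drive shareable link."""
    match = re.search(r"/file/d/([\w-]+)", share_link)
    if match is None:
        raise ValueError(f"No file ID found in: {share_link}")
    return match.group(1)

print(extract_drive_id("https://drive.google.com/file/d/ABC12345XYZ/view"))  # ABC12345XYZ
```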
<h3>Step 4: Download the Data</h3>
<p>Using the gdown command, you can now effortlessly download the data from your Google Drive into your Colab notebook. The following code demonstrates how to do this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">gdown</span>
<span class="n">file_id</span> <span class="o">=</span> <span class="s2">"ABC12345XYZ"</span> <span class="c1"># Replace this with your file's ID</span>
<span class="n">output_file</span> <span class="o">=</span> <span class="s2">"data_file.ext"</span> <span class="c1"># Replace "data_file.ext" with the desired output filename and extension</span>
<span class="n">gdown</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="sa">f</span><span class="s2">"https://drive.google.com/uc?id=</span><span class="si">{</span><span class="n">file_id</span><span class="si">}</span><span class="s2">"</span><span class="p">,</span> <span class="n">output_file</span><span class="p">)</span>
</code></pre></div>
<h2>Conclusion</h2>
<p>In this brief guide, we have explored the process of downloading data from Google Drive into Google Colab using the 'gdown' command. By following these simple steps, you can seamlessly access and utilize your data for various machine learning, data analysis, or other Python-based projects in Google Colab. Happy coding!</p>Add VSCode to PATH2023-07-21T00:00:00+02:002023-07-21T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-21:/add-vscode-to-path/<p>If you get a <code>code: command not found</code> error even though VS Code is installed:</p>
<div class="highlight"><pre><span></span><code>><span class="w"> </span>code
zsh:<span class="w"> </span><span class="nb">command</span><span class="w"> </span>not<span class="w"> </span>found:<span class="w"> </span>code
</code></pre></div>
<p>it means that the <code>code</code> command is not in your system PATH. You need to add it.</p>
<p>To do that, follow these steps:</p>
<ol>
<li>
<p>Launch Visual …</p></li></ol><p>If you get a <code>code: command not found</code> error even though VS Code is installed:</p>
<div class="highlight"><pre><span></span><code>><span class="w"> </span>code
zsh:<span class="w"> </span><span class="nb">command</span><span class="w"> </span>not<span class="w"> </span>found:<span class="w"> </span>code
</code></pre></div>
<p>it means that the <code>code</code> command is not in your system PATH. You need to add it.</p>
<p>To do that, follow these steps:</p>
<ol>
<li>
<p>Launch Visual Studio Code.</p>
</li>
<li>
<p>Open the Command Palette by pressing <code>Cmd+Shift+P</code> (or <code>Ctrl+Shift+P</code> on Windows/Linux).</p>
</li>
<li>
<p>Type "shell command" in the Command Palette search bar.</p>
</li>
<li>
<p>You should see an option that says "Shell Command: Install 'code' command in PATH." Select it to add the <code>code</code> command to your system PATH.</p>
</li>
</ol>
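<p>If the Command Palette route isn't available, you can also add the launcher to PATH manually. On macOS the <code>code</code> script lives inside the app bundle (the path below assumes the default install location); add a line like this to your <code>~/.zshrc</code>:</p>

```shell
# Append VS Code's bin directory to PATH (default macOS install location)
export PATH="$PATH:/Applications/Visual Studio Code.app/Contents/Resources/app/bin"
```

<p>After reloading your shell (<code>source ~/.zshrc</code>), the <code>code</code> command should resolve.</p>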
<p>After completing these steps, you should be able to open Visual Studio Code directly from the terminal using the <code>code</code> command.</p>What is downstream task2023-07-21T00:00:00+02:002023-07-21T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-21:/what-is-downstream-task/<blockquote>
<p>In the context of data science and business, the term "downstream task" refers to a task or process that occurs after the completion of an initial or preceding task in a data pipeline or workflow. In this data flow, information is processed …</p></blockquote><blockquote>
<p>In the context of data science and business, the term "downstream task" refers to a task or process that occurs after the completion of an initial or preceding task in a data pipeline or workflow. In this data flow, information is processed and refined as it moves from one stage to another.</p>
</blockquote>
<p>To understand the concept better, let's consider a simplified data science workflow:</p>
<ol>
<li>
<p><strong>Data Collection</strong>: The first step is to gather and collect raw data from various sources, such as databases, APIs, or files.</p>
</li>
<li>
<p><strong>Data Preprocessing</strong>: Once the data is collected, it often needs to be cleaned, transformed, and structured in a way that makes it suitable for analysis. This step is known as data preprocessing.</p>
</li>
<li>
<p><strong>Feature Engineering</strong>: After preprocessing, relevant features (variables) are extracted from the data, and new features might be created to enhance the predictive power of the models.</p>
</li>
<li>
<p><strong>Model Training</strong>: With the prepared data, machine learning models are trained to make predictions or classifications based on patterns found in the data.</p>
</li>
<li>
<p><strong>Model Evaluation</strong>: After the models are trained, they need to be evaluated on a separate dataset to assess their performance and identify any issues such as overfitting or underfitting.</p>
</li>
</ol>
<p>Now, let's introduce the notion of "downstream tasks":</p>
<ol>
<li>
<p><strong>Model Deployment</strong>: Once the trained model(s) have been evaluated and deemed satisfactory, they are deployed into a production environment where they can be used to make predictions on new, unseen data.</p>
</li>
<li>
<p><strong>Decision Making</strong>: In a business context, the model's predictions are often used as inputs for making data-driven decisions. These decisions could be related to marketing strategies, customer segmentation, risk assessment, product recommendations, etc.</p>
</li>
<li>
<p><strong>Performance Monitoring</strong>: After the model has been deployed, its performance needs to be continually monitored to ensure that it maintains accuracy and relevance over time.</p>
</li>
<li>
<p><strong>Model Updating and Retraining</strong>: As new data becomes available and the model's performance deteriorates or needs improvement, it might be necessary to update or retrain the model to keep it up-to-date and accurate.</p>
</li>
</ol>
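<p>The relationship between upstream and downstream stages can be sketched as plain function composition, where each downstream step consumes the output of the one before it. The stage functions below are illustrative stubs, not a real pipeline:</p>

```python
def collect() -> list[dict]:
    """Upstream: gather raw records (stubbed data)."""
    return [{"age": 35, "bought": 1}, {"age": 22, "bought": 0}]

def preprocess(raw: list[dict]) -> list[dict]:
    """Upstream: clean/transform the raw records."""
    return [r for r in raw if r["age"] is not None]

def train(rows: list[dict]) -> float:
    """Upstream: the 'model' here is just the observed purchase rate."""
    return sum(r["bought"] for r in rows) / len(rows)

def decide(purchase_rate: float) -> str:
    """Downstream: a business decision driven by the model's output."""
    return "run campaign" if purchase_rate < 0.6 else "hold"

# Downstream value depends on every upstream stage having run first.
decision = decide(train(preprocess(collect())))
print(decision)  # purchase rate is 0.5 -> "run campaign"
```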
<p>In this workflow, <strong>"downstream tasks" are those that happen after the initial data preprocessing, model training, and evaluation stages. These tasks utilize the output of the earlier stages to make informed decisions and provide value to the business.</strong></p>Alternatives for Building Python CLI Apps2023-07-17T00:00:00+02:002023-07-17T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-17:/alternatives_for_building_python_cli_apps/<p>Discover the best tools and frameworks for building Python CLI apps. Explore Click, argparse, Typer, and more. Master the art of command-line application development.</p><p>Python provides several libraries and frameworks for building command-line interface (CLI) applications, each with its own set of features and advantages. In this article, we will explore some of the popular alternatives to build Python CLI apps, including Click, argparse, and Typer, among others.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#click">Click</a></li>
<li><a href="#argparse">argparse</a></li>
<li><a href="#typer">Typer</a></li>
<li><a href="#other-alternatives">Other Alternatives</a></li>
<li><a href="#fire">Fire</a></li>
<li><a href="#cement">cement</a></li>
<li><a href="#docopt">Docopt</a></li>
<li><a href="#plumbum">Plumbum</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="click"></a></p>
<h2>Click</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pallets/click.svg?logo=github"></p>
<p>Click is a powerful and widely used Python library for creating command-line interfaces. It focuses on simplicity and aims to make it easy to write and maintain CLI applications. Click provides a decorator-based approach for defining commands, options, and arguments, making it intuitive and straightforward to use. It supports complex command hierarchies, automatic help page generation, and customization options for output formatting. Click also offers advanced features such as context passing, callback handling, and parameter types. It has a large and active community, ensuring ongoing support and continuous development.</p>
<p>Click is an excellent choice for both simple and complex CLI applications. Its simplicity and intuitive API make it a great option for beginners, while its advanced features cater to more complex use cases. Whether you are building a small script or a full-fledged CLI tool, Click provides a solid foundation for developing robust and user-friendly applications.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Simple and intuitive API.</li>
<li>Decorator-based command definition.</li>
<li>Support for complex command hierarchies.</li>
<li>Automatic help page generation.</li>
<li>Advanced features like context passing and parameter types.</li>
</ul>
<p><strong>Use-case:</strong>
Click is suitable for a wide range of CLI applications, from small scripts to large-scale tools. It is a popular choice for building command-line interfaces in Python due to its simplicity, flexibility, and extensive feature set.</p>
<p>To learn more about Click, visit the <a href="https://click.palletsprojects.com/">official documentation</a> or explore the <a href="https://github.com/pallets/click">GitHub repository</a>.</p>
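<p>A minimal Click command, to give a feel for the decorator-based style described above (the command and option names are just for illustration):</p>

```python
import click

@click.command()
@click.option("--name", default="world", help="Who to greet.")
@click.option("--count", default=1, type=int, help="Number of greetings.")
def hello(name: str, count: int) -> None:
    """Greet NAME a total of COUNT times."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")
```

<p>Saved as a script and invoked with <code>hello()</code> under a main guard, this would respond to e.g. <code>python hello.py --name Ada --count 2</code>, and <code>--help</code> output is generated automatically.</p>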
<p><a id="argparse"></a></p>
<h2>argparse</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/python/cpython.svg?logo=github"></p>
<p>argparse is a standard library included in Python, making it readily available for CLI application development without any external dependencies. It provides a flexible and comprehensive framework for defining command-line arguments, options, and sub-commands. argparse supports automatic help generation, argument type checking, default values, and various customization options. It also handles error reporting and displays error messages with usage information. argparse's design promotes code reusability, making it easy to build CLI applications with modular components.</p>
<p>argparse is a versatile library suitable for a wide range of CLI applications. Its standard inclusion in Python ensures compatibility and ease of use, making it a popular choice for developers. Whether you are building a simple script or a complex application with multiple sub-commands, argparse provides a robust foundation for handling command-line arguments.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Standard library inclusion, no external dependencies.</li>
<li>Comprehensive framework for defining arguments and options.</li>
<li>Automatic help generation.</li>
<li>Error reporting and usage information display.</li>
<li>Code reusability and modular design.</li>
</ul>
<p><strong>Use-case:</strong>
argparse is well-suited for a variety of CLI applications, from basic scripts to more complex tools with sub-commands. Its standard library nature and comprehensive feature set make it a reliable choice for command-line argument handling in Python.</p>
<p>For detailed information about argparse, refer to the <a href="https://docs.python.org/3/library/argparse.html">official documentation</a> or explore the <a href="https://github.com/python/cpython">GitHub repository</a>.</p>
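<p>A small argparse sketch showing a positional argument and a flag (the argument names are illustrative; <code>parse_args</code> is given an explicit list here so the snippet is self-contained):</p>

```python
import argparse

# Build a parser with one positional argument and one boolean flag.
parser = argparse.ArgumentParser(prog="greet", description="Greeting tool")
parser.add_argument("name", help="Who to greet")
parser.add_argument("--shout", action="store_true", help="Uppercase the greeting")

args = parser.parse_args(["Ada", "--shout"])
greeting = f"Hello, {args.name}!"
if args.shout:
    greeting = greeting.upper()
print(greeting)  # HELLO, ADA!
```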
<p><a id="typer"></a></p>
<h2>Typer</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/tiangolo/typer.svg?logo=github"></p>
<p>Typer is a modern, fast, and efficient CLI framework built on top of Click. It offers a simple and concise API for building command-line interfaces in Python, with an emphasis on code readability and type hints. Typer automatically infers the types of arguments and options from their default values or annotations, reducing the need for boilerplate code. It provides features such as automatic help generation, completion generation for shells, and support for asynchronous execution.</p>
<p>Typer's simplicity and seamless integration with Click make it an appealing choice for developers who prioritize code clarity and conciseness. It leverages Python's type hints to improve developer productivity and reduce the likelihood of runtime errors. With its performance optimizations, Typer can handle large CLI applications efficiently.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Simple and concise API with emphasis on code readability.</li>
<li>Automatic type inference from default values or annotations.</li>
<li>Automatic help and completion generation.</li>
<li>Asynchronous execution support.</li>
<li>Performance optimizations for handling large applications.</li>
</ul>
<p><strong>Use-case:</strong>
Typer is particularly well-suited for developers who value code readability and conciseness. It is a great choice for building CLI applications of any size, ranging from small scripts to complex tools, with a focus on leveraging Python's type hints.</p>
<p>To learn more about Typer, refer to the <a href="https://typer.tiangolo.com/">official documentation</a> or explore the <a href="https://github.com/tiangolo/typer">GitHub repository</a>.</p>
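<p>The same greeting command in Typer style: parameters with defaults become CLI options, and their type hints drive parsing. A sketch (names are illustrative):</p>

```python
import typer

app = typer.Typer()

@app.command()
def hello(name: str = "world", count: int = 1) -> None:
    """Greet NAME a total of COUNT times."""
    for _ in range(count):
        typer.echo(f"Hello, {name}!")
```

<p>With a single registered command, Typer exposes it as the top-level interface, so <code>--name</code> and <code>--count</code> work directly without a subcommand name.</p>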
<p><a id="fire"></a></p>
<h2>Fire</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/google/python-fire.svg?logo=github"></p>
<p>Fire is a library developed by Google that automatically generates a command-line interface from Python objects. It allows you to turn any Python class or module into a CLI application without the need for explicit command definitions. Fire uses introspection to infer the available methods and attributes of an object, which are then exposed as CLI commands and arguments. This automatic generation of the CLI interface makes Fire incredibly convenient for quickly building command-line tools from existing code.</p>
<p>Fire's simplicity and automatic CLI generation make it an excellent choice for rapidly prototyping CLI applications. It eliminates the need for manually defining command structures and allows you to focus on the core functionality of your Python objects. While it may not offer the same level of customization as some other libraries, Fire excels in its ability to generate a functional CLI interface with minimal effort.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Automatic CLI generation from Python objects.</li>
<li>No explicit command definitions required.</li>
<li>Rapid prototyping of CLI applications.</li>
<li>Eliminates the need for manual command structure definitions.</li>
</ul>
<p><strong>Use-case:</strong>
Fire is best suited for quickly creating simple CLI tools based on existing Python code. It is ideal for situations where you want to expose the functionality of your Python objects through a command-line interface without the need for explicit command definitions.</p>
<p>To learn more about Fire, refer to the <a href="https://google.github.io/python-fire/">official documentation</a> or explore the <a href="https://github.com/google/python-fire">GitHub repository</a>.</p>
<p><a id="cement"></a></p>
<h2>cement</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/datafolklabs/cement.svg?logo=github"></p>
<p>cement is a powerful and extensible CLI framework for Python. It provides a complete set of features for building CLI applications, including command-line argument parsing, command line completion, output rendering, and plugin support. cement follows a modular design, allowing you to choose and configure only the components you need for your application. It offers support for both single-command and multi-command applications, making it versatile and adaptable to various use cases.</p>
<p>One of the standout features of cement is its plugin architecture, which enables easy integration of third-party functionality into your CLI application. It also provides a powerful and customizable output handler system, allowing you to define how the application's output is rendered and formatted. cement's extensive documentation and active community make it a reliable choice for developing robust CLI applications.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Comprehensive CLI framework with modular design.</li>
<li>Command-line argument parsing.</li>
<li>Command line completion.</li>
<li>Customizable output rendering.</li>
<li>Plugin architecture for easy integration of third-party functionality.</li>
</ul>
<p><strong>Use-case:</strong>
cement is suitable for building CLI applications of any complexity. Its modular design and extensive feature set make it an excellent choice for projects that require advanced customization, plugin support, and flexible output rendering.</p>
<p>For detailed information about cement, refer to the <a href="https://builtoncement.com/">official documentation</a> or explore the <a href="https://github.com/datafolklabs/cement">GitHub repository</a>.</p>
<p><a id="docopt"></a></p>
<h2>Docopt</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/docopt/docopt.svg?logo=github"></p>
<p>Docopt is a command-line interface description language and Python library that generates a CLI parser from human-readable usage patterns. It allows you to define the command-line interface by writing usage patterns and associated descriptions. Docopt then automatically generates a parser based on these patterns, handling argument parsing and help generation.</p>
<p>The simplicity and readability of Docopt's usage patterns make it a unique and user-friendly approach to building CLI applications. By using natural language to describe the command-line interface, Docopt simplifies the process of defining and maintaining CLI specifications. It supports both positional arguments and options and provides support for complex command hierarchies.</p>
<p>Docopt is an excellent choice for projects where a human-readable and self-documenting CLI interface is a priority. It allows developers to focus on writing clear usage patterns while leaving the parsing and help generation to the library.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Command-line interface description language.</li>
<li>Automatic parser generation from human-readable usage patterns.</li>
<li>Simplifies the process of defining and maintaining CLI specifications.</li>
<li>Support for positional arguments and options.</li>
<li>Natural language approach for clear usage patterns.</li>
</ul>
<p><strong>Use-case:</strong>
Docopt is best suited for projects where a human-readable and self-documenting CLI interface is desired. It is a good choice for developers who prefer a more descriptive and expressive way of defining the command-line interface.</p>
<p>For more information about Docopt, refer to the <a href="http://docopt.org/">official documentation</a> or explore the <a href="https://github.com/docopt/docopt">GitHub repository</a>.</p>
<p><a id="plumbum"></a></p>
<h2>Plumbum</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/tomerfiliba/plumbum.svg?logo=github"></p>
<p>Plumbum is a library that aims to simplify the process of writing shell-like scripts and command-line tools in Python. It provides an intuitive and concise API for executing shell commands, capturing their output, and handling command-line arguments. Plumbum allows you to seamlessly mix shell-like syntax and Python code, providing a powerful and flexible approach to command-line application development.</p>
<p>One of Plumbum's standout features is its ability to create reusable command templates. These templates encapsulate the common functionality of a command, allowing you to easily define and reuse complex command structures. Plumbum also offers support for input/output redirection, background execution, and shell pipeline operations.</p>
<p>Plumbum is an excellent choice for developers who want to combine the power of shell commands with the flexibility and expressiveness of Python. It simplifies the process of interacting with the command line and enables the creation of robust and maintainable CLI applications.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Intuitive and concise API for executing shell commands.</li>
<li>Seamless integration of shell-like syntax and Python code.</li>
<li>Reusable command templates for defining complex command structures.</li>
<li>Support for input/output redirection, background execution, and shell pipelines.</li>
</ul>
<p><strong>Use-case:</strong>
Plumbum is suitable for developers who want to leverage the power of shell commands while maintaining the flexibility and expressiveness of Python. It is a good choice for building command-line applications that require extensive interaction with the command line and complex command structures.</p>
<p>To learn more about Plumbum, refer to the <a href="https://plumbum.readthedocs.io/">official documentation</a> or explore the <a href="https://github.com/tomerfiliba/plumbum">GitHub repository</a>.</p>
<h2>Which Tool Should I Use in My Case?</h2>
<p>When choosing a tool for building Python CLI apps, it's important to consider the specific requirements of your project. Different tools excel in different scenarios. Here, we'll discuss three common use-cases with divergent requirements and suggest the best tools for each case along with justifications.</p>
<h3>1. Simple Script or Rapid Prototyping</h3>
<p>If you're building a simple script or need to rapidly prototype a CLI application, <strong>Click</strong> and <strong>Fire</strong> are excellent choices.</p>
<p><strong>Click</strong> offers a simple and intuitive API with decorator-based command definition, making it easy to create CLI apps quickly. It provides advanced features like context passing and parameter types, which can enhance the functionality of your script. Additionally, Click's extensive documentation and active community support make it a reliable choice.</p>
<p><strong>Fire</strong> is perfect for converting existing Python code into a CLI application effortlessly. With Fire, you can generate a command-line interface from any Python object without explicit command definitions. It prioritizes simplicity and allows you to focus on the core functionality of your code, making it ideal for rapid prototyping.</p>
<h3>2. Complex CLI Application with Advanced Customization</h3>
<p>For complex CLI applications that require advanced customization, <strong>argparse</strong> and <strong>cement</strong> are robust options.</p>
<p><strong>argparse</strong> is a Python standard library, providing a comprehensive framework for defining command-line arguments, options, and sub-commands. It supports automatic help generation, type checking, and error reporting. argparse's modular design promotes code reusability and is suitable for projects with multiple sub-commands and extensive customization requirements.</p>
<p><strong>cement</strong> is a powerful CLI framework that offers a complete set of features, including argument parsing, command line completion, output rendering, and plugin support. It follows a modular design, allowing you to choose the components you need. cement's plugin architecture enables easy integration of third-party functionality, and its customizable output rendering system provides flexibility.</p>
<h3>3. Human-Readable CLI Interface</h3>
<p>If you prioritize a human-readable and self-documenting CLI interface, consider <strong>Typer</strong> and <strong>Docopt</strong>.</p>
<p><strong>Typer</strong> is a modern CLI framework built on top of Click, emphasizing code readability and type hints. It automatically infers argument types, reducing boilerplate code. Typer's simplicity and integration with Python's type hints make it an appealing choice for developers who value code clarity.</p>
<p><strong>Docopt</strong> takes a unique approach, allowing you to define the command-line interface using human-readable usage patterns. It automatically generates a parser based on these patterns, handling argument parsing and help generation. Docopt's natural language approach simplifies the process of defining and maintaining CLI specifications, resulting in a clear and readable CLI interface.</p>Creating a PowerPoint Presentation with a Language Model2023-07-17T00:00:00+02:002023-07-17T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-17:/creating-a-powerpoint-presentation-with-a-language-model/<p>In this article, we'll explore how to generate a PowerPoint presentation using the OpenAI Azure API and provide additional advanced features to enhance the process.</p>
<h2>Prerequisites</h2>
<p>Before we begin, make sure you have the following prerequisites set up:</p>
<ul>
<li>Python 3.x installed …</li></ul><p>In this article, we'll explore how to generate a PowerPoint presentation using the OpenAI Azure API and provide additional advanced features to enhance the process.</p>
<h2>Prerequisites</h2>
<p>Before we begin, make sure you have the following prerequisites set up:</p>
<ul>
<li>Python 3.x installed on your machine</li>
<li>OpenAI API key</li>
<li>Required Python libraries: <code>python-pptx</code> and <code>openai</code></li>
</ul>
<p>You can install the libraries using the <code>pip</code> package manager:</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>python-pptx<span class="w"> </span>openai
</code></pre></div>
<h2>Step 1: Setting up the OpenAI API</h2>
<p>To get started, you'll need to sign up for the OpenAI API and obtain an API key. The API key allows you to interact with the GPT model. Follow the instructions in the OpenAI documentation to sign up and retrieve your API key.</p>
<h2>Step 2: Importing the Required Modules</h2>
<p>To work with PowerPoint and the OpenAI API, we need to import the necessary modules in our Python script. Specifically, we'll import the <code>Presentation</code> class from the <code>python-pptx</code> library and the <code>openai</code> module.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pptx</span> <span class="kn">import</span> <span class="n">Presentation</span>
<span class="kn">import</span> <span class="nn">openai</span>
</code></pre></div>
<h2>Step 3: Authenticating with the OpenAI API</h2>
<p>Next, we need to authenticate with the OpenAI API by providing our API key. This step ensures that we have the necessary permissions to access and utilize the GPT model.</p>
<div class="highlight"><pre><span></span><code><span class="n">openai</span><span class="o">.</span><span class="n">api_key</span> <span class="o">=</span> <span class="s1">'YOUR_API_KEY'</span>
</code></pre></div>
<p>Replace <code>'YOUR_API_KEY'</code> with the API key you obtained in Step 1.</p>
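<p>Rather than hard-coding the key in the script, it is safer to read it from an environment variable. A sketch (the <code>setdefault</code> placeholder is only so the snippet runs standalone; in practice, export <code>OPENAI_API_KEY</code> in your shell or Colab secrets and delete that line):</p>

```python
import os

# Placeholder so the snippet is self-contained; normally the variable
# is already exported in your environment.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")
api_key = os.environ["OPENAI_API_KEY"]
# openai.api_key = api_key  # then assign it as shown above
```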
<h2>Step 4: Generating the Presentation Outline with ChatGPT</h2>
<p>With the necessary setup complete, we can now use the ChatGPT model to generate an outline for our PowerPoint presentation. We'll provide a description of the presentation as input and receive a list of slides as output. The slides will form the basis of our presentation structure.</p>
<div class="highlight"><pre><span></span><code><span class="n">description</span> <span class="o">=</span> <span class="s2">"This presentation is about the benefits of exercise."</span>
<span class="n">outline_prompt</span> <span class="o">=</span> <span class="s2">"Create an outline for this presentation, one slide title per line. "</span> <span class="o">+</span> <span class="n">description</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">Completion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">engine</span><span class="o">=</span><span class="s2">"text-davinci-003"</span><span class="p">,</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">outline_prompt</span><span class="p">,</span>
    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># one completion containing the whole outline</span>
    <span class="n">stop</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span>
<span class="p">)</span>
<span class="n">slides</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<p>In this example, the <code>description</code> variable is folded into a prompt that asks the model for one slide title per line. The <code>max_tokens</code> parameter limits the response length, and <code>n=1</code> requests a single completion; the number of slides is determined by the outline the model returns (you can also state the desired slide count in the prompt). Feel free to adjust these parameters based on your specific needs.</p>
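<p>The raw completion often comes back with list numbering or bullet markers ("1. Introduction", "- Summary"). A small standard-library helper (hypothetical, not part of the original script) can normalize the lines into clean slide titles before the next step:</p>

```python
import re

def normalize_titles(raw_lines):
    """Strip leading numbering/bullets and drop blank lines from model output."""
    titles = []
    for line in raw_lines:
        cleaned = re.sub(r'^\s*(?:[-*]|\d+[.)])\s*', '', line).strip()
        if cleaned:
            titles.append(cleaned)
    return titles

print(normalize_titles(["1. Introduction", "2) Benefits", "", "- Summary"]))
# → ['Introduction', 'Benefits', 'Summary']
```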
<h2>Step 5: Generating Content for Each Slide</h2>
<p>To make our presentation informative, we'll use the ChatGPT model to generate body content for each slide in the outline. We iterate through the <code>slides</code> list and generate content for each slide title.</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">slide</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">slides</span><span class="p">):</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">Completion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">engine</span><span class="o">=</span><span class="s2">"text-davinci-003"</span><span class="p">,</span>
        <span class="n">prompt</span><span class="o">=</span><span class="n">slide</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span>
        <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">stop</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span>
    <span class="p">)</span>
    <span class="n">content</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="c1"># Store the title and generated content for the slide</span>
    <span class="n">slides</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'title'</span><span class="p">:</span> <span class="n">slide</span><span class="p">,</span> <span class="s1">'content'</span><span class="p">:</span> <span class="n">content</span><span class="p">}</span>
</code></pre></div>
<p>Here, we iterate through each slide in the <code>slides</code> list, generate the content using the ChatGPT model, and store the title and content in a dictionary. Adjust the <code>max_tokens</code> parameter based on the desired length of each slide's content.</p>
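<p>Because this loop makes one API call per slide, transient failures (rate limits, timeouts) become more likely. A simple retry wrapper with exponential backoff (a hedged sketch; the helper and its defaults are my own, not part of the OpenAI library) keeps the loop robust:</p>

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    """Invoke a zero-argument callable, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage inside the loop (sketch):
# response = with_retries(lambda: openai.Completion.create(
#     engine="text-davinci-003", prompt=slide, max_tokens=150))
```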
<h2>Step 6: Creating the PowerPoint Presentation</h2>
<p>With the slide titles and content generated, it's time to create the PowerPoint presentation using the <code>python-pptx</code> library. We'll iterate through the slides and add them to the presentation with the appropriate titles and content.</p>
<div class="highlight"><pre><span></span><code><span class="n">presentation</span> <span class="o">=</span> <span class="n">Presentation</span><span class="p">()</span>
<span class="k">for</span> <span class="n">slide_data</span> <span class="ow">in</span> <span class="n">slides</span><span class="p">:</span>
    <span class="n">slide_layout</span> <span class="o">=</span> <span class="n">presentation</span><span class="o">.</span><span class="n">slide_layouts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># Choose the layout for the slide</span>
    <span class="n">slide</span> <span class="o">=</span> <span class="n">presentation</span><span class="o">.</span><span class="n">slides</span><span class="o">.</span><span class="n">add_slide</span><span class="p">(</span><span class="n">slide_layout</span><span class="p">)</span>
    <span class="n">slide</span><span class="o">.</span><span class="n">shapes</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">slide_data</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span>
    <span class="n">slide</span><span class="o">.</span><span class="n">placeholders</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">slide_data</span><span class="p">[</span><span class="s1">'content'</span><span class="p">]</span>
<span class="n">presentation</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">"generated_presentation.pptx"</span><span class="p">)</span>
</code></pre></div>
<p>In this example, we create a new slide for each item in the <code>slides</code> list. We set the title and content for each slide and save the presentation as a PowerPoint file named "generated_presentation.pptx". You can adjust the slide layout by choosing a different index from the <code>slide_layouts</code> list.</p>
<h2>Possible Next Features for the Presentation Generation Script</h2>
<p>While the script we've created is already capable of generating PowerPoint presentations, we can enhance it further with additional features. Here are a few possible next steps to consider:</p>
<ol>
<li>
<p><strong>Slide Customization</strong>: Allow users to specify different slide layouts, fonts, colors, and background images to customize the visual appearance of their presentation.</p>
</li>
<li>
<p><strong>Image Integration</strong>: Extend the script to generate slides with images. This can involve using AI models to automatically search and retrieve relevant images based on the content of each slide.</p>
</li>
<li>
<p><strong>Interactive Presentations</strong>: Utilize technologies like Jupyter Notebook or web-based frameworks to create interactive presentations that allow viewers to engage with the content dynamically.</p>
</li>
<li>
<p><strong>Natural Language Processing</strong>: Incorporate natural language processing techniques to analyze the generated content and provide suggestions for improvements, such as grammar corrections, more concise wording, or alternative phrasing.</p>
</li>
</ol>
<p>By implementing these features, the presentation generation script can become more versatile and provide a richer experience for users.</p>
<h2>Alternative approach: let the LLM generate a Visual Basic script</h2>
<p>In this article we use Python to generate the slides. Alternatively, you can ask the model (ChatGPT) for a VBA (Visual Basic for Applications) script that builds the presentation for you. You can learn this approach from the video: <a href="https://www.youtube.com/watch?v=JoedhPPi3O0">Create Beautiful PowerPoint Slides with ChatGPT + VBA: Quick Tip! - YouTube</a></p>
<h2>Conclusion</h2>
<p>In this article, we've explored how to create a PowerPoint presentation using a language model, specifically OpenAI's GPT model accessed through the OpenAI API. We've covered the steps from setting up the OpenAI API to generating an outline and filling the slides with content. Additionally, we discussed possible next features to enhance the script, such as slide customization, image integration, interactive presentations, and natural language processing. By expanding upon these features, you can create powerful presentation automation tools tailored to your specific needs.</p>
<p>Automating presentation generation not only saves time and effort but also opens up new possibilities for creating engaging and informative presentations. With the help of AI and language models, we can revolutionize the way presentations are created, allowing presenters to focus more on refining their ideas and delivering impactful content.</p>Time Travel in Git - Creating a Branch from the Past and Crafting a New Future2023-07-14T00:00:00+02:002023-07-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-14:/time-travel-in-git-creating-a-branch-from-the-past-and-crafting-a-new future/<h2>Introduction</h2>
<p>In this guide, we will learn how to create a new branch in a Git repository based on a previous commit. We have commit history as below.
<img alt="before" src="images/git_time_travel/git-time-travel-1.png"></p>
<!--
<div class="highlight"><pre><span></span><code>gitGraph
commit id: "A"
commit id: "B"
commit id: "C"
commit id: "D"
commit id: "E"
</code></pre></div>
-->
<p>We are not happy with the changes C, D and E. We would like …</p><h2>Introduction</h2>
<p>In this guide, we will learn how to create a new branch in a Git repository based on a previous commit. We have commit history as below.
<img alt="before" src="images/git_time_travel/git-time-travel-1.png"></p>
<!--
<div class="highlight"><pre><span></span><code>gitGraph
commit id: "A"
commit id: "B"
commit id: "C"
commit id: "D"
commit id: "E"
</code></pre></div>
-->
<p>We are not happy with the changes C, D and E. We would like to start again from B, but we want to keep changes C, D and E in a new branch. Specifically, we will create a new branch starting from commit B in the main branch. We'll move the subsequent commits C, D, and E to the new branch and continue working on the main branch from the state of commit B - new commits F and G.
<img alt="after" src="images/git_time_travel/git-time-travel-2.png"></p>
<!--
<div class="highlight"><pre><span></span><code>gitGraph
commit id: "A"
commit id: "B"
branch feature-1
commit id: "C"
commit id: "D"
commit id: "E"
checkout main
commit id: "F"
commit id: "G"
</code></pre></div>
-->
<p>This guide assumes you have a basic understanding of Git commands and are familiar with the command line interface.</p>
<h2>Step-by-Step Guide</h2>
<h3>Determine the current branch and commit</h3>
<p>Open the terminal and navigate to the Git repository where you want to perform this operation. Use the following commands to check the current branch and to find the hash of commit B, which you will need later:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>status
git<span class="w"> </span>log<span class="w"> </span>--oneline
</code></pre></div>
<h3>Create a new branch that keeps commits C, D, and E</h3>
<p>The simplest way to preserve commits C, D, and E is to create the new branch at the current tip of the main branch (commit E):</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>branch<span class="w"> </span>new-branch-name
</code></pre></div>
<p>Replace <code>new-branch-name</code> with the desired name for your new branch. Because the branch is created at the current tip, it already contains commits C, D, and E. This command creates the branch without switching to it.</p>
<h3>Reset the main branch to commit B</h3>
<p>With commits C, D, and E safely referenced by the new branch, move the main branch pointer back to commit B:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>checkout<span class="w"> </span>main
git<span class="w"> </span>reset<span class="w"> </span>--hard<span class="w"> </span>commit-B-hash
</code></pre></div>
<p>Replace <code>commit-B-hash</code> with the hash or unique identifier of commit B. Be careful: <code>git reset --hard</code> discards any uncommitted changes in your working directory. If the main branch has already been pushed, you will also need to force-push (for example with <code>git push --force-with-lease</code>) and coordinate with your collaborators.</p>
<p>Your working directory is now on the main branch, at the state of commit B, while commits C, D, and E live on the new branch.</p>
<h3>Make changes to the main branch based on commit B</h3>
<p>You are now on the main branch, as it was at commit B. Make the necessary changes or improvements.</p>
<h3>Commit the changes on the main branch</h3>
<p>Stage your changes using the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>add<span class="w"> </span>.
</code></pre></div>
<p>Commit the changes with a descriptive message using the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>commit<span class="w"> </span>-m<span class="w"> </span><span class="s2">"Describe your changes or improvements"</span>
</code></pre></div>
<h3>Continue development on the main branch</h3>
<p>At this point, you can continue making new commits on the main branch, just as you would in any normal development workflow.</p>
<blockquote>
<p><strong>NOTE</strong>: committing directly to the main branch is generally not considered a best practice; you can learn more about this from the various Git branching strategies. We use this schema here for the sake of simplicity.</p>
</blockquote>
<h2>Conclusion</h2>
<p>Congratulations! You have successfully created a new branch that keeps commits C, D, and E, and reset the main branch to its state at commit B, allowing you to continue development from that point. Remember to use history-rewriting Git commands with caution, and make sure to create backups or push your changes to a remote repository for safety.</p>
<h2>Why Use Temporary Files and Directories?</h2>
<p>Temporary files and directories are essential when you need to store intermediate results, cache data, or hold information during the execution of a program. They can help you minimize memory usage and improve performance by reducing the need to recompute expensive operations. Moreover, temporary files can be useful in scenarios like <a href="https://en.wikipedia.org/wiki/Unit_testing">unit testing</a>, where you need to create mock files and directories for testing purposes.</p>
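<p>As a minimal illustration of the unit-testing use-case, a test can build its fixture files inside a temporary directory that is cleaned up automatically (a sketch using only the standard library; <code>count_lines</code> is a made-up function under test):</p>

```python
import os
import tempfile
import unittest

def count_lines(path):
    """Count the lines in a text file."""
    with open(path) as f:
        return sum(1 for _ in f)

class CountLinesTest(unittest.TestCase):
    def test_counts_lines_in_fixture(self):
        # The fixture file lives in a throwaway directory that is
        # removed automatically when the context manager exits.
        with tempfile.TemporaryDirectory() as tmp_dir:
            fixture = os.path.join(tmp_dir, "sample.txt")
            with open(fixture, "w") as f:
                f.write("one\ntwo\nthree\n")
            self.assertEqual(count_lines(fixture), 3)
```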
<h2>Creating Temporary Files</h2>
<p>The <code>tempfile</code> module provides several functions to create temporary files, including <code>TemporaryFile</code>, <code>NamedTemporaryFile</code>, and <code>SpooledTemporaryFile</code>.</p>
<h3>TemporaryFile</h3>
<p>The <a href="https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile"><code>TemporaryFile</code></a> function creates an anonymous temporary file that is deleted when it is closed. This function returns a file-like object that can be used with Python's standard I/O operations:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">TemporaryFile</span><span class="p">()</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This is a temporary file.'</span><span class="p">)</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h3>NamedTemporaryFile</h3>
<p>The <a href="https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile"><code>NamedTemporaryFile</code></a> function is similar to <code>TemporaryFile</code>, but the file has a visible name in the file system. The file is deleted when it is closed:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">()</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This is a named temporary file.'</span><span class="p">)</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h3>SpooledTemporaryFile</h3>
<p>The <a href="https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile"><code>SpooledTemporaryFile</code></a> function creates a temporary file that is stored in memory (using <code>io.BytesIO</code> or <code>io.StringIO</code>) until it reaches a specified size. Once the size is exceeded, the data is automatically written to disk:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">SpooledTemporaryFile</span><span class="p">(</span><span class="n">max_size</span><span class="o">=</span><span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This is a spooled temporary file.'</span><span class="p">)</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h2>Creating Temporary Directories</h2>
<p>The <code>tempfile</code> module provides the <a href="https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory"><code>TemporaryDirectory</code></a> function to create temporary directories. These directories, along with their contents, are automatically deleted when the context manager exits:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">TemporaryDirectory</span><span class="p">()</span> <span class="k">as</span> <span class="n">temp_dir</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary directory: </span><span class="si">{</span><span class="n">temp_dir</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
    <span class="n">temp_file_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">temp_dir</span><span class="p">,</span> <span class="s1">'temp_file.txt'</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">temp_file_path</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
        <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'This file is inside the temporary directory.'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Temporary directory and file have been deleted.'</span><span class="p">)</span>
</code></pre></div>
<h2>Customizing Temporary File and Directory Names</h2>
<p>You can customize the names of temporary files and directories using the <code>prefix</code>, <code>suffix</code>, and <code>dir</code> arguments. For example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">prefix</span><span class="o">=</span><span class="s1">'my_temp_'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="s1">'.txt'</span><span class="p">,</span> <span class="nb">dir</span><span class="o">=</span><span class="s1">'/tmp'</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary file path: </span><span class="si">{</span><span class="n">temp_file</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<h2>Managing File and Directory Lifetimes</h2>
<p>By default, temporary files and directories are deleted when their corresponding file-like objects are closed. However, you can use the <code>delete</code> argument to control this behavior:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">delete</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This temporary file will not be deleted.'</span><span class="p">)</span>
    <span class="n">temp_file_path</span> <span class="o">=</span> <span class="n">temp_file</span><span class="o">.</span><span class="n">name</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">temp_file_path</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h2>Securely Creating Files and Directories with Unique Names</h2>
<p>The <code>tempfile</code> module also provides the lower-level <a href="https://docs.python.org/3/library/tempfile.html#tempfile.mkstemp"><code>mkstemp</code></a> and <a href="https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp"><code>mkdtemp</code></a> functions, which securely create a temporary file or directory with a randomly generated, unique name and return its path. Unlike the functions above, they perform no automatic cleanup, so you are responsible for deleting the file or directory yourself:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">tempfile</span>
<span class="c1"># mkstemp returns an OS-level file descriptor and the file path</span>
<span class="n">fd</span><span class="p">,</span> <span class="n">temp_file_path</span> <span class="o">=</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">mkstemp</span><span class="p">()</span>
<span class="n">os</span><span class="o">.</span><span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">)</span> <span class="c1"># close the descriptor; the file itself remains on disk</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary file path: </span><span class="si">{</span><span class="n">temp_file_path</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="n">temp_dir_path</span> <span class="o">=</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary directory path: </span><span class="si">{</span><span class="n">temp_dir_path</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<h2>Conclusion</h2>
<p>In this article, we've explored the powerful features of Python's <code>tempfile</code> module, covering common use-cases and some lesser-known features. With these tools at your disposal, you can easily create and manage temporary files and directories in your Python applications.</p>Exploring Python Packages for Loading and Processing YAML Front Matter in Markdown Documents2023-07-11T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-11:/python-packages-yaml-front-matter-markdown/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#pyyaml">PyYAML</a></li>
<li><a href="#frontmatter">Frontmatter</a></li>
<li><a href="#yaml-front-matter">YAML Front Matter</a></li>
<li><a href="#python-markdown">Python Markdown</a></li>
<li><a href="#mistune">mistune</a></li>
<li><a href="#commonmark">Commonmark</a></li>
<li><a href="#which-one-to-use-in-my-case">Which one to use in my case?</a></li>
<li><a href="#simple-front-matter-extraction">Simple Front Matter Extraction</a></li>
<li><a href="#advanced-front-matter-manipulation">Advanced Front Matter Manipulation</a></li>
<li><a href="#seamless-integration-with-markdown-parsing">Seamless Integration with Markdown Parsing</a></li>
<li><a href="#performance-and-speed">Performance and Speed</a></li>
<li><a href="#commonmark-compliance">CommonMark Compliance</a></li>
<li><a href="#minimalistic-approach">Minimalistic Approach</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p>Markdown has gained …</p><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#pyyaml">PyYAML</a></li>
<li><a href="#frontmatter">Frontmatter</a></li>
<li><a href="#yaml-front-matter">YAML Front Matter</a></li>
<li><a href="#python-markdown">Python Markdown</a></li>
<li><a href="#mistune">mistune</a></li>
<li><a href="#commonmark">Commonmark</a></li>
<li><a href="#which-one-to-use-in-my-case">Which one to use in my case?</a></li>
<li><a href="#simple-front-matter-extraction">Simple Front Matter Extraction</a></li>
<li><a href="#advanced-front-matter-manipulation">Advanced Front Matter Manipulation</a></li>
<li><a href="#seamless-integration-with-markdown-parsing">Seamless Integration with Markdown Parsing</a></li>
<li><a href="#performance-and-speed">Performance and Speed</a></li>
<li><a href="#commonmark-compliance">CommonMark Compliance</a></li>
<li><a href="#minimalistic-approach">Minimalistic Approach</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p>Markdown has gained popularity as a lightweight markup language for creating structured documents. It is widely used in various domains, including blogging, documentation, and note-taking. Markdown documents often include front matter, which is a metadata section at the beginning of the document. This front matter typically contains YAML (YAML Ain't Markup Language) formatted data that provides additional information about the document. In this blog post, we will explore several Python packages that can help you load and process YAML front matter in Markdown documents, providing you with the necessary tools to extract and work with this valuable metadata.</p>
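<p>Before looking at specific packages, it helps to see the shape of the problem: front matter sits between a pair of <code>---</code> fences at the top of the file, and splitting it off needs nothing beyond the standard library (a minimal sketch; real-world documents may need sturdier parsing, which is what the packages below provide):</p>

```python
def split_front_matter(text):
    """Split a Markdown document into (front_matter, body).

    Assumes the document starts with a '---' fence; returns an
    empty front matter string otherwise.
    """
    if not text.startswith('---'):
        return '', text
    _, front_matter, body = text.split('---', 2)
    return front_matter.strip(), body.lstrip('\n')

doc = "---\ntitle: Hello\n---\n# Heading\n"
fm, body = split_front_matter(doc)
print(fm)    # title: Hello
print(body)  # # Heading
```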
<p><a id="pyyaml"></a></p>
<h3>PyYAML</h3>
<p>PyYAML is a powerful YAML parser and emitter for Python. It allows you to easily read and write YAML files, making it a suitable choice for extracting YAML front matter from Markdown documents.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/PyYAML/">PyYAML</a></li>
<li>GitHub: <a href="https://github.com/yaml/pyyaml">PyYAML on GitHub</a></li>
</ul>
<p>Example on how to load, modify and save front matter to markdown document:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">yaml</span>
<span class="c1"># Read the document and split off the front matter</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'article.md'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
    <span class="n">content</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">_</span><span class="p">,</span> <span class="n">front_matter</span><span class="p">,</span> <span class="n">body</span> <span class="o">=</span> <span class="n">content</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'---'</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">yaml</span><span class="o">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">front_matter</span><span class="p">)</span>
<span class="c1"># Modify front matter</span>
<span class="n">data</span><span class="p">[</span><span class="s1">'Modified'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'2023-07-12'</span>
<span class="c1"># Write the updated front matter and the original body back</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'article.md'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'---</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">yaml</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">default_flow_style</span><span class="o">=</span><span class="kc">False</span><span class="p">))</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'---'</span><span class="p">)</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">body</span><span class="p">)</span>
</code></pre></div>
<p><a id="frontmatter"></a></p>
<h3>python-frontmatter</h3>
<p><a href="http://jekyllrb.com/">Jekyll</a>-style YAML front matter offers a useful way to add arbitrary, structured metadata to text documents, regardless of type.
This is a small package to load and parse files (or just text) with YAML (or JSON, TOML or other) front matter.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/python-frontmatter/">python-frontmatter</a></li>
<li>GitHub: <a href="https://github.com/eyeseast/python-frontmatter">python-frontmatter on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">frontmatter</span>
<span class="c1"># Read front matter from a Markdown file</span>
<span class="n">post</span> <span class="o">=</span> <span class="n">frontmatter</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">'article.md'</span><span class="p">)</span>
<span class="c1"># Modify front matter</span>
<span class="n">post</span><span class="p">[</span><span class="s1">'modified'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'2023-07-12'</span>
<span class="c1"># Write front matter back to the Markdown file</span>
<span class="n">frontmatter</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">post</span><span class="p">,</span> <span class="s1">'article.md'</span><span class="p">)</span>
</code></pre></div>
<p><a id="python-markdown"></a></p>
<h3>Python Markdown</h3>
<p>Python Markdown is a popular package for parsing and rendering Markdown documents. While its primary focus is converting Markdown to HTML, it also ships extensions such as <code>meta</code>, which parses metadata at the top of a document.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/Markdown/">Python Markdown</a></li>
<li>GitHub: <a href="https://github.com/Python-Markdown/markdown">Python Markdown on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code>import markdown

# Read front matter from a Markdown file
with open('article.md', 'r') as file:
    content = file.read()

# The built-in 'meta' extension parses key: value lines at the top
# of the document into md.Meta (keys lowercased, values as lists)
md = markdown.Markdown(extensions=['meta'])
md.convert(content)

# Modify front matter
md.Meta['modified'] = ['2023-07-12']

# Write the metadata and the body back; with the plain key: value
# syntax (no '---' delimiters) the meta block ends at the first
# blank line, so the body is everything after it
body = content.split('\n\n', 1)[1]
with open('article.md', 'w') as file:
    for key, values in md.Meta.items():
        file.write(f'{key}: {values[0]}\n')
    file.write('\n')
    file.write(body)
</code></pre></div>
<p><a id="mistune"></a></p>
<h3>mistune</h3>
<p>mistune is a fast and extensible Markdown parser implemented in pure Python. It aims to be compatible with the Markdown specification while offering various customization options through its plugin system.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/mistune/">mistune</a></li>
<li>GitHub: <a href="https://github.com/lepture/mistune">mistune on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code>import re

import mistune
import yaml

# Read front matter from a Markdown file
with open('article.md', 'r') as file:
    content = file.read()

# mistune itself does not parse front matter, so extract the
# '---'-delimited block manually and parse it with PyYAML
match = re.match(r'---\n(.*?)\n---\n', content, re.DOTALL)
data = yaml.safe_load(match.group(1))
body = content[match.end():]

# Modify front matter
data['Modified'] = '2023-07-12'

# Write the front matter and body back; mistune can still render
# the body to HTML with mistune.html(body)
with open('article.md', 'w') as file:
    file.write('---\n')
    file.write(yaml.dump(data, default_flow_style=False))
    file.write('---\n')
    file.write(body)
</code></pre></div>
<p><a id="commonmark"></a></p>
<h3>Commonmark</h3>
<p>Commonmark is a comprehensive Markdown parsing and rendering library for Python that adheres to the CommonMark specification. The specification itself does not cover front matter, so the example below extracts it with a regular expression before processing the body.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/commonmark/">Commonmark</a></li>
<li>GitHub: <a href="https://github.com/readthedocs/commonmark.py">Commonmark on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code>import re

import commonmark
import yaml

# Read front matter from a Markdown file
with open('article.md', 'r') as file:
    content = file.read()

# Extract the front matter block with a regular expression
front_matter = re.search(r'^---\n(.*?)\n---\n', content, re.DOTALL)
data = yaml.safe_load(front_matter.group(1))
body = content.replace(front_matter.group(0), '')

# Modify front matter
data['Modified'] = '2023-07-12'

# Write front matter back to the Markdown file
with open('article.md', 'w') as file:
    file.write('---\n')
    file.write(yaml.dump(data, default_flow_style=False))
    file.write('---\n')
    file.write(body)

# The remaining body renders to HTML in a CommonMark-compliant way
html = commonmark.commonmark(body)
</code></pre></div>
<p><a id="which-one-to-use-in-my-case"></a></p>
<h2>Which one to use in my case?</h2>
<p>Here are distinct use cases related to loading and processing YAML front matter in Markdown documents, along with recommended libraries for each case and the justifications for the recommendations:</p>
<p><a id="simple-front-matter-extraction"></a></p>
<h3>Simple Front Matter Extraction</h3>
<ul>
<li>Recommended Library: <strong>python-frontmatter</strong></li>
</ul>
<blockquote>
<p>python-frontmatter is a dedicated package designed specifically for working with front matter in Markdown documents. It provides a simple and intuitive API for extracting front matter data, making it a suitable choice for straightforward front matter extraction needs.</p>
</blockquote>
<p><a id="advanced-front-matter-manipulation"></a></p>
<h3>Advanced Front Matter Manipulation</h3>
<ul>
<li>Recommended Library: <strong>PyYAML</strong></li>
</ul>
<blockquote>
<p>PyYAML is a powerful YAML parser and emitter for Python. If you require advanced manipulation and processing of YAML front matter, PyYAML offers extensive functionality and flexibility. It allows you to read and write YAML files, making it a robust choice for complex front matter handling.</p>
</blockquote>
<p><a id="seamless-integration-with-markdown-parsing"></a></p>
<h3>Seamless Integration with Markdown Parsing</h3>
<ul>
<li>Recommended Library: <strong>Python Markdown</strong></li>
</ul>
<blockquote>
<p>If your focus is on seamless integration with Markdown parsing, Python Markdown is a widely-used and feature-rich package. It supports custom extensions, including front matter parsing, allowing you to extract front matter while parsing the Markdown content. This integration can simplify your workflow when working with Markdown documents.</p>
</blockquote>
<p><a id="performance-and-speed"></a></p>
<h3>Performance and Speed</h3>
<ul>
<li>Recommended Library: <strong>mistune</strong></li>
</ul>
<blockquote>
<p>mistune is a fast and extensible Markdown parser implemented in pure Python. If performance and speed are crucial factors in your use case, mistune's efficient parsing capabilities make it an ideal choice, and its plugin system keeps customization possible without sacrificing that speed.</p>
</blockquote>
<p><a id="commonmark-compliance"></a></p>
<h3>CommonMark Compliance</h3>
<ul>
<li>Recommended Library: <strong>Commonmark</strong></li>
</ul>
<p>If adhering to the CommonMark specification is essential, Commonmark is a comprehensive Markdown parsing and rendering library that aligns with the specification. Front matter is not part of the CommonMark standard, so it is extracted separately (for example with a regular expression and PyYAML), while the document body is processed in a fully standards-compliant way.</p>
<p><a id="minimalistic-approach"></a></p>
<h3>Minimalistic Approach</h3>
<ul>
<li>Recommended Library: <strong>YAML Front Matter</strong></li>
</ul>
<p>YAML Front Matter is a minimalistic package that focuses on simplicity and ease of use. If you prefer a lightweight solution for extracting YAML front matter from Markdown files without additional complexity, YAML Front Matter provides a straightforward and efficient approach.</p>
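<p>For the truly minimal case, the split itself needs only the standard library. The helper below is my own illustration (the function name is not taken from any package above): it separates a <code>---</code>-delimited front matter block from the body, leaving the YAML text to be parsed by a library such as PyYAML if needed.</p>

```python
import re

def split_front_matter(text):
    """Split a '---'-delimited front matter block from a Markdown string.

    Returns (front_matter_text, body); front_matter_text is None
    when the document has no front matter block.
    """
    match = re.match(r'---\n(.*?)\n---\n', text, re.DOTALL)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]

meta, body = split_front_matter("---\ntitle: Hello\n---\n\n# Hello\n")
# meta == "title: Hello", body == "\n# Hello\n"
```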
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>In this blog post, we explored several Python packages that can load and process YAML front matter in Markdown documents. These packages provide convenient and efficient methods for extracting metadata from the front matter section, enabling you to access and manipulate this valuable information.</p>Boosting Productivity and Automation With AppleScript on macOS2023-07-10T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-10:/Boosting Productivity and Automation with AppleScript on macOS/<h2>Introduction</h2>
<p>In today's fast-paced digital world, maximizing productivity and finding ways to automate tasks are essential skills. macOS provides a powerful tool called AppleScript, which allows users to write scripts and automate various processes. In this blog post, we will explore the …</p><h2>Introduction</h2>
<p>In today's fast-paced digital world, maximizing productivity and finding ways to automate tasks are essential skills. macOS provides a powerful tool called AppleScript, which allows users to write scripts and automate various processes. In this blog post, we will explore the capabilities of AppleScript, discuss cool tricks, and highlight its alternatives.</p>
<!-- MarkdownTOC levels="2,3,4" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#getting-started-with-applescript">Getting Started with AppleScript</a></li>
<li><a href="#increasing-productivity-with-applescript">Increasing Productivity with AppleScript</a><ul>
<li><a href="#customized-workflow">Customized Workflow</a></li>
<li><a href="#application-control">Application Control</a></li>
<li><a href="#system-automation">System Automation</a></li>
</ul>
</li>
<li><a href="#cool-tricks-with-applescript">Cool Tricks with AppleScript</a><ul>
<li><a href="#displaying-notifications">Displaying Notifications</a></li>
<li><a href="#text-manipulation">Text Manipulation</a></li>
<li><a href="#gui-automation">GUI Automation</a></li>
</ul>
</li>
<li><a href="#alternatives-to-applescript">Alternatives to AppleScript</a><ul>
<li><a href="#automator">Automator</a></li>
<li><a href="#hammerspoon">Hammerspoon</a></li>
<li><a href="#keyboard-maestro">Keyboard Maestro</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="getting-started-with-applescript"></a></p>
<h3>Getting Started with AppleScript</h3>
<p>AppleScript is a scripting language that enables users to control applications and perform tasks on macOS. It utilizes the <code>osascript</code> command-line utility to execute AppleScript code. To begin using AppleScript, open the Terminal on your Mac and enter the desired commands preceded by <code>osascript -e</code>.</p>
<p>The <code>osascript</code> documentation provides examples:</p>
<div class="highlight"><pre><span></span><code>Open<span class="w"> </span>or<span class="w"> </span>switch<span class="w"> </span>to<span class="w"> </span>Safari:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell app "Safari" to activate'</span>
Close<span class="w"> </span>safari:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'quit app "safari.app"'</span>
Empty<span class="w"> </span>the<span class="w"> </span>trash:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell application "Finder" to empty trash'</span>
Set<span class="w"> </span>the<span class="w"> </span>output<span class="w"> </span>volume<span class="w"> </span>to<span class="w"> </span><span class="m">50</span>%<span class="w"> </span>
$<span class="w"> </span>sudo<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume output volume 50'</span>
Input<span class="w"> </span>volume<span class="w"> </span>and<span class="w"> </span>Alert<span class="w"> </span>volume<span class="w"> </span>can<span class="w"> </span>also<span class="w"> </span>be<span class="w"> </span><span class="nb">set</span><span class="w"> </span>from<span class="w"> </span><span class="m">0</span><span class="w"> </span>to<span class="w"> </span><span class="m">100</span>%:<span class="w"> </span>
$<span class="w"> </span>sudo<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume input volume 40'</span><span class="w"> </span>
$<span class="w"> </span>sudo<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume alert volume 75'</span><span class="w"> </span>
Mute<span class="w"> </span>the<span class="w"> </span>output<span class="w"> </span>volume<span class="w"> </span><span class="o">(</span>True/False<span class="o">)</span>:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume output muted TRUE'</span>
Toggle<span class="w"> </span>volume<span class="w"> </span>muting:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume output muted not (output muted of (get volume settings))'</span>
Toggle<span class="w"> </span>system<span class="w"> </span>theme:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell application "System Events" to tell appearance preferences to set dark mode to not dark mode'</span>
Shut<span class="w"> </span>down<span class="w"> </span>without<span class="w"> </span>asking<span class="w"> </span><span class="k">for</span><span class="w"> </span>confirmation:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell app "System Events" to shut down'</span>
Restart<span class="w"> </span>without<span class="w"> </span>asking<span class="w"> </span><span class="k">for</span><span class="w"> </span>confirmation:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell app "System Events" to restart'</span>
</code></pre></div>
<p><a id="increasing-productivity-with-applescript"></a></p>
<h3>Increasing Productivity with AppleScript</h3>
<p><a id="customized-workflow"></a></p>
<h4>Customized Workflow</h4>
<p>AppleScript enables you to create personalized workflows by automating repetitive tasks. For example, you can write a script that renames and moves files based on specific criteria, saving you time and effort.</p>
<blockquote>
<p>Example Script 1: <strong>Automating File Organization</strong></p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"Finder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">sourceFolder</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">choose</span><span class="w"> </span><span class="nv">folder</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">prompt</span><span class="w"> </span><span class="s2">"Select the source folder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">destinationFolder</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">choose</span><span class="w"> </span><span class="nv">folder</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">prompt</span><span class="w"> </span><span class="s2">"Select the destination folder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">fileList</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">every</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">sourceFolder</span>
<span class="w"> </span><span class="nv">repeat</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="nv">fileList</span>
<span class="w"> </span><span class="nv">move</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">destinationFolder</span>
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">repeat</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
</code></pre></div>
<blockquote>
<p>This script allows you to select a source folder and a destination folder. It moves all files from the source folder to the destination folder, simplifying your file organization process.</p>
</blockquote>
<p><a id="application-control"></a></p>
<h4>Application Control</h4>
<p>With AppleScript, you can interact with various macOS applications. You could automate tasks like sending emails, creating documents, or extracting data from spreadsheets, helping streamline your workflow.</p>
<blockquote>
<p>Example Script 2: Creating New Email in Apple Mail</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nx">tell</span><span class="w"> </span><span class="nx">application</span><span class="w"> </span><span class="s">"Mail"</span>
<span class="w"> </span><span class="nx">set</span><span class="w"> </span><span class="nx">newMessage</span><span class="w"> </span><span class="nx">to</span><span class="w"> </span><span class="nx">make</span><span class="w"> </span><span class="nx">new</span><span class="w"> </span><span class="nx">outgoing</span><span class="w"> </span><span class="nx">message</span><span class="w"> </span><span class="nx">with</span><span class="w"> </span><span class="nx">properties</span><span class="w"> </span><span class="p">{</span><span class="nx">subject</span><span class="p">:</span><span class="s">"Hello"</span><span class="p">,</span><span class="w"> </span><span class="nx">content</span><span class="p">:</span><span class="s">"Just wanted to say hi!"</span><span class="p">}</span>
<span class="w"> </span><span class="nx">tell</span><span class="w"> </span><span class="nx">newMessage</span>
<span class="w"> </span><span class="nx">make</span><span class="w"> </span><span class="nx">new</span><span class="w"> </span><span class="nx">to</span><span class="w"> </span><span class="nx">recipient</span><span class="w"> </span><span class="nx">at</span><span class="w"> </span><span class="nx">end</span><span class="w"> </span><span class="nx">of</span><span class="w"> </span><span class="nx">to</span><span class="w"> </span><span class="nx">recipients</span><span class="w"> </span><span class="nx">with</span><span class="w"> </span><span class="nx">properties</span><span class="w"> </span><span class="p">{</span><span class="nx">address</span><span class="p">:</span><span class="s">"example@email.com"</span><span class="p">}</span>
<span class="w"> </span><span class="nx">open</span>
<span class="w"> </span><span class="nx">end</span><span class="w"> </span><span class="nx">tell</span>
<span class="nx">end</span><span class="w"> </span><span class="nx">tell</span>
</code></pre></div>
<blockquote>
<p>This script automates the process of creating a new email in Apple Mail. It sets the subject and content of the email and adds a recipient, ready for you to send your message swiftly.</p>
</blockquote>
<p><a id="system-automation"></a></p>
<h4>System Automation</h4>
<p>AppleScript allows you to control system settings and perform actions like changing the display resolution, adjusting volume, or toggling Wi-Fi—all with a single script.</p>
<blockquote>
<p>Example Script 3: Adjusting Display Brightness</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"System Preferences"</span>
<span class="w"> </span><span class="nv">reveal</span><span class="w"> </span><span class="nv">anchor</span><span class="w"> </span><span class="s2">"displaysDisplayTab"</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">pane</span><span class="w"> </span><span class="nv">id</span><span class="w"> </span><span class="s2">"com.apple.preference.displays"</span>
<span class="w"> </span><span class="nv">activate</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"System Events"</span>
<span class="w"> </span><span class="nv">tell</span><span class="w"> </span><span class="nv">process</span><span class="w"> </span><span class="s2">"System Preferences"</span>
<span class="w"> </span><span class="nv">tell</span><span class="w"> </span><span class="nv">slider</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">window</span><span class="w"> </span><span class="mi">1</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">value</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="mi">75</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="nv">Change</span><span class="w"> </span><span class="nv">brightness</span><span class="w"> </span><span class="nv">level</span><span class="w"> </span><span class="ss">(</span><span class="mi">0</span><span class="o">-</span><span class="mi">100</span><span class="ss">)</span>
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"System Preferences"</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">quit</span>
</code></pre></div>
<blockquote>
<p>This script opens the Display preferences in System Preferences, adjusts the brightness slider to the desired level, and then closes System Preferences. This allows you to quickly customize your display brightness without navigating through menus.</p>
</blockquote>
<p><a id="cool-tricks-with-applescript"></a></p>
<h3>Cool Tricks with AppleScript</h3>
<p><a id="displaying-notifications"></a></p>
<h4>Displaying Notifications</h4>
<p>As discussed earlier, you can use AppleScript to display notifications on the screen. This feature is particularly useful for receiving alerts or reminders during time-sensitive tasks.</p>
<blockquote>
<p>Example Script 4: Notifying Important Task Deadlines</p>
</blockquote>
<div class="highlight"><pre><span></span><code>display notification "Don't forget to submit the report by 5 PM!" with title "Task Reminder"
</code></pre></div>
<blockquote>
<p>This script displays a notification with a reminder for an important task deadline. You can set up similar notifications for time-sensitive activities to keep you on track.</p>
</blockquote>
<p><a id="text-manipulation"></a></p>
<h4>Text Manipulation</h4>
<p>AppleScript offers powerful text manipulation capabilities. You can automate tasks such as extracting specific information from a text file, finding and replacing text across multiple documents, or formatting text according to predefined rules.</p>
<blockquote>
<p>Example Script 5: Find and Replace Text in Multiple Files</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nv">set</span><span class="w"> </span><span class="nv">searchText</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="s2">"oldText"</span>
<span class="nv">set</span><span class="w"> </span><span class="nv">replaceText</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="s2">"newText"</span>
<span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"Finder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">folderPath</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">choose</span><span class="w"> </span><span class="nv">folder</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">prompt</span><span class="w"> </span><span class="s2">"Select the folder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">fileList</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">every</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">folderPath</span>
<span class="w"> </span><span class="nv">repeat</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="nv">fileList</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">fileContents</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">read</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">as</span><span class="w"> </span>«<span class="nv">class</span><span class="w"> </span><span class="nv">utf8</span>»
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">modifiedContents</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">replaceTextInString</span><span class="ss">(</span><span class="nv">fileContents</span>,<span class="w"> </span><span class="nv">searchText</span>,<span class="w"> </span><span class="nv">replaceText</span><span class="ss">)</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">writeResult</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">write</span><span class="w"> </span><span class="nv">modifiedContents</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">as</span><span class="w"> </span>«<span class="nv">class</span><span class="w"> </span><span class="nv">utf8</span>»
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">repeat</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="nv">on</span><span class="w"> </span><span class="nv">replaceTextInString</span><span class="ss">(</span><span class="nv">textString</span>,<span class="w"> </span><span class="nv">oldText</span>,<span class="w"> </span><span class="nv">newText</span><span class="ss">)</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">AppleScript</span><span class="err">'s text item delimiters to the oldText</span>
<span class="err"> set textItems to every text item of textString</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">AppleScript</span><span class="err">'s text item delimiters to the newText</span>
<span class="err"> return textItems as text</span>
<span class="err">end replaceTextInString</span>
</code></pre></div>
<blockquote>
<p>This script prompts you to select a folder and replaces all occurrences of "oldText" with "newText" in the contents of every file within that folder. This can be useful for batch text replacements across multiple documents.</p>
</blockquote>
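<p>For comparison, the same batch find-and-replace can be sketched in a few lines of Python; the folder path and search strings are placeholders to adapt to your own setup:</p>

```python
from pathlib import Path

def replace_in_folder(folder: str, search_text: str, replace_text: str) -> int:
    """Replace every occurrence of search_text in each file of a folder.

    Returns the number of files that were modified.
    """
    changed = 0
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        contents = path.read_text(encoding="utf-8")
        modified = contents.replace(search_text, replace_text)
        if modified != contents:
            path.write_text(modified, encoding="utf-8")
            changed += 1
    return changed
```

<p>Unlike the AppleScript version, this skips subfolders and untouched files; extending it to recurse with <code>Path.rglob</code> is a one-line change.</p>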
<p><a id="gui-automation"></a></p>
<h4>GUI Automation</h4>
<p>AppleScript can simulate user interactions with graphical user interfaces (GUI). You can automate tasks that involve clicking buttons, selecting options from menus, or filling out forms in applications, saving you from repetitive manual operations.</p>
<blockquote>
<p>Example Script 6: Automating Safari Website Login</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="n">tell</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="s2">"Safari"</span>
<span class="w"> </span><span class="n">activate</span>
<span class="w"> </span><span class="n">open</span><span class="w"> </span><span class="n">location</span><span class="w"> </span><span class="s2">"https://example.com/login"</span>
<span class="w"> </span><span class="n">delay</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="n">Add</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">delay</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">page</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="nb">load</span>
<span class="n">end</span><span class="w"> </span><span class="n">tell</span>
<span class="n">tell</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="s2">"System Events"</span>
<span class="w"> </span><span class="n">tell</span><span class="w"> </span><span class="n">process</span><span class="w"> </span><span class="s2">"Safari"</span>
<span class="w"> </span><span class="n">set</span><span class="w"> </span><span class="n">frontmost</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="bp">true</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="s2">"username"</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="n">tab</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="s2">"password"</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="k">return</span>
<span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">tell</span>
<span class="n">end</span><span class="w"> </span><span class="n">tell</span>
</code></pre></div>
<blockquote>
<p>This script automates the process of opening a specific website in Safari, entering a username and password, and submitting the login form. You can adapt this script to automate various web-based actions.
<a id="alternatives-to-applescript"></a></p>
</blockquote>
<h3>Alternatives to AppleScript</h3>
<p>While AppleScript is a robust tool, other alternatives can also help achieve automation and productivity on macOS:</p>
<p><a id="automator"></a></p>
<h4>Automator</h4>
<p><img alt="automator logo" src="https://help.apple.com/assets/61E87B255FBFB2628709732E/61E87B275FBFB26287097336/en_GB/573f95d708cbb258343f5c78cc439bcb.png">
<a href="https://support.apple.com/en-gb/guide/automator/welcome/mac">Automator</a> is a visual automation tool built into macOS. It provides a drag-and-drop interface to create workflows without writing code. Automator supports a wide range of actions and can be an excellent choice for users who prefer a more intuitive approach.
<a id="hammerspoon"></a></p>
<h4>Hammerspoon</h4>
<p><img alt="Hammerspoon logo" src="https://www.hammerspoon.org/images/hammerspoon.png">
<a href="https://www.hammerspoon.org/">Hammerspoon</a> is a powerful automation tool that uses the Lua scripting language. It offers extensive customization and control over macOS, allowing users to create complex workflows and automation routines.</p>
<p><a id="keyboard-maestro"></a></p>
<h4>Keyboard Maestro</h4>
<p><img alt="Keyboard Maestro logo" src="https://www.keyboardmaestro.com/img/keyboardmaestro-64.png">
<a href="https://www.keyboardmaestro.com/main/">Keyboard Maestro</a> is a comprehensive automation tool that focuses on keyboard and mouse automation. It provides a user-friendly interface to create macros, trigger actions based on specific events, and automate repetitive tasks efficiently.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>AppleScript is a versatile tool for increasing productivity and automating tasks on macOS. Its ability to control applications and system settings, and to perform a wide range of actions, makes it a valuable asset. Cool tricks like displaying notifications and GUI automation enhance the experience further. Alternatives like Automator, Hammerspoon, and Keyboard Maestro offer different approaches to automation, catering to diverse user preferences. Explore these tools and find the one that best fits your workflow to unlock new levels of productivity and efficiency on your Mac.</p>
<h2>References</h2>
<ul>
<li><a href="https://developer.apple.com/library/archive/documentation/AppleScript/Conceptual/AppleScriptLangGuide/introduction/ASLR_intro.html">Introduction to AppleScript Language Guide</a></li>
<li><a href="https://ss64.com/osx/osascript.html">osascript Man Page - macOS - SS64.com</a></li>
<li><a href="https://ss64.com/osx/osacompile.html">osacompile</a> - Compile AppleScripts and other OSA language scripts.</li>
</ul>Display a Notification on the Screen in macOS2023-07-10T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-10:/display-a-notification-on-the-screen-in-macos/<p>To display a notification on the screen near the menu bar in macOS using the terminal, you can make use of the <code>osascript</code> command to execute AppleScript code. Here's an example command you can run:</p>
<div class="highlight"><pre><span></span><code>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'display notification "Hello, World!" with …</span></code></pre></div><p>To display a notification on the screen near the menu bar in macOS using the terminal, you can make use of the <code>osascript</code> command to execute AppleScript code. Here's an example command you can run:</p>
<div class="highlight"><pre><span></span><code>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'display notification "Hello, World!" with title "Notification"'</span>
</code></pre></div>
<p>This command will display a notification with the message "Hello, World!" and the title "Notification" near the menu bar on macOS.</p>
<p>You can customize the message and title by modifying the strings inside the double quotes in the <code>osascript</code> command.</p>
<p>Note that starting from macOS Big Sur (11.0), AppleScript-based notifications require user authorization. The first time you run this command, you will be prompted to grant permission to Terminal (or whichever application you are using) to send notifications.</p>
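<p>To trigger the same notification from Python, one option is to shell out to <code>osascript</code> with the <code>subprocess</code> module. The sketch below builds the argument list, escaping backslashes and double quotes so arbitrary messages survive inside the AppleScript string literal; actually running it is, of course, macOS-only:</p>

```python
import subprocess

def build_notification_args(message: str, title: str) -> list[str]:
    """Build an osascript argument list for `display notification`.

    Backslashes and double quotes are escaped so they survive inside
    the AppleScript string literal.
    """
    def esc(s: str) -> str:
        return s.replace("\\", "\\\\").replace('"', '\\"')

    script = f'display notification "{esc(message)}" with title "{esc(title)}"'
    return ["osascript", "-e", script]

def notify(message: str, title: str = "Notification") -> None:
    # macOS only; the first run prompts for notification permission.
    subprocess.run(build_notification_args(message, title), check=True)
```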
<h2>Reading</h2>
<ul>
<li><a href="https://victorscholz.medium.com/what-is-osascript-e48f11b8dec6">What is Osascript?. Learning about Osascript started with… | by Victor Scholz | Medium</a></li>
<li><a href="https://ss64.com/osx/osascript.html">osascript Man Page - macOS - SS64.com</a></li>
</ul>Software Versioning Schemes2023-07-08T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-08:/software-versioning-schemes/<h2>Introduction</h2>
<p>Software versioning schemes are essential in the world of programming, as they help developers, users, and collaborators keep track of various versions of a software product. A proper versioning scheme enables easy identification of the current release, the changes made in …</p><h2>Introduction</h2>
<p>Software versioning schemes are essential in the world of programming, as they help developers, users, and collaborators keep track of various versions of a software product. A proper versioning scheme enables easy identification of the current release, the changes made in each version, and the compatibility of a version with previous ones. In this blog post, we will discuss some of the most popular versioning schemes used in the software industry, along with a few lesser-known but useful alternatives. </p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#semantic-versioning">Semantic Versioning</a></li>
<li><a href="#calendar-versioning-calver">Calendar Versioning (CalVer)</a></li>
<li><a href="#zerover-0-based-versioning">ZeroVer: 0-based Versioning</a></li>
<li><a href="#lesser-known-versioning-schemes">Lesser-known Versioning Schemes</a><ul>
<li><a href="#romantic-versioning">Romantic Versioning</a></li>
<li><a href="#hash-based-versioning">Hash-based Versioning</a></li>
<li><a href="#custom-versioning-schemes">Custom Versioning Schemes</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="semantic-versioning"></a></p>
<h2>Semantic Versioning (SemVer)</h2>
<p>Semantic versioning, also known as <a href="https://semver.org/">SemVer</a>, is a widely adopted versioning scheme that emphasizes the importance of clear and meaningful version numbers. In SemVer, a version number consists of three parts: MAJOR.MINOR.PATCH. Each part represents the following: </p>
<ul>
<li>MAJOR version: incremented when you make incompatible API changes, </li>
<li>MINOR version: incremented when you add functionality in a backwards-compatible manner, and </li>
<li>PATCH version: incremented when you make backwards-compatible bug fixes. </li>
</ul>
<p>In addition to these three parts, SemVer allows for additional labels for pre-release and build metadata as extensions to the MAJOR.MINOR.PATCH format. This makes it easier for developers to communicate the scope of changes in each release and helps users understand if an update will break their existing setup or not. </p>
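<p>To make the precedence rules concrete, here is a minimal Python sketch that parses and compares bare MAJOR.MINOR.PATCH versions. It deliberately ignores the pre-release and build-metadata extensions mentioned above (under full SemVer, for example, 1.0.0-alpha precedes 1.0.0):</p>

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse a bare MAJOR.MINOR.PATCH string into a comparable tuple."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element-wise, matching SemVer precedence for the core triple.
assert parse_semver("1.10.0") > parse_semver("1.9.3")
assert parse_semver("2.0.0") > parse_semver("1.99.99")
```

<p>Because Python compares tuples element-wise, the core precedence rules fall out for free; pre-release tags would need extra logic.</p>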
<p><a id="calendar-versioning-calver"></a></p>
<h2>Calendar Versioning (CalVer)</h2>
<p>Another popular versioning scheme is Calendar Versioning or <a href="https://calver.org/">CalVer</a>. CalVer uses a combination of the release date and a project-specific version number to create a unique identifier for each release. The format typically looks like this: YYYY.MM.DD.MICRO. </p>
<p>The advantages of CalVer include its simplicity and the ability to quickly determine the age of a release. However, unlike SemVer, CalVer does not provide explicit information about API changes or compatibility between versions. </p>
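<p>To illustrate, a date-based identifier in the YYYY.MM.DD.MICRO shape can be generated directly from a release date; the <code>micro</code> argument here is a hypothetical per-day release counter:</p>

```python
from datetime import date

def calver(release_date: date, micro: int = 0) -> str:
    """Build a YYYY.MM.DD.MICRO identifier from a release date."""
    return f"{release_date.year}.{release_date.month:02d}.{release_date.day:02d}.{micro}"

print(calver(date(2023, 7, 8)))  # 2023.07.08.0
```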
<p><a id="zerover-0-based-versioning"></a></p>
<h2>ZeroVer: 0-based Versioning (0ver)</h2>
<p>ZeroVer, also known as <a href="https://0ver.org/">0ver</a> is a unique and simple versioning scheme that asserts that your software's major version should never exceed the first and most important number in computing: zero. This is in contrast to other versioning schemes like Semantic Versioning and Calendar Versioning. </p>
<p>The rationale behind ZeroVer is that software is never truly "finished" and will always be subject to improvements, bug fixes, and new features. By keeping the major version at zero, developers acknowledge the ever-evolving nature of their software and avoid the pressures associated with "final" releases. </p>
<blockquote>
<p>Note: <code>0ver</code> starts with the digit zero, not the capital letter O.</p>
</blockquote>
<p><a id="lesser-known-versioning-schemes"></a></p>
<h2>Lesser-known Versioning Schemes</h2>
<p>In addition to the popular versioning schemes mentioned above, there are other lesser-known but equally useful alternatives. Some of these include: </p>
<p><a id="romantic-versioning"></a></p>
<h3>Romantic Versioning</h3>
<p><a href="https://github.com/romversioning/romver">Romantic Versioning</a> is a light-hearted, informal versioning scheme that uses popular culture references or personal milestones as version numbers. While not suitable for all projects, Romantic Versioning can be a fun way to engage users and make software updates more memorable. </p>
<p>The Romantic Versioning specification was authored by <a href="http://blog.legacyteam.info/2015/12/romver-romantic-versioning/">Daniel V from the Legacy Blog crew</a> in 2015. The open, public repository exists to maintain the specification, keep it visible, and enable cooperation with others.</p>
<p>See also: <a href="http://sentimentalversioning.org/">sentimentalversioning.org</a></p>
<p><a id="hash-based-versioning"></a></p>
<h3>Hash-based Versioning</h3>
<p><a href="https://miniscruff.github.io/hashver/">Hash-based Versioning</a>is a versioning scheme that uses the unique hash of a particular commit in a version control system (such as Git) as the version number. This approach ensures that each release is directly tied to a specific point in the development history, making it easy to track and revert changes if needed. </p>
<p><a id="custom-versioning-schemes"></a></p>
<h3>Custom Versioning Schemes</h3>
<p>Some projects may benefit from a custom versioning scheme tailored to their specific needs. This could involve combining elements from various existing schemes or developing an entirely new approach. When creating a custom versioning scheme, it's essential to ensure that it is clear, consistent, and easy to understand for all stakeholders. </p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Choosing the right versioning scheme for your software project is crucial for effective communication and collaboration among developers, users, and other stakeholders. While Semantic Versioning and Calendar Versioning are popular choices, alternative schemes like ZeroVer, Romantic Versioning, Hash-based Versioning, or even custom schemes can also be appropriate depending on your project's unique requirements. Ultimately, the ideal versioning scheme should be easy to understand, provide meaningful information about each release, and facilitate the management of software updates.</p>How to install Faiss on Google Colab2023-07-04T00:00:00+02:002023-07-04T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-04:/how-to-install-faiss-on-google-colab/
<p>To install <a href="https://github.com/facebookresearch/faiss">faiss</a> on Colab use:</p>
<div class="highlight"><pre><span></span><code><span class="sx">!pip install faiss-cpu --no-cache</span>
</code></pre></div>Easy Text Vectorization With VectorHub and Sentence Transformers2023-07-04T00:00:00+02:002023-07-04T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-04:/text-vectorization-with-vectorhub-and-sentence-transformers/<p>Learn how to use Sentence Transformers for text vectorization with different models using consistent API.</p><p>Text is heavily inspired by part of the e-book: <a href="https://learn.getvectorai.com/vector-ai-documentation/semantic-nlp-search-with-faiss-and-vectorhub">Semantic NLP search with FAISS and VectorHub - Guide To Vectors (getvectorai.com)</a> - which was using VectorHub as an interface to the models.</p>
<blockquote>
<p><strong>NOTE</strong>: VectorHub is deprecated and no longer maintained. The authors of VectorHub recommend using <a href="https://www.sbert.net/">Sentence Transformers</a>, TFHub, and Huggingface directly for text vectorization.</p>
</blockquote>
<p>This article demonstrates a similar process as the original article but uses a sentence transformers package.</p>
<h3>Encoding Data Using Sentence Transformers</h3>
<p>To encode models easily, we will utilize the <a href="https://www.sbert.net/">Sentence Transformers</a> library. SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. It provides a variety of pre-trained models that can convert sentences into meaningful numerical representations.</p>
<p>First, we need to install the <code>sentence-transformers</code> package, which includes the necessary dependencies for using Sentence Transformers. This library offers a wide range of pre-trained models, such as <a href="https://en.wikipedia.org/wiki/BERT_(Language_model)">BERT</a>, <a href="https://huggingface.co/docs/transformers/model_doc/roberta">RoBERTa</a>, and <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">MiniLM</a>, that can be used for text encoding. More information about Sentence Transformers can be found <a href="https://www.sbert.net/">here</a>.</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>sentence-transformers
</code></pre></div>
<p>Next, we will instantiate our model and start the encoding process. In this example, we will use the "all-MiniLM-L6-v2" model, which is a variant of the MiniLM model.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s1">'all-MiniLM-L6-v2'</span><span class="p">)</span>
<span class="c1"># Sentences to be encoded</span>
<span class="n">sentences</span> <span class="o">=</span> <span class="p">[</span>
<span class="s1">'This framework generates embeddings for each input sentence'</span><span class="p">,</span>
<span class="s1">'Sentences are passed as a list of strings.'</span><span class="p">,</span>
<span class="s1">'The quick brown fox jumps over the lazy dog.'</span>
<span class="p">]</span>
<span class="c1"># Encode sentences using the Sentence Transformers model</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span>
<span class="c1"># Print the embeddings</span>
<span class="k">for</span> <span class="n">sentence</span><span class="p">,</span> <span class="n">embedding</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sentences</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Sentence:"</span><span class="p">,</span> <span class="n">sentence</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Embedding:"</span><span class="p">,</span> <span class="n">embedding</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span>
</code></pre></div>
<p>In the code snippet above, we begin by installing the <code>sentence-transformers</code> package, which provides the necessary tools for working with Sentence Transformers. This library offers various pre-trained models that can convert sentences into meaningful vector representations.</p>
<p>After the installation, we import the <code>SentenceTransformer</code> class from the <code>sentence_transformers</code> module. We instantiate the model using the <code>all-MiniLM-L6-v2</code> variant, which will be used for encoding our sentences.</p>
<p>We define a list of sentences that we want to encode using the Sentence Transformers model. In this case, we have three exemplary sentences: "This framework generates embeddings for each input sentence," "Sentences are passed as a list of strings," and "The quick brown fox jumps over the lazy dog."</p>
<p>To perform the encoding, we use the <code>encode</code> method of the <code>model</code> object, passing in the <code>sentences</code> list. This method returns the corresponding embeddings for each sentence, which we store in the <code>embeddings</code> variable.</p>
<p>Finally, we iterate over the <code>sentences</code> and <code>embeddings</code> lists together using <code>zip</code>. For each sentence and its corresponding embedding, we print them out to visualize the results.</p>
<p>Please note that the code snippet above uses the "all-MiniLM-L6-v2" model as an example. You can explore and choose from a wide range of models provided by Sentence Transformers according to your specific requirements.</p>
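<p>Once you have the embeddings, a common next step is to compare them. Here is a dependency-free cosine-similarity sketch on plain Python lists; with real Sentence Transformers output you would pass the returned vectors directly (or use the library's own <code>util.cos_sim</code> helper):</p>

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings:
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```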
<h2>References</h2>
<ul>
<li><a href="https://github.com/RelevanceAI/vectorhub">GitHub - RelevanceAI/vectorhub: Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)</a></li>
<li><a href="https://learn.getvectorai.com/">Introduction - Guide To Vectors</a></li>
</ul>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Introducing a Python Module for Splitting Text Into Parts Based on Token Limit2023-07-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-03:/token-split-text/<h2>Introduction</h2>
<p>In the realm of natural language processing and text analysis, it is often necessary to split a large piece of text into smaller parts while ensuring that the split does not break words or disrupt the meaning of the text. This …</p><h2>Introduction</h2>
<p>In the realm of natural language processing and text analysis, it is often necessary to split a large piece of text into smaller parts while ensuring that the split does not break words or disrupt the meaning of the text. This task can be challenging, especially when dealing with tokenization. However, with the help of the Tiktoken library and a custom Python module, splitting text based on a specified number of tokens becomes straightforward.</p>
<h2>Understanding the Tiktoken Library</h2>
<p>Tiktoken is a fast <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">BPE</a> tokenizer built for use with OpenAI's models. Tokenization is the process of splitting text into individual tokens such as words or subwords. The library ships with several encodings and helper functions that let developers process text in tokenized form, and its support for different languages and tokenization models makes it a versatile tool for a wide range of text processing tasks.</p>
<h2>Introducing the Python Module: split_string_with_limit</h2>
<p>The provided Python module: <a href="https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39">split_string_with_limit.py</a> (GitHub Gist), leverages the capabilities of the Tiktoken library to split a string into parts with a specified limit on the number of tokens per part. The module takes three parameters: <code>text</code>, <code>limit</code>, and <code>encoding</code>.</p>
<ul>
<li><code>text</code>: The input string that needs to be split.</li>
<li><code>limit</code>: The maximum number of tokens allowed per part.</li>
<li><code>encoding</code>: The tokenization encoding to be used, which determines how the text is tokenized.</li>
</ul>
<p>The module proceeds as follows:</p>
<ol>
<li>It tokenizes the input text using the specified encoding from Tiktoken.</li>
<li>It creates an empty list, <code>parts</code>, to store the tokenized parts.</li>
<li>It initializes a <code>current_part</code> list and a <code>current_count</code> variable to keep track of the tokens in the current part.</li>
<li>It iterates over each token in the tokenized text.</li>
<li>For each token, it appends it to the <code>current_part</code> list and increments the <code>current_count</code> by 1.</li>
<li>If the <code>current_count</code> exceeds the specified limit, it adds the <code>current_part</code> to the <code>parts</code> list, resets the <code>current_part</code> and <code>current_count</code> to empty values, and continues with the next tokens.</li>
<li>Once all the tokens have been processed, the module checks if there is any remaining <code>current_part</code> and adds it to the <code>parts</code> list.</li>
<li>Finally, it converts each tokenized part back into text format by decoding the individual tokens and joins them together. The resulting text parts are stored in the <code>text_parts</code> list.</li>
<li>The module returns the <code>text_parts</code> list as the output.</li>
</ol>
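<p>The steps above can be sketched as follows. To keep the example dependency-free, a simple whitespace tokenizer stands in for a Tiktoken encoding; with Tiktoken you would call <code>encoding.encode</code> and <code>encoding.decode</code> instead of <code>split</code> and <code>join</code>:</p>

```python
def split_tokens_with_limit(tokens: list[str], limit: int) -> list[list[str]]:
    """Group a token sequence into parts of at most `limit` tokens each."""
    parts: list[list[str]] = []
    current_part: list[str] = []
    for token in tokens:
        current_part.append(token)
        if len(current_part) == limit:
            parts.append(current_part)
            current_part = []
    if current_part:  # leftover tokens form the last part
        parts.append(current_part)
    return parts

def split_string_with_limit(text: str, limit: int) -> list[str]:
    # Stand-in tokenizer: whitespace split; "decoding" is a plain join.
    parts = split_tokens_with_limit(text.split(), limit)
    return [" ".join(part) for part in parts]
```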
<h2>Example Usage</h2>
<p>To demonstrate the usage of the <code>split_string_with_limit</code> module, let's consider an example:</p>
<div class="highlight"><pre><span></span><code>python<span class="w"> </span>split_string_with_limit.py<span class="w"> </span>input_file.txt<span class="w"> </span><span class="m">100</span><span class="w"> </span>cl100k_base
</code></pre></div>
<p>In this example, we provide three arguments:</p>
<ol>
<li><code>input_file.txt</code>: The path to the input text file that contains the text to be split.</li>
<li><code>100</code>: The maximum number of tokens allowed per part. You can adjust this value based on your requirements.</li>
<li><code>cl100k_base</code>: The encoding name. This determines how the text will be tokenized. Tiktoken provides various encoding options, and you can experiment with different encodings to achieve the desired results.</li>
</ol>
<p>The module reads the text from the input file, tokenizes it using the specified encoding, and splits it into parts based on the token limit. The resulting text parts are then printed in a JSON format, providing a structured representation of the split text.</p>
<h2>Approximate approach</h2>
<p>While the <code>split_string_with_limit</code> module offers a convenient solution for splitting text based on a token limit, alternative approaches can achieve similar results. One such approach is a <strong>fixed-length split</strong>: instead of splitting based on the number of tokens, we can split the text into fixed-length segments by counting words or characters. One can use the <a href="https://platform.openai.com/tokenizer">rule of thumb</a>:</p>
<blockquote>
<p>A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).</p>
</blockquote>
<p>to approximate a split into parts of equal length without doing any actual tokenization.</p>
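<p>Applying that rule of thumb, an approximate fixed-length split needs no tokenizer at all:</p>

```python
def approx_split(text: str, token_limit: int, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks of roughly `token_limit` tokens,
    using the ~4-characters-per-token heuristic for English text."""
    chunk_size = token_limit * chars_per_token
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

<p>Unlike the token-based module, this can cut a word in half at a chunk boundary; snapping each cut to the nearest whitespace is an easy refinement.</p>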
<h2>Conclusion</h2>
<p>In this blog post, we introduced the <code>split_string_with_limit</code> Python module, which leverages the power of the Tiktoken library to split a string into parts based on a specified token limit. We discussed the functionality of the module, its parameters, and how it can be used in practice. Furthermore, we explored alternative algorithms and approaches for splitting text based on the number of tokens. By combining the flexibility of Tiktoken and the convenience of the <code>split_string_with_limit</code> module, developers can efficiently process and analyze text data without compromising on accuracy or readability.</p>Demystifying Perplexity - Assessing Dimensionality Reduction With PCA2023-06-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-30:/demystifying-perplexity-assessing-dimensionality-reduction-with-pca/<p>Related: <a href="https://www.safjan.com/measure-quality-of-embeddings-intrinsic-vs-extrinsic/">Intrinsic vs. Extrinsic Evaluation - What's the Best Way to Measure Embedding Quality?</a></p>
<p>Perplexity is a measure commonly used in machine learning, particularly in the field of dimensionality reduction techniques such as Principal Component Analysis (PCA). It provides a way to evaluate and compare the quality of dimensionality reduction algorithms by quantifying how well they preserve the structure of the original data.</p>
<p>In this blog post, we will delve into the concept of perplexity, its application in PCA, and its importance in assessing the performance of dimensionality reduction methods. We will also provide code examples in Python to demonstrate its practical implementation.</p>
<h2>Understanding Perplexity</h2>
<p>Perplexity is a measure originally developed for evaluating probabilistic models, particularly in the field of natural language processing. It represents the level of uncertainty or confusion in predicting the next item in a sequence. In the context of dimensionality reduction, perplexity provides an estimation of the number of nearest neighbors that should be considered when reconstructing a data point in a lower-dimensional space.</p>
<p>Given a high-dimensional dataset, PCA aims to find a lower-dimensional representation that captures the most significant features or patterns of the original data. The idea behind perplexity is to preserve the local structure of the data, ensuring that neighboring points in the high-dimensional space remain close to each other in the reduced space.</p>
<h2>Perplexity in PCA</h2>
<p>To understand how perplexity is used in PCA, let's consider a high-dimensional dataset with 𝑁 data points. PCA involves projecting this dataset onto a lower-dimensional space while retaining the maximum amount of variance. The reduced dataset can be represented by 𝑀 principal components, where 𝑀 is smaller than the dimensionality of the original data.</p>
<p>In PCA, the perplexity of a data point 𝑥𝑖 is calculated based on the conditional probability distribution of its neighbors given a particular variance or similarity scale. This distribution can be modeled using a Gaussian kernel centered at 𝑥𝑖:</p>
<div class="math">$$
P(\mathbf{y}_j|\mathbf{x}_i) = \frac{{\exp\left(-\frac{{\|\mathbf{y}_j - \mathbf{x}_i\|^2}}{{2\sigma_i^2}}\right)}}{{\sum_{k\neq i}\exp\left(-\frac{{\|\mathbf{y}_k - \mathbf{x}_i\|^2}}{{2\sigma_i^2}}\right)}}
$$</div>
<p>Here, 𝑃(𝑦𝑗|𝑥𝑖) represents the conditional probability of observing data point 𝑦𝑗 as a neighbor of 𝑥𝑖 in the lower-dimensional space. The variance or similarity scale 𝜎𝑖 determines the spread of the Gaussian kernel for each data point 𝑥𝑖.</p>
<p>The perplexity of 𝑥𝑖, denoted as 𝑃𝑖, is then defined as the Shannon entropy of the conditional distribution:</p>
<div class="math">$$
P_i = 2^{-\sum_j P(\mathbf{y}_j|\mathbf{x}_i)\log_2 P(\mathbf{y}_j|\mathbf{x}_i)}
$$</div>
<p>In practice, finding the optimal variance scale 𝜎𝑖 that results in the desired perplexity can be challenging. One common approach is to perform a binary search to find the value of 𝜎𝑖 that achieves a target perplexity value. The binary search is performed by iteratively adjusting the value of 𝜎𝑖 until the entropy of the conditional distribution matches the target perplexity.</p>
<h2>Evaluating Dimensionality Reduction with Perplexity</h2>
<p>Perplexity is a crucial metric for evaluating the performance of dimensionality reduction techniques, including PCA. By preserving the local structure of the data, a good dimensionality reduction method should ensure that neighboring points remain close to each other in the lower-dimensional space.</p>
<p>To evaluate the effectiveness of a dimensionality reduction algorithm, we can compare the perplexity of the original high-dimensional data with the perplexity of the reduced data. If the perplexity remains similar after dimensionality reduction, it suggests that the algorithm successfully preserves the local structure of the data.</p>
<p>In practice, perplexity is often used in conjunction with other evaluation metrics, such as visualization techniques like t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a nonlinear dimensionality reduction method that can be used to visualize high-dimensional data in two or three dimensions while preserving the local structure. By comparing the perplexity of t-SNE embeddings with the perplexity of the original data, we can gain insights into the quality of the dimensionality reduction.</p>
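<p>As an illustration, scikit-learn's t-SNE implementation exposes perplexity directly as a hyperparameter (the data below is random and purely illustrative):</p>

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(200, 50)  # 200 points in 50 dimensions

# perplexity roughly sets the effective number of neighbors per point;
# it must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)
```

<p>Typical perplexity values lie between 5 and 50, and the resulting embedding can change noticeably across that range.</p>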
<h2>Implementation in Python</h2>
<p>Let's now demonstrate the calculation of perplexity and its application in evaluating dimensionality reduction using PCA in Python. We will use the scikit-learn library, which provides a simple and efficient implementation of PCA and other machine learning algorithms.</p>
<div class="highlight"><pre><span></span><code>import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances


def perplexity(X, perplexity_value, n_iter=50, tol=1e-5):
    """Return the per-point perplexity 2**H(P_i) after a binary search
    for each point's kernel precision beta_i = 1 / (2 * sigma_i**2)."""
    N = X.shape[0]
    distances = pairwise_distances(X, metric='euclidean', squared=True)
    perplexities = np.zeros(N)
    for i in range(N):
        beta = 1.0
        beta_min, beta_max = -np.inf, np.inf
        for _ in range(n_iter):
            # Gaussian affinities to all other points (self excluded)
            affinities = np.exp(-distances[i] * beta)
            affinities[i] = 0.0
            p = affinities / np.sum(affinities)
            # Shannon entropy of the conditional distribution P(. | x_i)
            entropy = -np.sum(p[p &gt; 0] * np.log2(p[p &gt; 0]))
            perplexity_diff = entropy - np.log2(perplexity_value)
            if np.abs(perplexity_diff) &lt; tol:
                break
            if perplexity_diff &gt; 0:
                # Entropy too high: narrow the kernel (increase beta)
                beta_min = beta
                beta = beta * 2 if beta_max == np.inf else (beta + beta_max) / 2
            else:
                # Entropy too low: widen the kernel (decrease beta)
                beta_max = beta
                beta = beta / 2 if beta_min == -np.inf else (beta + beta_min) / 2
        perplexities[i] = 2 ** entropy
    return perplexities


# Generate random high-dimensional data
N = 1000
X = np.random.randn(N, 100)

# Apply PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Calculate per-point perplexity of the original and the reduced data
original_perplexity = perplexity(X, perplexity_value=30)
reduced_perplexity = perplexity(X_reduced, perplexity_value=30)

print("Perplexity of original data:", np.mean(original_perplexity))
print("Perplexity of reduced data:", np.mean(reduced_perplexity))
</code></pre></div>
<p>In the above example, we generate a random high-dimensional dataset using NumPy and apply PCA to reduce its dimensionality to 2. We then calculate the perplexity of the original data and the reduced data using the <code>perplexity</code> function. Finally, we print the mean perplexity values for comparison.</p>
<p>By examining the perplexity values, we can gain insights into how well PCA preserves the local structure of the data. If the perplexity of the reduced data is close to the perplexity of the original data, it suggests that PCA successfully retains the essential information during dimensionality reduction.</p>
<h2>Conclusion</h2>
<p>In this blog post, we explored the concept of perplexity in the context of dimensionality reduction, specifically in PCA. Perplexity provides a measure of the level of uncertainty or confusion in predicting the neighbors of a data point in a lower-dimensional space. It is used to assess the quality of dimensionality reduction algorithms by evaluating how well they preserve the local structure of the data.</p>
<p>We also provided a Python implementation to calculate perplexity and demonstrated its application in evaluating dimensionality reduction using PCA. By comparing the perplexity of the original data with the perplexity of the reduced data, we can assess the effectiveness of PCA in preserving the essential information.</p>
<p>Perplexity is a valuable tool in the evaluation and comparison of dimensionality reduction methods. It offers insights into the preservation of the local structure and can guide the selection of appropriate techniques for different datasets and applications.</p>
<p>See also:
<a href="https://distill.pub/2016/misread-tsne/">How to Use t-SNE Effectively</a></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Understanding Bhattacharyya Distance and Coefficient for Probability Distributions2023-06-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-30:/understanding-bhattacharyya-distance-and-coefficient-for-probability-distributions/<h1>Introduction</h1>
<p>In the fields of statistics, machine learning, and data science, measuring the similarity between probability distributions is crucial for various tasks. One commonly used measure for this purpose is the <a href="https://en.wikipedia.org/wiki/Bhattacharyya_distance">Bhattacharyya distance</a>, which quantifies the dissimilarity between two distributions. The Bhattacharyya coefficient, on the other hand, provides a measure of the overlap between two statistical samples or populations. In this blog post, we will delve into the concepts of Bhattacharyya distance and coefficient, discuss their applications, and provide Python code examples for better understanding.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#bhattacharyya-distance">Bhattacharyya Distance</a></li>
<li><a href="#bhattacharyya-coefficient">Bhattacharyya Coefficient</a></li>
<li><a href="#applications-of-bhattacharyya-distance-and-coefficient">Applications of Bhattacharyya Distance and Coefficient</a></li>
<li><a href="#python-implementation">Python Implementation</a></li>
<li><a href="#other-metrics">Other metrics</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="bhattacharyya-distance"></a></p>
<h2>Bhattacharyya Distance</h2>
<p>The Bhattacharyya distance is a statistical measure that quantifies the similarity between two probability distributions. It is named after Anil Kumar Bhattacharyya, an Indian statistician who worked at the Indian Statistical Institute. The distance is defined for continuous probability distributions and is based on the Bhattacharyya coefficient (which we will discuss later).</p>
<p>Mathematically, the Bhattacharyya distance between two continuous probability density functions (PDFs) or discrete probability mass functions (PMFs) is defined as follows:</p>
<div class="math">$$
D_B(P,Q) = -\ln \left( BC(P,Q) \right) = -\ln \left( \sum_{i} \sqrt{P(i)Q(i)} \right)
$$</div>
<p>where \(P\) and \(Q\) are the probability distributions being compared, \(P(i)\) and \(Q(i)\) represent the probabilities of occurrence for the \(i\)-th event or sample, and \(BC(P,Q)\) denotes the Bhattacharyya coefficient.</p>
<p>The Bhattacharyya distance ranges from 0 to infinity, where 0 indicates perfect similarity and higher values indicate increasing dissimilarity. It is important to note that the Bhattacharyya distance is a symmetric measure, meaning \(D_B(P,Q) = D_B(Q,P)\).</p>
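<p>As a concrete reference case, when the two distributions are univariate normals \(P = \mathcal{N}(\mu_P, \sigma_P^2)\) and \(Q = \mathcal{N}(\mu_Q, \sigma_Q^2)\), the distance has a well-known closed form:</p>

```latex
D_B(P,Q) = \frac{1}{4}\ln\left(\frac{1}{4}\left(\frac{\sigma_P^2}{\sigma_Q^2} + \frac{\sigma_Q^2}{\sigma_P^2} + 2\right)\right) + \frac{1}{4}\,\frac{(\mu_P - \mu_Q)^2}{\sigma_P^2 + \sigma_Q^2}
```

<p>The first term grows when the variances differ and the second when the means separate; the distance vanishes only when the two Gaussians coincide.</p>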
<p><a id="bhattacharyya-coefficient"></a></p>
<h2>Bhattacharyya Coefficient</h2>
<p>The Bhattacharyya coefficient is a measure of overlap between two statistical samples or populations. It quantifies the similarity between two probability distributions and is often used as a precursor to computing the Bhattacharyya distance.</p>
<p>Mathematically, the Bhattacharyya coefficient between two discrete probability distributions can be calculated as follows:</p>
<div class="math">$$
BC(P,Q) = \sum_{i} \sqrt{P(i)Q(i)}
$$</div>
<p>For continuous probability distributions, the Bhattacharyya coefficient can be expressed as an integral:</p>
<div class="math">$$
BC(P,Q) = \int \sqrt{p(x) q(x)} \, dx
$$</div>
<p>where \(p(x)\) and \(q(x)\) represent the probability density functions (PDFs) of distributions \(P\) and \(Q\), respectively.</p>
<p>The Bhattacharyya coefficient ranges from 0 to 1, where 1 indicates complete overlap and 0 indicates no overlap. The coefficient measures the similarity of two distributions by taking into account the square root of the product of their probabilities at corresponding events or samples.</p>
<p><a id="applications-of-bhattacharyya-distance-and-coefficient"></a></p>
<h2>Applications of Bhattacharyya Distance and Coefficient</h2>
<ol>
<li>
<p>Pattern recognition: Bhattacharyya distance is often used to compare feature vectors or histograms in pattern recognition tasks. It helps in identifying similarities or dissimilarities between different classes or clusters of data.</p>
</li>
<li>
<p>Image processing: Bhattacharyya distance can be used to compare image histograms, aiding in tasks such as image segmentation, object recognition, and image retrieval.</p>
</li>
<li>
<p>Document classification: Bhattacharyya distance can be employed to measure the similarity between document feature vectors, enabling document clustering and categorization.</p>
</li>
</ol>
<p><a id="python-implementation"></a></p>
<h2>Python Implementation</h2>
<p>To demonstrate the computation of Bhattacharyya distance and coefficient, we will use the SciPy library in Python.</p>
<p>Let's assume we have two discrete probability distributions, \(P\) and \(Q\), represented as arrays.</p>
<div class="highlight"><pre><span></span><code>import numpy as np

# Probability distributions
P = np.array([0.2, 0.3, 0.1, 0.4])
Q = np.array([0.25, 0.15, 0.2, 0.4])

# Bhattacharyya coefficient: BC(P, Q) = sum_i sqrt(P(i) * Q(i))
bc = np.sum(np.sqrt(P * Q))

# Bhattacharyya distance: D_B(P, Q) = -ln(BC(P, Q))
db = -np.log(bc)

print(f"Bhattacharyya Distance: {db:.4f}")
print(f"Bhattacharyya Coefficient: {bc:.4f}")
</code></pre></div>
<p>Output:</p>
<div class="highlight"><pre><span></span><code>Bhattacharyya Distance: 0.0231
Bhattacharyya Coefficient: 0.9772
</code></pre></div>
<p>SciPy does not provide a built-in Bhattacharyya function, so in the snippet above we compute the coefficient directly with NumPy as the sum of \(\sqrt{P(i)Q(i)}\) over all events, and obtain the distance as its negative natural logarithm. The resulting values are printed, providing the measure of dissimilarity and overlap, respectively.</p>
<p><a id="other-metrics"></a></p>
<h2>Other metrics</h2>
<p>The Bhattacharyya distance metric has both similarities and differences compared to other related distance metrics used in statistics, machine learning, and data science. Let's explore the similarities and differences with some commonly used distance metrics.</p>
<table>
<thead>
<tr>
<th>Distance Metric</th>
<th>Similarities</th>
<th>Differences</th>
</tr>
</thead>
<tbody>
<tr>
<td>Euclidean</td>
<td>- Applicable to both continuous and discrete data.</td>
<td>- Measures geometric distance between points in a multi-dimensional space.<br>- Does not consider probability information of the data.</td>
</tr>
<tr>
<td>Manhattan</td>
<td>- Similar to Euclidean, applicable to both continuous and discrete data.</td>
<td>- Measures distance between points as the sum of absolute differences in their coordinates.<br>- Does not consider probability distributions.<br>- Not suitable for measuring similarity between distributions.</td>
</tr>
<tr>
<td>Kullback-Leibler</td>
<td>- Measures dissimilarity between probability distributions.</td>
<td>- Quantifies information lost when one distribution is used to approximate the other.<br>- Does not directly measure overlap or similarity between distributions.<br>- Asymmetric measure.</td>
</tr>
<tr>
<td>Jensen-Shannon</td>
<td>- Variation of KL divergence, measures dissimilarity between probability distributions.</td>
<td>- Calculates average of KL divergences between distributions and their average.<br>- Does not directly measure overlap or similarity between distributions.<br>- Symmetric measure.</td>
</tr>
<tr>
<td>Cosine Similarity</td>
<td>- Measures similarity between vector representations of data.</td>
<td>- Measures cosine of the angle between two vectors, indicating similarity in direction or orientation.<br>- Primarily used for comparing vector representations, such as text documents or high-dimensional feature vectors.<br>- Does not capture probabilistic nature of distributions or specifically designed for comparing probability distributions.</td>
</tr>
</tbody>
</table>
<p>In summary, the Bhattacharyya distance stands out as a measure explicitly designed for comparing probability distributions. It considers the probability information of the data and provides a measure of dissimilarity based on the overlap between distributions. Other distance metrics, such as Euclidean distance, Manhattan distance, KL divergence, Jensen-Shannon divergence, and cosine similarity, have different focuses and may not capture the probabilistic nature or similarity between distributions as effectively as the Bhattacharyya distance.</p>
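<p>To see how some of these metrics behave on the same pair of discrete distributions, here is a quick sketch using SciPy's <code>jensenshannon</code> (which returns the square root of the Jensen-Shannon divergence) and <code>rel_entr</code> (the element-wise terms of KL divergence); the distributions are the same illustrative ones used earlier:</p>

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

P = np.array([0.2, 0.3, 0.1, 0.4])
Q = np.array([0.25, 0.15, 0.2, 0.4])

bhattacharyya = -np.log(np.sum(np.sqrt(P * Q)))  # symmetric
kl_pq = np.sum(rel_entr(P, Q))                   # asymmetric: KL(P || Q)
kl_qp = np.sum(rel_entr(Q, P))                   # KL(Q || P) differs in general
js = jensenshannon(P, Q, base=2)                 # symmetric, sqrt of JS divergence

print(f"Bhattacharyya: {bhattacharyya:.4f}")
print(f"KL(P||Q): {kl_pq:.4f}  KL(Q||P): {kl_qp:.4f}")
print(f"Jensen-Shannon distance: {js:.4f}")
```

<p>Note the asymmetry of the two KL values, while the Bhattacharyya and Jensen-Shannon measures are each a single symmetric number.</p>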
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>The Bhattacharyya distance and coefficient are valuable tools for quantifying the similarity and dissimilarity between probability distributions. By leveraging these measures, we can compare distributions in various fields, including statistics, machine learning, and data science. Understanding and utilizing these concepts can aid in solving diverse tasks, such as pattern recognition, image processing, and document classification. Python implementations, as showcased in this blog post, allow for straightforward calculations and application of Bhattacharyya distance and coefficient to real-world scenarios.</p>
Script to Python Package Using Poetry (And PyCharm)2023-06-28T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-28:/script-to-python-package-using-poetry-and-pycharm/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#the-task">The task</a></li>
<li><a href="#steps-for-package-creation">Steps for Package Creation</a></li>
<li><a href="#create-project-directory">Create Project Directory</a></li>
<li><a href="#open-the-project-in-pycharm">Open the Project in PyCharm</a></li>
<li><a href="#configure-poetry-virtual-environment">Configure Poetry Virtual Environment</a></li>
<li><a href="#install-dependencies">Install Dependencies</a></li>
<li><a href="#configure-pycharm-interpreter">Configure PyCharm Interpreter</a></li>
<li><a href="#initialize-git-repository">Initialize Git Repository</a></li>
<li><a href="#create-package-structure">Create Package Structure</a></li>
<li><a href="#move-script-and-files">Move Script and Files</a></li>
<li><a href="#create-__init__py">Create <code>__init__.py</code></a></li>
<li><a href="#update-pyprojecttoml">Update <code>pyproject.toml</code></a></li>
<li><a href="#add-readmemd-file">Add README.md file</a></li>
<li><a href="#test-the-script">Test the Script</a></li>
<li><a href="#package-the-project">Package the Project</a></li>
<li><a href="#publish-the-package">Publish the Package</a></li>
<li><a href="#versioning-and-updates">Versioning and Updates</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="the-task"></a></p>
<h2>The task</h2>
<p>Let's assume that you have a simple script that counts tokens in a provided text file. Below is the script, which accepts a single positional argument (the input file name) and can be run from the command-line interface (CLI). See also the note on <a href="https://safjan.com/how-to-count-tokens/">How to count tokens?</a></p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/usr/bin/env python3</span>
<span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">tiktoken</span>
<span class="k">def</span> <span class="nf">num_tokens_from_string</span><span class="p">(</span><span class="n">string</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">encoding_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s2">"cl100k_base"</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""Returns the number of tokens in a text string."""</span>
<span class="n">encoding</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="o">.</span><span class="n">get_encoding</span><span class="p">(</span><span class="n">encoding_name</span><span class="p">)</span>
<span class="n">num_tokens</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">encoding</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
<span class="k">return</span> <span class="n">num_tokens</span>
<span class="k">def</span> <span class="nf">count_tokens</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s2">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">return</span> <span class="n">num_tokens_from_string</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span>
<span class="n">description</span><span class="o">=</span><span class="s2">"Count the number of tokens in a text file."</span>
<span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"file"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Path to the input text file"</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">file_path</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">file</span>
<span class="n">num_tokens</span> <span class="o">=</span> <span class="n">count_tokens</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Number of tokens: </span><span class="si">{</span><span class="n">num_tokens</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>In this script, the <code>argparse</code> module is used to handle command-line arguments. The script defines a single positional argument, <code>file</code>, which represents the file name of the input text file.</p>
<p>When the script is executed from the command line, it will parse the command-line arguments and retrieve the file path provided by the user. The <code>count_tokens</code> function will then be called with the file path, and the number of tokens will be printed.</p>
<p>To run the script from the CLI, use the following command:</p>
<div class="highlight"><pre><span></span><code>python<span class="w"> </span>script_name.py<span class="w"> </span>file_path
</code></pre></div>
<p>Replace <code>script_name.py</code> with the actual name of your script file, and <code>file_path</code> with the path to the input text file you want to analyze. The script will then tokenize the text file and print the number of tokens.</p>
<blockquote>
<p><strong>NOTE:</strong> you need the <code>tiktoken</code> package installed before running the script. You can install it using pip:</p>
</blockquote>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>tiktoken
</code></pre></div>
<p><a id="steps-for-package-creation"></a></p>
<h2>Steps for Package Creation</h2>
<p>To create and publish a Python package based on the provided script, you can follow the steps below:</p>
<p><a id="create-project-directory"></a></p>
<h3>Create Project Directory</h3>
<p>Start by creating a new directory for your project. You can choose an appropriate name for the directory.</p>
<ol>
<li><strong>Initialize the Project with Poetry</strong>: Open your command-line interface and navigate to the project directory you created. Run the following command to initialize the project using Poetry:</li>
</ol>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>init
</code></pre></div>
<p>This command will prompt you to fill in information about your package, such as the package name, version, description, author details, and more. Fill in the required information as prompted.</p>
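<p>After you answer the prompts, <code>poetry init</code> writes a <code>pyproject.toml</code> file. For the token-counting script it might look roughly like this (the package name, version constraints, and author below are illustrative, not prescribed):</p>

```toml
[tool.poetry]
name = "token-counter"            # illustrative name; pick your own
version = "0.1.0"
description = "Count tokens in a text file using tiktoken"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"
tiktoken = "*"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```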
<p><a id="open-the-project-in-pycharm"></a></p>
<h3>Open the Project in PyCharm</h3>
<p>Open PyCharm and select "Open" from the welcome screen or go to "File" > "Open" and choose the project directory you created.</p>
<p><a id="configure-poetry-virtual-environment"></a></p>
<h3>Configure Poetry Virtual Environment</h3>
<p>When opening the project in PyCharm for the first time, it will detect the presence of Poetry. You will be prompted to either allow PyCharm to create a Poetry virtual environment or create it manually. Select the option to create the virtual environment.</p>
<p>If you already have a Poetry virtual environment set up manually, you can skip this step.</p>
<p><a id="install-dependencies"></a></p>
<h3>Install Dependencies</h3>
<p>In your command-line interface, navigate to the project directory if you're not already there. Run the following command to install the necessary dependencies using Poetry:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>install
</code></pre></div>
<p>This command will create a virtual environment and install the required packages specified in your project's <code>pyproject.toml</code> file.</p>
<p><a id="configure-pycharm-interpreter"></a></p>
<h3>Configure PyCharm Interpreter</h3>
<p>In PyCharm, go to "File" > "Settings" > "Project: &lt;project_name&gt;" > "Python Interpreter". Click on the gear icon and choose "Add...".</p>
<p>Select "Poetry Environment" and choose the existing local Poetry interpreter associated with your project's virtual environment. Click "OK" to apply the changes.</p>
<p><a id="initialize-git-repository"></a></p>
<h3>Initialize Git Repository</h3>
<p>In your command-line interface, navigate to the project directory if you're not already there. Run the following command to initialize a Git repository:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>init
</code></pre></div>
<p>This will set up a new Git repository for version control.</p>
<p>At this point, you have set up the project structure, initialized Poetry, configured the virtual environment in PyCharm, installed dependencies, and initialized a Git repository. Now, you can proceed with packaging and publishing your Python script.</p>
<blockquote>
<p>NOTE: you might want to add a <code>.gitignore</code> file at this stage. A minimal <code>.gitignore</code> can be:</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="err">#</span><span class="w"> </span><span class="n">Compiled</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">files</span>
<span class="n">__pycache__</span><span class="o">/</span>
<span class="o">*</span><span class="p">.</span><span class="n">py</span><span class="o">[</span><span class="n">cod</span><span class="o">]</span>
<span class="err">#</span><span class="w"> </span><span class="n">Distribution</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">packaging</span>
<span class="n">dist</span><span class="o">/</span>
<span class="n">build</span><span class="o">/</span>
<span class="o">*</span><span class="p">.</span><span class="n">egg</span><span class="o">-</span><span class="n">info</span><span class="o">/</span>
<span class="o">*</span><span class="p">.</span><span class="n">egg</span>
<span class="err">#</span><span class="w"> </span><span class="n">Virtual</span><span class="w"> </span><span class="n">environments</span>
<span class="n">venv</span><span class="o">/</span>
<span class="n">env</span><span class="o">/</span>
<span class="err">#</span><span class="w"> </span><span class="n">IDEs</span><span class="w"> </span><span class="ow">and</span><span class="w"> </span><span class="n">editors</span>
<span class="p">.</span><span class="n">idea</span><span class="o">/</span>
</code></pre></div>
<p><a id="create-package-structure"></a></p>
<h3>Create Package Structure</h3>
<p>Inside your project directory, create a package structure that follows Python's best practices. For example, you can create a directory named <code>my_package</code> that will contain your script and other necessary files.</p>
<p><a id="move-script-and-files"></a></p>
<h3>Move Script and Files</h3>
<p>Move your script file and any other relevant files into the package directory (<code>my_package</code> in this example).</p>
<p><a id="create-__init__py"></a></p>
<h3>Create <code>__init__.py</code></h3>
<p>Inside the package directory (<code>my_package</code>), create an empty file named <code>__init__.py</code>. This file is required to make the directory a Python package.</p>
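<p>At this point, the project layout should look roughly like this (assuming the script file is named <code>my_script.py</code>; <code>README.md</code> and <code>LICENSE</code> are added in later steps):</p>

```text
.
├── my_package/
│   ├── __init__.py
│   └── my_script.py
└── pyproject.toml
```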
<p><a id="update-pyprojecttoml"></a></p>
<h3>Update <code>pyproject.toml</code></h3>
<p>Open your project's <code>pyproject.toml</code> file. Under the <code>[tool.poetry]</code> section, add the script file and any additional files that need to be included in the package. For example:</p>
<div class="highlight"><pre><span></span><code><span class="k">[tool.poetry]</span>
<span class="p">...</span>
<span class="k">[tool.poetry.scripts]</span>
<span class="n">my_script</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'my_package.my_script:main'</span>
</code></pre></div>
<p>Replace <code>my_script</code> with the desired command name for your script, and <code>my_package.my_script:main</code> with the correct import path to your script and its main function.</p>
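<p>The original script runs its CLI logic directly under the <code>if __name__ == "__main__":</code> guard, so there is no <code>main</code> function for the entry point to call yet. A minimal sketch of how <code>my_package/my_script.py</code> could be restructured (the whitespace-based counter below is a stand-in so the sketch is self-contained; in your package, keep the <code>tiktoken</code>-based <code>count_tokens</code> shown earlier):</p>

```python
import argparse


def count_tokens(file_path):
    # Stand-in counter so this sketch is self-contained; the real package
    # keeps the tiktoken-based implementation from the original script.
    with open(file_path, "r") as file:
        return len(file.read().split())


def main():
    # All CLI logic lives in main() so the [tool.poetry.scripts] entry
    # point ("my_script = 'my_package.my_script:main'") has a callable.
    parser = argparse.ArgumentParser(
        description="Count the number of tokens in a text file."
    )
    parser.add_argument("file", help="Path to the input text file")
    args = parser.parse_args()
    print(f"Number of tokens: {count_tokens(args.file)}")
```

<p>If you also want to run the file directly with <code>python my_script.py</code>, keep an <code>if __name__ == "__main__": main()</code> block at the bottom.</p>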
<p><a id="add-readmemd-file"></a></p>
<h3>Add README.md file</h3>
<p>In the root of the project directory, create a <code>README.md</code> file and fill it with useful information: what the package does, how to install it, and how to use it. See also the note on writing a good README.</p>
<blockquote>
<p>NOTE: You can add some badges related to your PyPI package, e.g.:</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="o">[</span><span class="n">img</span><span class="o">]</span><span class="p">(</span><span class="nl">https</span><span class="p">:</span><span class="o">//</span><span class="n">img</span><span class="p">.</span><span class="n">shields</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">pypi</span><span class="o">/</span><span class="n">v</span><span class="o">/</span><span class="n">package_name</span><span class="p">.</span><span class="n">svg</span><span class="p">)</span>
<span class="err">![]</span><span class="p">(</span><span class="nl">https</span><span class="p">:</span><span class="o">//</span><span class="n">img</span><span class="p">.</span><span class="n">shields</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">pypi</span><span class="o">/</span><span class="n">pyversions</span><span class="o">/</span><span class="n">package_name</span><span class="p">.</span><span class="n">svg</span><span class="p">)</span>
<span class="err">![]</span><span class="p">(</span><span class="nl">https</span><span class="p">:</span><span class="o">//</span><span class="n">img</span><span class="p">.</span><span class="n">shields</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">pypi</span><span class="o">/</span><span class="n">dm</span><span class="o">/</span><span class="n">package_name</span><span class="p">.</span><span class="n">svg</span><span class="p">)</span>
</code></pre></div>
<h3>Add LICENSE file</h3>
<p>You can create a LICENSE file manually. Here's how you can do it:</p>
<ol>
<li>Create a new file in your project root directory named <code>LICENSE</code>.</li>
<li>Go to the <a href="https://opensource.org/licenses/MIT">MIT License template</a>, copy the text.</li>
<li>Paste the copied text into your <code>LICENSE</code> file.</li>
<li>Replace <code>[year]</code> with the current year and <code>[fullname]</code> with your name or your organization's name.</li>
<li>Save the file.</li>
</ol>
<p><a id="test-the-script"></a></p>
<h3>Test the Script</h3>
<p>Before publishing your package, it's essential to test your script to ensure it works as expected. You can execute the script locally to verify its functionality.</p>
<p>If you want to use pytest for testing, add it as a development dependency and install:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>add<span class="w"> </span>--group<span class="w"> </span>dev<span class="w"> </span>pytest
</code></pre></div>
<p><a id="package-the-project"></a></p>
<h3>Package the Project</h3>
<p>In your command-line interface, navigate to the project directory. Run the following command to create a distributable package:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>build
</code></pre></div>
<p>This command will generate a distributable package (e.g., a <code>.tar.gz</code> file) in the <code>dist</code> directory within your project.</p>
<p><a id="publish-the-package"></a></p>
<h3>Publish the Package</h3>
<p>To publish your package, you can use a package index such as PyPI (Python Package Index). First, you need to create an account on PyPI if you haven't already. Once you have an account, run the following command to publish your package:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>publish
</code></pre></div>
<p>This command will guide you through the process of publishing your package to PyPI. You'll be prompted to enter your PyPI credentials and confirm the publication.</p>
<blockquote>
<p><strong>Note:</strong> Make sure your package has a unique name to avoid conflicts with existing packages on PyPI.</p>
</blockquote>
<p><a id="versioning-and-updates"></a></p>
<h3>Versioning and Updates</h3>
<p>When you make updates to your package, remember to increment the <code>version</code> field under the <code>[tool.poetry]</code> section of the <code>pyproject.toml</code> file. This helps to track and manage different versions of your package.</p>
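<p>Poetry can also bump the version for you, so you do not have to edit <code>pyproject.toml</code> by hand. A typical update flow might look like this (assuming Poetry 1.x; <code>patch</code> can be replaced with <code>minor</code> or <code>major</code>):</p>

```shell
poetry version patch   # e.g. 0.1.0 -> 0.1.1, written back to pyproject.toml
poetry build           # rebuild the distributable package
poetry publish         # upload the new version to PyPI
```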
<p>That's it! You have now packaged and published your Python script using Poetry. Users can install your package using pip and use your script as a command-line tool.</p>
<p>Please note that publishing a package is a significant step, and it's essential to review and test your code thoroughly before sharing it with others.</p>
<h2>Correcting metadata</h2>
<p>Before publishing, review the metadata in the <code>[tool.poetry]</code> section of <code>pyproject.toml</code>; at a minimum, fill in the author, keywords, and project URLs:</p>
<div class="highlight"><pre><span></span><code><span class="n">authors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="n">"Krystian Safjan <ksafjan@gmail.com>"</span><span class="o">]</span>
<span class="n">keywords</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="n">"keyword1", "keyword2"</span><span class="o">]</span>
<span class="n">homepage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">"https://github.com/user/repo"</span>
<span class="n">repository</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">"https://github.com/user/repo"</span>
<span class="n">documentation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">"https://github.com/user/repo"</span>
</code></pre></div>Bash - Rename Multiple Image Files to Match Pattern With Sequence Number2023-06-27T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-27:/bash-rename-mutliple-image-files-to-match-pattern-with-sequence-number/<p>The use case for the provided script is to rename multiple image files in a directory while maintaining their original file extensions. This script can be handy in situations where you have a collection of image files with different formats or extensions, and you want to standardize their names for better organization or consistency.</p>
<p>By executing the script, all image files with extensions such as <code>.jpg</code>, <code>.jpeg</code>, <code>.png</code>, <code>.gif</code>, <code>.tiff</code>, <code>.heic</code>, and <code>.heif</code> in the current directory will be renamed. The new names will follow the pattern "img_xxx.ext", where "xxx" represents a sequence number starting from 000, and "ext" represents the original file extension.</p>
<p>For example, if you have the following image files in the directory:</p>
<div class="highlight"><pre><span></span><code>photo1.jpg
picture.png
image2.jpeg
snapshot.tiff
capture.heic
</code></pre></div>
<p>Running the script will rename them as follows (files are processed extension by extension, in the order given in the brace expansion, so all <code>.jpg</code> files are numbered before <code>.jpeg</code>, then <code>.png</code>, and so on):</p>
<div class="highlight"><pre><span></span><code>img_000.jpg
img_001.jpeg
img_002.png
img_003.tiff
img_004.heic
</code></pre></div>
<p>This allows for consistent naming and easier identification of the image files in the directory.</p>
<p>Here's the Bash script that supports multiple image formats and preserves the original file extension while renaming the files:</p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/bin/bash</span>
<span class="nv">counter</span><span class="o">=</span><span class="m">0</span>
<span class="k">for</span><span class="w"> </span>file<span class="w"> </span><span class="k">in</span><span class="w"> </span>*.<span class="o">{</span>jpg,jpeg,png,gif,tiff,heic,heif<span class="o">}</span><span class="p">;</span><span class="w"> </span><span class="k">do</span>
<span class="w">  </span><span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span>-f<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="w"> </span><span class="o">]</span><span class="p">;</span><span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="nv">extension</span><span class="o">=</span><span class="s2">"</span><span class="si">${</span><span class="nv">file</span><span class="p">##*.</span><span class="si">}</span><span class="s2">"</span>
<span class="w"> </span><span class="nv">newname</span><span class="o">=</span><span class="k">$(</span><span class="nb">printf</span><span class="w"> </span><span class="s2">"img_%03d.%s"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$counter</span><span class="s2">"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$extension</span><span class="s2">"</span><span class="k">)</span>
<span class="w"> </span>mv<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$newname</span><span class="s2">"</span>
<span class="w"> </span><span class="o">((</span>counter++<span class="o">))</span>
<span class="w"> </span><span class="k">fi</span>
<span class="k">done</span>
</code></pre></div>
<p>In this script:</p>
<ol>
<li>The <code>for</code> loop uses brace expansion <code>{}</code> to iterate over multiple file extensions: <code>jpg</code>, <code>jpeg</code>, <code>png</code>, <code>gif</code>, <code>tiff</code>, <code>heic</code>, and <code>heif</code>.</li>
<li>Inside the loop, the script checks if the current file is a regular file using the <code>-f</code> test.</li>
<li>If it's a regular file, it extracts the original file extension using the <code>${file##*.}</code> syntax.</li>
<li>The <code>newname</code> variable is generated using <code>printf</code> with the current value of the <code>counter</code> variable and the extracted extension.</li>
<li>Finally, the file is renamed using the <code>mv</code> command, preserving the original extension.</li>
</ol>
<p>To use this script, follow these steps:</p>
<ol>
<li>Open a text editor and paste the script into a new file.</li>
<li>Save the file with a <code>.sh</code> extension, for example, <code>rename_images.sh</code>.</li>
<li>Open a terminal and navigate to the directory where the image files are located.</li>
<li>Make the script executable by running the following command: <code>chmod +x rename_images.sh</code>.</li>
<li>Run the script using the command <code>./rename_images.sh</code>.</li>
</ol>
<p>After running the script, all the image files in the directory should be renamed according to the pattern you specified.</p>
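<p>Renaming is destructive, so it can be worth previewing the result first. A dry-run sketch of the same loop (it prints the planned renames instead of executing <code>mv</code>; <code>counter=$((counter + 1))</code> is used here and behaves the same as <code>((counter++))</code>):</p>

```shell
#!/bin/bash
# Dry run: show what the rename script would do, without touching any files
counter=0
for file in *.{jpg,jpeg,png,gif,tiff,heic,heif}; do
    if [ -f "$file" ]; then
        extension="${file##*.}"
        printf 'would rename: %s -> img_%03d.%s\n' "$file" "$counter" "$extension"
        counter=$((counter + 1))
    fi
done
```

<p>Once the output looks right, run the real script from the previous section.</p>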
<h2>Oneliner</h2>
<p>Here's a one-liner Bash command that renames all image files in the current directory to match the pattern "img_xxx.jpg" where "xxx" is a sequence number starting from 000:</p>
<div class="highlight"><pre><span></span><code><span class="nv">counter</span><span class="o">=</span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="k">for</span><span class="w"> </span>file<span class="w"> </span><span class="k">in</span><span class="w"> </span>*.jpg<span class="p">;</span><span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="k">if</span><span class="w"> </span>-f<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="p">;</span><span class="w"> </span><span class="k">then</span><span class="w"> </span><span class="nv">newname</span><span class="o">=</span><span class="k">$(</span><span class="nb">printf</span><span class="w"> </span><span class="s2">"img_%03d.jpg"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$counter</span><span class="s2">"</span><span class="k">)</span><span class="p">;</span><span class="w"> </span>mv<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$newname</span><span class="s2">"</span><span class="p">;</span><span class="w"> </span><span class="o">((</span>counter++<span class="o">))</span><span class="p">;</span><span class="w"> </span><span class="k">fi</span><span class="p">;</span><span class="w"> </span><span class="k">done</span>
</code></pre></div>
<p>This command combines the same logic as the previous script into a single line. The <code>counter</code> variable is set to 0, and then the <code>for</code> loop iterates over the <code>.jpg</code> files in the directory. The rest of the logic remains the same.</p>
<p>To use this one-liner, open a terminal, navigate to the directory containing the image files, and run the command. The image files will be renamed accordingly.</p>
<p>To create a Bash alias for the one-liner version of the last script, you can add the following line to your <code>~/.bashrc</code> or <code>~/.bash_aliases</code> (<code>.zshrc</code> or <code>~/.zsh_aliases</code> if using zsh) file:</p>
<div class="highlight"><pre><span></span><code><span class="nb">alias</span><span class="w"> </span><span class="nv">rename_images</span><span class="o">=</span><span class="s1">'counter=0; for file in *.{jpg,jpeg,png,gif,tiff,heic,heif}; do if -f "$file"; then extension="${file##*.}"; newname=$(printf "img_%03d.%s" "$counter" "$extension"); mv "$file" "$newname"; ((counter++)); fi; done'</span>
</code></pre></div>
<p>Save the file and then run <code>source ~/.bashrc</code> or <code>source ~/.bash_aliases</code> to apply the changes.</p>
<p>Afterward, you can use the <code>rename_images</code> command in your terminal to execute the one-liner script and rename the image files in the current directory accordingly.</p>
<h2>Python version</h2>
<p>Here's a Python script that achieves the same functionality as the Bash script, renaming image files while preserving their original extensions:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">counter</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">extensions</span> <span class="o">=</span> <span class="p">[</span><span class="s2">".jpg"</span><span class="p">,</span> <span class="s2">".jpeg"</span><span class="p">,</span> <span class="s2">".png"</span><span class="p">,</span> <span class="s2">".gif"</span><span class="p">,</span> <span class="s2">".tiff"</span><span class="p">,</span> <span class="s2">".heic"</span><span class="p">,</span> <span class="s2">".heif"</span><span class="p">]</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s2">"."</span><span class="p">):</span>
<span class="k">if</span> <span class="n">filename</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">extensions</span><span class="p">))</span> <span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>
<span class="n">file_parts</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="n">newname</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"img_</span><span class="si">{</span><span class="n">counter</span><span class="si">:</span><span class="s2">03d</span><span class="si">}{</span><span class="n">file_parts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span>
<span class="n">os</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">newname</span><span class="p">)</span>
<span class="n">counter</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>In this Python script:</p>
<ol>
<li>The <code>counter</code> variable keeps track of the sequence number for renaming the files.</li>
<li>The <code>extensions</code> list contains the supported image extensions.</li>
<li>The script iterates over each file in the current directory using <code>os.listdir(".")</code>.</li>
<li>For each file, it checks if the filename has a matching extension and if it is a regular file.</li>
<li>If both conditions are satisfied, it splits the filename into base name and extension using <code>os.path.splitext()</code>.</li>
<li>The new name is constructed using the desired pattern "img_xxx.ext", where "xxx" is the zero-padded sequence number and "ext" is the original file extension, and <code>os.rename()</code> performs the renaming operation.</li>
<li>Finally, the <code>counter</code> is incremented for the next file.</li>
</ol>
<p>You can save this Python script to a file with a <code>.py</code> extension, for example, <code>rename_images.py</code>, and then run it using a Python interpreter. The image files in the directory will be renamed accordingly, following the specified pattern while preserving their original extensions.</p>Harnessing the Power of Dependency Injection for Improved Testability in Python2023-06-21T00:00:00+02:002023-06-21T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-21:/python-dependency-injection-for-the-testability/<p>Learn how to use dependency injection to decouple dependencies from our functions, methods, or classes, making it easier to test and maintain our code.</p><h2>Introduction</h2>
<p>In software development, testability is a crucial aspect that helps ensure the reliability and maintainability of our code. One effective technique for enhancing testability is dependency injection (DI). Dependency injection allows us to decouple dependencies from our functions, methods, or classes, making it easier to test and maintain our code. In this blog post, we will explore various techniques, use cases, and lesser-known tricks for using dependency injection in Python.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#what-is-dependency-injection">What is Dependency Injection?</a></li>
<li><a href="#benefits-of-dependency-injection">Benefits of Dependency Injection</a></li>
<li><a href="#techniques-for-dependency-injection">Techniques for Dependency Injection</a></li>
<li><a href="#constructor-injection">Constructor Injection</a></li>
<li><a href="#setter-injection">Setter Injection</a></li>
<li><a href="#interface-injection">Interface Injection</a></li>
<li><a href="#use-cases-for-dependency-injection">Use Cases for Dependency Injection</a></li>
<li><a href="#testing-legacy-code">Testing Legacy Code</a></li>
<li><a href="#mocking-dependencies">Mocking Dependencies</a></li>
<li><a href="#improving-code-reusability">Improving Code Reusability</a></li>
<li><a href="#parameter-injection">Parameter Injection</a></li>
<li><a href="#context-managers-and-dependency-injection">Context Managers and Dependency Injection</a></li>
<li><a href="#dependency-injection-containers">Dependency Injection Containers</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="what-is-dependency-injection"></a></p>
<h2>What is Dependency Injection?</h2>
<p>Dependency injection is a design pattern that allows us to inject dependencies into a class or function from external sources rather than creating them internally. By doing so, we reduce the coupling between components and make them more flexible, reusable, and testable.</p>
<p><a id="benefits-of-dependency-injection"></a></p>
<h2>Benefits of Dependency Injection</h2>
<ul>
<li><strong>Improved testability</strong>: By injecting dependencies, we can easily replace them with mocks or stubs during testing, making our tests more isolated and focused.</li>
<li><strong>Decoupled code</strong>: Dependency injection reduces the tight coupling between components, promoting better separation of concerns and modular design.</li>
<li><strong>Code reusability</strong>: With dependency injection, components become more reusable as they rely on abstractions rather than concrete implementations.</li>
<li><strong>Easier maintenance</strong>: By externalizing dependencies, we can modify or extend their behavior without affecting the code that uses them.</li>
</ul>
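<p>To make the testability benefit concrete, here is a minimal before/after sketch (the report service and stub clock are illustrative names, not part of any library):</p>

```python
import datetime

# Tightly coupled: the class creates its own dependency internally,
# so a test cannot substitute it.
class CoupledReportService:
    def header(self):
        return f"Report generated at {datetime.datetime.now()}"

# Decoupled via injection: the caller supplies the clock, so a test
# can inject a deterministic stub.
class InjectedReportService:
    def __init__(self, clock=datetime.datetime):
        self.clock = clock

    def header(self):
        return f"Report generated at {self.clock.now()}"

# A stub clock makes the output fully deterministic in tests.
class FixedClock:
    @staticmethod
    def now():
        return "2023-06-21 12:00:00"

service = InjectedReportService(clock=FixedClock)
print(service.header())  # Report generated at 2023-06-21 12:00:00
```

<p>The coupled version can only be tested against the real clock; the injected version accepts any object with a <code>now()</code> method.</p>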
<p><a id="techniques-for-dependency-injection"></a></p>
<h2>Techniques for Dependency Injection</h2>
<p><a id="constructor-injection"></a></p>
<h3>Constructor Injection</h3>
<p>Constructor injection involves passing dependencies through a class's constructor. It ensures that the required dependencies are available before an object is created.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">UserService</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_repository</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span> <span class="o">=</span> <span class="n">user_repository</span>
<span class="k">def</span> <span class="nf">get_user</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_id</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
</code></pre></div>
<p><a id="setter-injection"></a></p>
<h3>Setter Injection</h3>
<p>Setter injection involves setting the dependencies using setter methods. This technique allows for more flexibility, as dependencies can be changed or updated after object initialization.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">NotificationService</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">set_email_service</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">email_service</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">email_service</span> <span class="o">=</span> <span class="n">email_service</span>
<span class="k">def</span> <span class="nf">send_notification</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">email_service</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">user</span><span class="o">.</span><span class="n">email</span><span class="p">,</span> <span class="s2">"New notification!"</span><span class="p">)</span>
</code></pre></div>
<p><a id="interface-injection"></a></p>
<h3>Interface Injection</h3>
<p>Interface injection is a technique where a dependency is injected by providing an interface or an abstract base class. This allows for the injection of different implementations of the same interface, providing flexibility and extensibility.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">abc</span> <span class="kn">import</span> <span class="n">ABC</span><span class="p">,</span> <span class="n">abstractmethod</span>
<span class="k">class</span> <span class="nc">Database</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">MySQLDatabase</span><span class="p">(</span><span class="n">Database</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
<span class="c1"># Perform MySQL query</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">PostgresDatabase</span><span class="p">(</span><span class="n">Database</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
<span class="c1"># Perform Postgres query</span>
<span class="k">pass</span>
</code></pre></div>
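<p>To see the interface actually being injected, here is a runnable sketch of the classes above together with a hypothetical <code>ReportGenerator</code> consumer (the stubbed query bodies stand in for real database calls):</p>

```python
from abc import ABC, abstractmethod

class Database(ABC):
    @abstractmethod
    def query(self, query):
        pass

class MySQLDatabase(Database):
    def query(self, query):
        return f"mysql: {query}"  # stand-in for a real MySQL query

class PostgresDatabase(Database):
    def query(self, query):
        return f"postgres: {query}"  # stand-in for a real Postgres query

# The consumer depends only on the Database interface, so either
# implementation can be injected without changing this class.
class ReportGenerator:
    def __init__(self, database: Database):
        self.database = database

    def generate(self):
        return self.database.query("SELECT * FROM reports")

print(ReportGenerator(MySQLDatabase()).generate())     # mysql: SELECT * FROM reports
print(ReportGenerator(PostgresDatabase()).generate())  # postgres: SELECT * FROM reports
```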
<p><a id="use-cases-for-dependency-injection"></a></p>
<h2>Use Cases for Dependency Injection</h2>
<p><a id="testing-legacy-code"></a></p>
<h3>Testing Legacy Code</h3>
<p>When working with legacy code that has tightly coupled dependencies, dependency injection can be used to introduce testability by replacing or mocking those dependencies during testing.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">legacy_function</span><span class="p">(</span><span class="n">db_connection</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="c1"># Default preserves the original, tightly coupled behaviour</span>
    <span class="k">if</span> <span class="n">db_connection</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="n">db_connection</span> <span class="o">=</span> <span class="n">MySQLDatabase</span><span class="p">()</span>  <span class="c1"># Tightly coupled dependency</span>
    <span class="c1"># ...</span>

<span class="c1"># Using dependency injection to test legacy_function</span>
<span class="k">def</span> <span class="nf">test_legacy_function</span><span class="p">():</span>
    <span class="n">mock_db</span> <span class="o">=</span> <span class="n">MockMySQLDatabase</span><span class="p">()</span>
    <span class="n">legacy_function</span><span class="p">(</span><span class="n">db_connection</span><span class="o">=</span><span class="n">mock_db</span><span class="p">)</span>
    <span class="c1"># Test the function</span>
</code></pre></div>
<p><a id="mocking-dependencies"></a></p>
<h3>Mocking Dependencies</h3>
<p>In unit testing, dependency injection allows us to replace real dependencies with mock objects, enabling us to focus on testing the behavior of the unit under test in isolation.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">UserService</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_repository</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span> <span class="o">=</span> <span class="n">user_repository</span>
<span class="k">def</span> <span class="nf">get_user</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_id</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
<span class="c1"># Testing UserService with a mock user repository</span>
<span class="k">def</span> <span class="nf">test_get_user</span><span class="p">():</span>
<span class="n">mock_repository</span> <span class="o">=</span> <span class="n">MockUserRepository</span><span class="p">()</span>
<span class="n">service</span> <span class="o">=</span> <span class="n">UserService</span><span class="p">(</span><span class="n">user_repository</span><span class="o">=</span><span class="n">mock_repository</span><span class="p">)</span>
<span class="c1"># Test the method using the mock repository</span>
</code></pre></div>
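<p>The <code>MockUserRepository</code> above is left abstract; with the standard library's <code>unittest.mock</code> you can build such a test double directly. A minimal sketch:</p>

```python
from unittest.mock import Mock

class UserService:
    def __init__(self, user_repository):
        self.user_repository = user_repository

    def get_user(self, user_id):
        return self.user_repository.get(user_id)

# Mock stands in for the repository; return_value fixes what get() returns.
mock_repository = Mock()
mock_repository.get.return_value = {"id": 42, "name": "Alice"}

service = UserService(user_repository=mock_repository)
user = service.get_user(42)

assert user == {"id": 42, "name": "Alice"}
# The mock also records how it was called, so collaboration can be verified.
mock_repository.get.assert_called_once_with(42)
```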
<p><a id="improving-code-reusability"></a></p>
<h3>Improving Code Reusability</h3>
<p>Dependency injection promotes code reusability by relying on abstractions rather than concrete implementations. This allows different implementations to be injected based on specific requirements.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">PaymentGateway</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">process_payment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">PayPalGateway</span><span class="p">(</span><span class="n">PaymentGateway</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">process_payment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="c1"># Process payment via PayPal</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">StripeGateway</span><span class="p">(</span><span class="n">PaymentGateway</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">process_payment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="c1"># Process payment via Stripe</span>
<span class="k">pass</span>
</code></pre></div>
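<p>A short usage sketch for the gateways above (the <code>checkout</code> helper and the stubbed return values are illustrative, not a real payment API):</p>

```python
from abc import ABC, abstractmethod

class PaymentGateway(ABC):
    @abstractmethod
    def process_payment(self, amount):
        pass

class PayPalGateway(PaymentGateway):
    def process_payment(self, amount):
        return f"paypal charged {amount}"  # stand-in for a real API call

class StripeGateway(PaymentGateway):
    def process_payment(self, amount):
        return f"stripe charged {amount}"  # stand-in for a real API call

# checkout() is written once against the abstraction and reused with
# whichever gateway is injected.
def checkout(gateway: PaymentGateway, amount):
    return gateway.process_payment(amount)

print(checkout(PayPalGateway(), 100))  # paypal charged 100
print(checkout(StripeGateway(), 50))   # stripe charged 50
```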
<h2>Lesser-Known Techniques and Tricks</h2>
<p><a id="parameter-injection"></a></p>
<h3>Parameter Injection</h3>
<p>In addition to constructor, setter, and interface injection, parameter injection is a technique where dependencies are passed directly as parameters to functions or methods. This can be useful in situations where direct injection is preferred over using class instances.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">process_data</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">logger</span><span class="p">):</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Processing data..."</span><span class="p">)</span>
<span class="c1"># Process the data</span>
<span class="c1"># Calling the function with injected dependencies</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">Logger</span><span class="p">()</span>
<span class="n">process_data</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">logger</span><span class="p">)</span>
</code></pre></div>
<p><a id="context-managers-and-dependency-injection"></a></p>
<h3>Context Managers and Dependency Injection</h3>
<p>Context managers can be combined with dependency injection to provide resources or dependencies within a specific scope, ensuring their proper initialization and cleanup.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">contextlib</span> <span class="kn">import</span> <span class="n">contextmanager</span>
<span class="nd">@contextmanager</span>
<span class="k">def</span> <span class="nf">db_connection</span><span class="p">():</span>
<span class="n">connection</span> <span class="o">=</span> <span class="n">MySQLDatabase</span><span class="p">()</span> <span class="c1"># Dependency initialization</span>
<span class="k">yield</span> <span class="n">connection</span>
<span class="n">connection</span><span class="o">.</span><span class="n">close</span><span class="p">()</span> <span class="c1"># Cleanup</span>
<span class="c1"># Using the context manager with dependency injection</span>
<span class="k">with</span> <span class="n">db_connection</span><span class="p">()</span> <span class="k">as</span> <span class="n">db</span><span class="p">:</span>
<span class="c1"># Use the database connection within the context</span>
</code></pre></div>
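<p>A variant of the idea above in which the dependency itself is injected into the context manager as a factory (the <code>FakeConnection</code> test double and the <code>managed_connection</code> name are illustrative). Wrapping the <code>yield</code> in <code>try</code>/<code>finally</code> guarantees cleanup even when the body raises:</p>

```python
from contextlib import contextmanager

class FakeConnection:
    """Test double standing in for a real database connection."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# The factory is injected, so tests supply FakeConnection while
# production code supplies a real connection class.
@contextmanager
def managed_connection(connection_factory):
    connection = connection_factory()
    try:
        yield connection
    finally:
        connection.close()  # cleanup runs even if the body raises

with managed_connection(FakeConnection) as conn:
    assert not conn.closed  # open inside the context
assert conn.closed          # closed automatically on exit
```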
<p><a id="dependency-injection-containers"></a></p>
<h3>Dependency Injection Containers</h3>
<p>Dependency injection containers or frameworks provide a centralized way to manage dependencies, their configurations, and their lifetime. Popular Python DI libraries include <code>injector</code>, <code>dependency-injector</code>, and <code>inject</code>. Note that <code>injector</code> resolves dependencies from constructor type hints.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">injector</span> <span class="kn">import</span> <span class="n">inject</span><span class="p">,</span> <span class="n">Injector</span>

<span class="k">class</span> <span class="nc">UserRepository</span><span class="p">:</span>
    <span class="k">pass</span>  <span class="c1"># a concrete class the injector can construct on its own</span>

<span class="k">class</span> <span class="nc">UserService</span><span class="p">:</span>
    <span class="nd">@inject</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_repository</span><span class="p">:</span> <span class="n">UserRepository</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span> <span class="o">=</span> <span class="n">user_repository</span>

<span class="c1"># Creating and using an injector; the type hint tells it what to build</span>
<span class="n">injector</span> <span class="o">=</span> <span class="n">Injector</span><span class="p">()</span>
<span class="n">user_service</span> <span class="o">=</span> <span class="n">injector</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">UserService</span><span class="p">)</span>
</code></pre></div>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Dependency injection is a powerful technique for improving testability, code modularity, and reusability in Python. By applying various injection techniques and exploring different use cases, you can design more robust and maintainable code. Additionally, the lesser-known tricks and techniques covered in this blog post can further enhance your understanding and application of dependency injection in various scenarios.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Efficient Workflow for Reviewing Changes in Git before Pulling from Remote Branch2023-06-20T00:00:00+02:002023-06-20T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-20:/git-workflow-reviewing-changes-before-pulling-remote-branch/<h2>Introduction</h2>
<p>When working with Git, it is essential to have a streamlined workflow that ensures you <strong>review the changes made by others</strong> before pulling them into your local branch. This practice helps <strong>prevent conflicts</strong> and ensures that your local repository remains in …</p><h2>Introduction</h2>
<p>When working with Git, it is essential to have a streamlined workflow that ensures you <strong>review the changes made by others</strong> before pulling them into your local branch. This practice helps <strong>prevent conflicts</strong> and ensures that your local repository remains in sync with the remote branch. In this blog post, we will outline a few simple steps to check the changes introduced by others in the remote branch before performing a <code>git pull</code>.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#step-1-fetch-remote-changes">Step 1: Fetch Remote Changes</a></li>
<li><a href="#step-2-inspect-remote-branch">Step 2: Inspect Remote Branch</a></li>
<li><a href="#step-3-review-changes">Step 3: Review Changes</a></li>
<li><a href="#step-4-resolve-conflicts-if-any">Step 4: Resolve Conflicts (if any)</a></li>
<li><a href="#step-5-pull-changes">Step 5: Pull Changes</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="step-1-fetch-remote-changes"></a></p>
<h3>Step 1: Fetch Remote Changes</h3>
<p>Before reviewing any changes, it is crucial to fetch the latest updates from the remote repository. This step ensures that your local repository has the most up-to-date information. To fetch changes, run the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>fetch
</code></pre></div>
<p>This command retrieves all the latest changes from the remote repository without automatically merging them into your local branch.</p>
<p><a id="step-2-inspect-remote-branch"></a></p>
<h3>Step 2: Inspect Remote Branch</h3>
<p>After fetching the remote changes, you can inspect the remote branch to see the modifications made by others. This step helps you understand the nature and scope of the changes before merging them into your branch. To view only the commits that exist on the remote branch but not yet in your local branch, use the following command:</p>
<div class="highlight"><pre><span></span><code>git log ..origin/branch-name
</code></pre></div>
<p>Replace <code>branch-name</code> with the name of the remote branch you want to review. This command displays the commits you have not yet pulled, showing the commit hash, author, timestamp, and commit message. (Plain <code>git log origin/branch-name</code> would list the branch's entire history, not just the new commits.)</p>
<p><a id="step-3-review-changes"></a></p>
<h3>Step 3: Review Changes</h3>
<p>Now that you have a clear view of the commits in the remote branch, it's time to review the changes introduced. There are several ways to inspect the individual commits, depending on your preferred Git tooling. Here are a few common options:</p>
<h4>Option 1: Using Git Show</h4>
<p>To review the changes introduced by a specific commit, you can use the <code>git show</code> command. Run the following command, replacing <code>commit-hash</code> with the actual commit hash you want to inspect:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>show<span class="w"> </span>commit-hash
</code></pre></div>
<p>This command displays the commit metadata together with a detailed diff of the changes introduced by that commit, allowing you to analyze the modifications line by line. (Note that <code>git diff commit-hash</code> would instead compare that commit against your working tree.)</p>
<h4>Option 2: Utilizing Visual Git Tools</h4>
<p>If you prefer a more visual representation of changes, you can leverage Git GUI tools like GitKraken, Sourcetree, Git Cola or tig. These tools provide an intuitive interface that allows you to navigate through commits, inspect changes, and even visualize branching patterns.</p>
<h5>tig</h5>
<p><code>tig test..master</code> - Show difference between two branches <code>test</code> and <code>master</code></p>
<p><a id="step-4-resolve-conflicts-if-any"></a></p>
<h3>Step 4: Resolve Conflicts (if any)</h3>
<p>During your review, you may encounter conflicts between the changes made by others and your local modifications. Conflicts arise when Git cannot automatically merge two sets of changes. If conflicts occur, it is crucial to resolve them before pulling the changes into your branch.</p>
<p>To resolve conflicts, you can use Git's built-in merge tools or a visual Git tool like those mentioned earlier. These tools provide a side-by-side view of conflicting changes, enabling you to choose which modifications to keep and how to combine them effectively.</p>
<p><a id="step-5-pull-changes"></a></p>
<h3>Step 5: Pull Changes</h3>
<p>After reviewing the changes, ensuring there are no conflicts or addressing any conflicts that arise, you can proceed with pulling the changes from the remote branch into your local branch. To pull the changes, use the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>pull<span class="w"> </span>origin<span class="w"> </span>branch-name
</code></pre></div>
<p>Replace <code>branch-name</code> with the name of the remote branch from which you want to pull the changes. This command automatically merges the changes into your branch, keeping your local repository up to date.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>In this blog post, we discussed a streamlined workflow for reviewing changes in Git before pulling them from a remote branch. By following these steps, you can ensure that you have a clear understanding of the modifications introduced by others, address conflicts if necessary, and maintain a synchronized local repository. Adopting this workflow will help you avoid potential conflicts and keep your local branch in sync with the remote.</p>Extracting Keywords From the User Query2023-06-09T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-09:/extracting-keywords-from-the-user-query/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#rule-based-approach">Rule-Based Approach</a></li>
<li><a href="#linguistic-analysis">Linguistic Analysis</a></li>
<li><a href="#machine-learning-ml-and-statistical-methods">Machine Learning (ML) and Statistical Methods</a></li>
<li><a href="#hybrid-approaches">Hybrid Approaches</a></li>
<li><a href="#what-about-using-large-language-models">What about using (large) language models?</a></li>
<li><a href="#pros">Pros</a></li>
<li><a href="#cons">Cons</a></li>
<li><a href="#more-on-machine-learning-and-statistical-methods-for-keywords-extraction">More on Machine Learning and Statistical Methods for Keyword Extraction</a></li>
<li><a href="#exemplary-implementation">Exemplary implementation</a></li>
</ul>
<!-- /MarkdownTOC -->
<p>When it comes to extracting keywords or key terms from …</p><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#rule-based-approach">Rule-Based Approach</a></li>
<li><a href="#linguistic-analysis">Linguistic Analysis</a></li>
<li><a href="#machine-learning-ml-and-statistical-methods">Machine Learning (ML) and Statistical Methods</a></li>
<li><a href="#hybrid-approaches">Hybrid Approaches</a></li>
<li><a href="#what-about-using-large-language-models">What about using (large) language models?</a></li>
<li><a href="#pros">Pros</a></li>
<li><a href="#cons">Cons</a></li>
<li><a href="#more-on-machine-learning-and-statistical-methods-for-keywords-extraction">More on Machine Learning and Statistical Methods for Keyword Extraction</a></li>
<li><a href="#exemplary-implementation">Exemplary implementation</a></li>
</ul>
<!-- /MarkdownTOC -->
<p>When it comes to extracting keywords or key terms from a user query, there are several approaches that can be used. Each approach has its own set of pros and cons, which I will discuss below:</p>
<p><a id="rule-based-approach"></a></p>
<h2>Rule-Based Approach</h2>
<ul>
<li><strong>Pros</strong>: This approach involves defining a set of rules or patterns to identify keywords based on specific criteria. It can be effective for simple queries and known patterns, allowing for precise keyword extraction.</li>
<li><strong>Cons</strong>: Rule-based approaches can be limited in their flexibility and scalability. They require manual effort to create and maintain the rules, making them less suitable for handling complex or evolving queries. Additionally, they may not perform well when faced with ambiguous or unstructured input.</li>
</ul>
<p><a id="linguistic-analysis"></a></p>
<h2>Linguistic Analysis</h2>
<ul>
<li><strong>Pros</strong>: Linguistic analysis techniques utilize natural language processing (NLP) algorithms to analyze the grammatical structure and semantics of a query. By considering parts of speech, syntactic relationships, and semantic associations, they can extract relevant keywords effectively.</li>
<li><strong>Cons</strong>: This approach can be computationally expensive and may require substantial linguistic resources such as parsers, lexicons, and ontologies. Handling languages with complex grammar or processing highly contextual queries can be challenging. It might also struggle with ambiguous phrases or idiomatic expressions.</li>
</ul>
<p><a id="machine-learning-ml-and-statistical-methods"></a></p>
<h2>Machine Learning (ML) and Statistical Methods</h2>
<ul>
<li><strong>Pros</strong>: ML techniques, such as supervised or unsupervised learning, can automatically learn patterns and extract keywords based on training data. They can adapt to different query types and improve over time with more data. Statistical methods, such as term frequency-inverse document frequency (TF-IDF), can also identify important keywords based on their prevalence and relevance within a dataset.</li>
<li><strong>Cons</strong>: Building ML models requires labeled training data, which can be time-consuming and expensive to create. Models may struggle with rare or domain-specific queries if not adequately trained. They can also be susceptible to biases present in the training data, and their performance may degrade when faced with queries significantly different from the training distribution.</li>
</ul>
<p><a id="hybrid-approaches"></a></p>
<h2>Hybrid Approaches</h2>
<ul>
<li><strong>Pros</strong>: Hybrid approaches combine multiple techniques, leveraging the strengths of each to improve keyword extraction. For example, combining rule-based methods with ML models can enhance accuracy and handle a wider range of queries.</li>
<li><strong>Cons</strong>: Designing and implementing hybrid approaches can be complex and require expertise in multiple areas. Combining different techniques may introduce additional computational overhead, impacting performance and response time.</li>
</ul>
<p>It's important to note that the effectiveness of these approaches can vary depending on factors such as the nature of the queries, available resources, and the desired level of accuracy. A well-designed solution often involves a combination of techniques to achieve the best results.</p>
<p><a id="what-about-using-large-language-models"></a></p>
<h2>What about using (large) language models?</h2>
<p>Using language models, such as GPT-3.5, can be a powerful approach for extracting keywords or key terms from a user query. Language models are trained on vast amounts of text data and have the ability to understand and generate human-like language.</p>
<p>Here are the pros and cons of using language models for keyword extraction:</p>
<p><a id="pros"></a></p>
<h3>Pros</h3>
<ol>
<li><strong>Contextual Understanding</strong>: Language models can capture the contextual meaning of words and phrases in a query. They can consider the surrounding words and sentences to extract keywords that are most relevant to the overall query.</li>
<li><strong>Handling Ambiguity</strong>: Language models can handle ambiguous queries by considering the broader context. They can interpret the query based on available information and generate keywords that make the most sense in the given context.</li>
<li><strong>Generalization</strong>: Language models have the ability to generalize from the training data and can extract keywords effectively even for queries that are slightly different from what they have seen before.</li>
<li><strong>Continuous Learning</strong>: Language models can be fine-tuned on specific domains or datasets to improve their keyword extraction capabilities. This allows them to adapt to specific contexts and improve their accuracy over time.</li>
</ol>
<p><a id="cons"></a></p>
<h3>Cons</h3>
<ol>
<li><strong>Lack of Control</strong>: Language models generate keywords based on their learned patterns and training data, which may not always align with specific user requirements or domain-specific terminology. They may produce keywords that are technically correct but not exactly what the user intended.</li>
<li><strong>Over-reliance on Training Data</strong>: Language models heavily depend on the data they were trained on. If the training data contains biases or limitations, the model may exhibit the same biases or struggle with specific types of queries that were underrepresented in the training data.</li>
<li><strong>Computational Overhead</strong>: Language models can be computationally expensive to run, especially for real-time applications. The time required for keyword extraction using a language model might not be suitable for scenarios that demand low latency or high throughput.</li>
<li><strong>Lack of Explanation</strong>: Language models can provide keyword outputs, but they may not offer clear explanations for why certain words were selected as keywords. This lack of interpretability can make it challenging to understand the reasoning behind the chosen keywords.</li>
</ol>
<p>While language models can be effective for keyword extraction, it's important to consider these pros and cons and carefully evaluate the trade-offs before integrating them into a production system. It may be necessary to fine-tune the language model or combine it with other techniques to address specific limitations or requirements.</p>
<p><a id="more-on-machine-learning-and-statistical-methods-for-keywords-extraction"></a></p>
<h2>More on Machine Learning and Statistical Methods for Keyword Extraction</h2>
<p>There are several machine learning and statistical methods commonly used for keyword extraction from text. Here are some popular techniques:</p>
<ol>
<li>
<p><strong>Term Frequency-Inverse Document Frequency (TF-IDF)</strong>: TF-IDF is a statistical method that measures the importance of a term within a document and across a collection of documents. It calculates a weight for each term based on its frequency in the document and inversely proportional to its frequency in the entire document collection. Keywords with higher TF-IDF scores are considered more significant.</p>
</li>
<li>
<p><strong>TextRank</strong>: TextRank is an algorithm inspired by Google's PageRank algorithm for ranking web pages. It applies a graph-based ranking approach to identify important keywords in a text. In this method, the text is represented as a graph, where each word is a node, and edges represent the co-occurrence or semantic similarity between words. TextRank assigns scores to words based on their centrality in the graph, with higher scores indicating more important keywords.</p>
</li>
<li>
<p><strong>Latent Dirichlet Allocation (LDA)</strong>: LDA is a generative probabilistic model that represents a collection of documents as a mixture of topics. It assumes that each document contains a distribution of topics, and each topic is characterized by a distribution of words. LDA can be used for keyword extraction by identifying the most probable words associated with each topic. Keywords are then selected based on their relevance to the document's topics.</p>
</li>
<li>
<p><strong>Support Vector Machines (SVM)</strong>: SVM is a supervised learning algorithm that can be used for keyword extraction by treating it as a binary classification problem. Training data is labeled with keywords and non-keywords, and SVM learns a decision boundary to separate the two classes. New text can be classified using the trained SVM model, and the words contributing most to the classification decision are considered keywords.</p>
</li>
<li>
<p><strong>Neural Networks</strong>: Various neural network architectures can be employed for keyword extraction, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. These models can learn representations of words and capture complex relationships between them. They can be trained using labeled data or trained in an unsupervised manner by formulating the problem as an autoencoder or sequence-to-sequence learning.</p>
</li>
<li>
<p><strong>Rule-based methods</strong>: Rule-based approaches define a set of linguistic rules or patterns to identify keywords based on specific criteria such as part-of-speech tags, syntactic structures, or domain-specific rules. These methods can be effective when the domain or language has well-defined patterns for keywords.</p>
</li>
</ol>
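<p>To make the first method concrete, here is a minimal pure-Python TF-IDF sketch over a hypothetical toy corpus (production systems would typically use an existing implementation such as scikit-learn's <code>TfidfVectorizer</code> rather than this hand-rolled version):</p>

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens
documents = [
    "exercise improves mental health and reduces anxiety".split(),
    "a balanced diet supports physical health".split(),
    "regular exercise strengthens the heart".split(),
]

def tfidf(term, doc, corpus):
    # Term frequency: how often the term occurs in this document
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the corpus score higher
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

doc = documents[0]
scores = {term: tfidf(term, doc, documents) for term in set(doc)}
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_keywords)
```

<p>Terms such as "health" and "exercise", which appear in several documents, are down-weighted relative to terms that occur in only one document.</p>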
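<p>The rule-based approach can likewise be illustrated with a deliberately simple, hypothetical filter: keep alphabetic words above a minimum length that are not in a small stopword list. Real rule-based systems rely on part-of-speech patterns and domain-specific rules rather than a filter this crude:</p>

```python
import re

# A tiny, illustrative stopword list; real systems use much larger ones
STOPWORDS = {"what", "are", "the", "of", "for", "and", "with", "a", "an"}

def rule_based_keywords(text, min_length=4):
    # Rule 1: a keyword candidate is an alphabetic token
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # Rule 2: it must be reasonably long and not a stopword
    return [w for w in words if len(w) >= min_length and w not in STOPWORDS]

print(rule_based_keywords("What are the benefits of exercise for mental health?"))
# ['benefits', 'exercise', 'mental', 'health']
```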
<p><a id="exemplary-implementation"></a></p>
<h2>Exemplary implementation</h2>
<p>One well-established solution for keyword extraction from short texts is the TextRank algorithm, an unsupervised graph-based approach derived from PageRank that has proven effective at identifying important keywords in a text.</p>
<p>Here's a Python implementation that uses the <code>nltk</code> library for preprocessing (tokenization, part-of-speech tagging, and lemmatization) and builds the TextRank graph and scoring loop directly:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="kn">import</span> <span class="n">word_tokenize</span><span class="p">,</span> <span class="n">sent_tokenize</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="kn">from</span> <span class="nn">nltk.stem</span> <span class="kn">import</span> <span class="n">WordNetLemmatizer</span>
<span class="kn">from</span> <span class="nn">nltk.tag</span> <span class="kn">import</span> <span class="n">pos_tag</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">wordnet</span> <span class="k">as</span> <span class="n">wn</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="k">def</span> <span class="nf">preprocess_text</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1"># Tokenize the text into sentences</span>
    <span class="n">sentences</span> <span class="o">=</span> <span class="n">sent_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="c1"># Tokenize each sentence into words and perform part-of-speech tagging</span>
    <span class="n">tagged_words</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
        <span class="n">words</span> <span class="o">=</span> <span class="n">word_tokenize</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
        <span class="n">tagged_words</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">pos_tag</span><span class="p">(</span><span class="n">words</span><span class="p">))</span>
    <span class="c1"># Lemmatize the words and remove stopwords</span>
    <span class="n">lemmatizer</span> <span class="o">=</span> <span class="n">WordNetLemmatizer</span><span class="p">()</span>
    <span class="n">stop_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">stopwords</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="s1">'english'</span><span class="p">))</span>
    <span class="n">preprocessed_words</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">tagged_words</span><span class="p">:</span>
        <span class="c1"># Consider only nouns, verbs, adjectives, and adverbs</span>
        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'NN'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'VB'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'JJ'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'RB'</span><span class="p">):</span>
            <span class="c1"># Lemmatize the word</span>
            <span class="n">lemma</span> <span class="o">=</span> <span class="n">lemmatizer</span><span class="o">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">get_wordnet_pos</span><span class="p">(</span><span class="n">tag</span><span class="p">))</span>
            <span class="c1"># Convert to lowercase and remove stopwords</span>
            <span class="k">if</span> <span class="n">lemma</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stop_words</span><span class="p">:</span>
                <span class="n">preprocessed_words</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">lemma</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">preprocessed_words</span>

<span class="k">def</span> <span class="nf">get_wordnet_pos</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'N'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">NOUN</span>
    <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'V'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">VERB</span>
    <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'J'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">ADJ</span>
    <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'R'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">ADV</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">None</span>

<span class="k">def</span> <span class="nf">calculate_similarity</span><span class="p">(</span><span class="n">word1</span><span class="p">,</span> <span class="n">word2</span><span class="p">):</span>
    <span class="n">synsets1</span> <span class="o">=</span> <span class="n">wn</span><span class="o">.</span><span class="n">synsets</span><span class="p">(</span><span class="n">word1</span><span class="p">)</span>
    <span class="n">synsets2</span> <span class="o">=</span> <span class="n">wn</span><span class="o">.</span><span class="n">synsets</span><span class="p">(</span><span class="n">word2</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">synsets1</span> <span class="ow">and</span> <span class="n">synsets2</span><span class="p">:</span>
        <span class="n">max_sim</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">wn</span><span class="o">.</span><span class="n">path_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">)</span> <span class="ow">or</span> <span class="mi">0</span><span class="p">)</span> <span class="k">for</span> <span class="n">s1</span> <span class="ow">in</span> <span class="n">synsets1</span> <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="n">synsets2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">max_sim</span>
    <span class="k">return</span> <span class="mi">0</span>

<span class="k">def</span> <span class="nf">textrank_keywords</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="c1"># Preprocess the text</span>
    <span class="n">words</span> <span class="o">=</span> <span class="n">preprocess_text</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="c1"># Build the word co-occurrence graph</span>
    <span class="n">graph</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">word1</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">words</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">word2</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">words</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">j</span><span class="p">:</span>
                <span class="n">similarity</span> <span class="o">=</span> <span class="n">calculate_similarity</span><span class="p">(</span><span class="n">word1</span><span class="p">,</span> <span class="n">word2</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">similarity</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
                    <span class="n">graph</span><span class="p">[</span><span class="n">word1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">word2</span><span class="p">,</span> <span class="n">similarity</span><span class="p">))</span>
    <span class="c1"># Apply the TextRank algorithm</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
    <span class="n">damping_factor</span> <span class="o">=</span> <span class="mf">0.85</span>
    <span class="n">max_iterations</span> <span class="o">=</span> <span class="mi">100</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="n">prev_scores</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">word1</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
            <span class="c1"># .get avoids a KeyError on the first iteration, when scores are still empty</span>
            <span class="n">score</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">damping_factor</span><span class="p">)</span> <span class="o">+</span> <span class="n">damping_factor</span> <span class="o">*</span> <span class="nb">sum</span><span class="p">(</span><span class="n">prev_scores</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">word2</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">*</span> <span class="n">weight</span> <span class="k">for</span> <span class="n">word2</span><span class="p">,</span> <span class="n">weight</span> <span class="ow">in</span> <span class="n">graph</span><span class="p">[</span><span class="n">word1</span><span class="p">])</span>
            <span class="n">scores</span><span class="p">[</span><span class="n">word1</span><span class="p">]</span> <span class="o">=</span> <span class="n">score</span>
    <span class="c1"># Get the top keywords</span>
    <span class="n">top_keywords</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">scores</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">top_n</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">top_keywords</span>
<span class="c1"># Example usage</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">"What are the benefits of exercise for mental health?"</span>
<span class="n">keywords</span> <span class="o">=</span> <span class="n">textrank_keywords</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">keywords</span><span class="p">)</span>
</code></pre></div>
<p>NOTE: before running this code, you need to download a few NLTK (Natural Language Toolkit) data resources: the stopwords corpus, the averaged-perceptron POS tagger, the Punkt tokenizer, and WordNet.</p>
<p>To download the necessary data, you can use the following code snippet:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'stopwords'</span><span class="p">)</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'averaged_perceptron_tagger'</span><span class="p">)</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'punkt'</span><span class="p">)</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'wordnet'</span><span class="p">)</span>
</code></pre></div>The Role and Responsibilities of a Forward Deployed Engineer - Bridging the Gap Between Software Products and Customer Needs2023-06-09T00:00:00+02:002023-06-09T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-09:/the-role-and-responsibilities-of-a-forward-deployed-engineer/<p>Bridging the gap between software products and customer needs, Forward Deployed Engineers are the game-changers of enterprise software. Discover their unique role in driving success and why it's in high demand. Don't miss out!</p>
<h2>TL;DR</h2>
<p>A Forward Deployed Engineer (FDE) is a versatile software engineer who works closely with customers to bridge the gap between enterprise software products and their specific implementation needs. FDEs collaborate with engineering teams, provide technical support, partner with product teams, assist in revenue growth activities, and lead customer success efforts. With a mix of technical skills, an entrepreneurial mindset, and product intuition, FDEs play a crucial role in ensuring successful product deployment and customer satisfaction.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#understanding-the-role-of-fdes">Understanding the Role of FDEs</a><ul>
<li><a href="#collaboration-with-engineering">Collaboration with Engineering</a></li>
<li><a href="#partnership-with-product-teams">Partnership with Product Teams</a></li>
<li><a href="#support-for-revenue-growth">Support for Revenue Growth</a></li>
<li><a href="#leadership-in-customer-success">Leadership in Customer Success</a></li>
</ul>
</li>
<li><a href="#why-forward-deployed-engineers-are-in-high-demand">Why Are Forward Deployed Engineers in High Demand?</a><ul>
<li><a href="#technical-expertise-and-customer-focus">Technical Expertise and Customer Focus</a></li>
<li><a href="#agile-problem-solvers">Agile Problem Solvers</a></li>
<li><a href="#product-intuition">Product Intuition</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p>In the fast-paced world of enterprise software, there is an increasing demand for versatile engineers who can seamlessly integrate complex products into customers' specific implementation needs. This demand has given rise to the role of Forward Deployed Engineer (FDE). FDEs play a crucial role in ensuring successful technical integration and ongoing product deployment, acting as a bridge between the product suite and the unique requirements of each customer. This blog post will delve into the responsibilities of FDEs and shed light on why this role is in high demand.</p>
<p><a id="understanding-the-role-of-fdes"></a></p>
<h2>Understanding the Role of FDEs</h2>
<p>Forward Deployed Engineers are software engineers with broad skill sets that enable them to work closely with customers and iterate on enterprise software products. They possess technical expertise while being customer-facing, making them a valuable asset in various areas of an enterprise software organization.</p>
<p><a id="collaboration-with-engineering"></a></p>
<h3>Collaboration with Engineering</h3>
<p>Forward Deployed Engineers (FDEs) play a crucial role in fostering collaboration between engineering teams and external stakeholders. By actively contributing to internal codebases and working closely with core engineering teams, FDEs ensure that customer feedback and implementation needs are effectively communicated and addressed.</p>
<p>FDEs act as the bridge between the technical complexities of the product and the understanding of external stakeholders. They have a deep understanding of the product's architecture, functionalities, and underlying technologies. This expertise allows them to effectively communicate technical topics to non-technical stakeholders, such as customers or business executives.</p>
<p>When customers encounter challenges or require customizations to the product suite, FDEs work closely with the engineering team to find viable solutions. They provide valuable insights on the implementation needs and collaborate with engineers to identify the best approach. FDEs act as advocates for customers, ensuring that their requirements are properly understood and addressed within the product's capabilities.</p>
<p>Through this collaboration, FDEs contribute to the improvement of internal codebases. They provide feedback to engineering teams regarding areas that require enhancements or optimizations based on real-world customer experiences. This feedback loop helps create a continuous improvement process for the product, making it more robust and aligned with customer needs.</p>
<p>Furthermore, FDEs actively participate in cross-functional meetings, bringing the perspective of external stakeholders to the engineering team. This collaboration helps align engineering efforts with customer requirements and provides valuable context for decision-making.</p>
<blockquote>
<p>Collaboration with engineering is a critical aspect of the FDE role. By effectively communicating technical topics to external stakeholders and working closely with the engineering team, FDEs ensure that customer feedback is accurately relayed, implementation needs are addressed, and the product continues to evolve to meet customer expectations.</p>
</blockquote>
<p><a id="partnership-with-product-teams"></a></p>
<h3>Partnership with Product Teams</h3>
<p>Forward Deployed Engineers (FDEs) play a pivotal role in establishing a strong partnership between external stakeholders and the product teams. By leveraging their customer-facing experience and technical expertise, FDEs bring valuable insights to the table, shaping the product roadmap and driving its evolution.</p>
<p>FDEs act as the voice of the customer within the organization. They gather feedback, requirements, and feature requests directly from customers and effectively communicate these insights to the product teams. By understanding the customers' pain points, desired features, and use cases, FDEs provide invaluable information that helps shape the product's direction.</p>
<p>Throughout the engineering lifecycle, FDEs collaborate closely with the product teams to iterate on existing features and deliver new use cases. They work in tandem with product managers, developers, and designers to ensure that the product roadmap aligns with the specific needs of customers. FDEs provide real-world context and technical expertise, enabling product teams to make informed decisions regarding prioritization, feature enhancements, and trade-offs.</p>
<p>FDEs also act as a bridge between product teams and customers during the implementation phase. They facilitate ongoing communication, ensuring that the product is implemented effectively and meets customers' expectations. FDEs provide guidance on technical integration, address any gaps between the product suite and customer requirements, and offer insights on best practices for successful deployment.</p>
<p>Additionally, FDEs actively participate in testing and validation processes, providing feedback on new features and enhancements from the customer's perspective. They collaborate with product teams to conduct user acceptance testing, gather feedback, and ensure that the product meets the desired outcomes.</p>
<p>By establishing a strong partnership with product teams, FDEs contribute to the overall success of the product. Their unique position allows them to bridge the gap between customer needs and product development, ensuring that the product remains relevant, competitive, and aligned with the evolving market landscape.</p>
<blockquote>
<p>The partnership between FDEs and product teams is essential for driving innovation, customer satisfaction, and product evolution. FDEs bring customer insights, technical expertise, and a deep understanding of implementation needs to collaborate closely with product teams, influencing the product roadmap, and delivering value-driven solutions to customers.</p>
</blockquote>
<p><a id="support-for-revenue-growth"></a></p>
<h3>Support for Revenue Growth</h3>
<p>Forward Deployed Engineers (FDEs) contribute significantly to revenue growth by providing technical expertise and support in various revenue-related activities. Their role extends beyond engineering and involves actively participating in sales meetings, leading technical discussions, and completing Requests for Proposal (RFPs).</p>
<p>As technical advisors, FDEs join sales meetings with non-technical external stakeholders, such as executives or business leaders. In this capacity, they provide valuable insights into the product's capabilities, technical requirements, and implementation process. By bridging the gap between the product suite and the customers' specific needs, FDEs help potential clients understand the value proposition and make informed purchasing decisions.</p>
<p>Moreover, FDEs take the lead in technical sales calls and meetings with external technical stakeholders. They are responsible for communicating the technical aspects of the product, answering complex inquiries, and addressing any technical concerns potential customers may have. FDEs play a crucial role in building trust and confidence in the product's ability to meet the customers' requirements.</p>
<p>FDEs also contribute to revenue growth by completing RFPs. These documents are often requested by potential customers to evaluate software solutions for their specific needs. FDEs leverage their technical knowledge and customer insights to provide comprehensive and accurate responses to these RFPs. By effectively showcasing the product's capabilities and aligning them with customer requirements, FDEs play a key role in unlocking new revenue opportunities.</p>
<p>Additionally, FDEs collaborate with the sales and marketing teams to develop technical collateral, such as case studies, technical whitepapers, and solution guides. These resources help articulate the product's value proposition, highlight successful customer implementations, and provide technical details to support the sales process. FDEs actively contribute to these materials, ensuring they are accurate, relevant, and impactful.</p>
<p>By supporting revenue growth initiatives, FDEs contribute to the overall success of the organization. Their technical expertise, customer-centric mindset, and ability to effectively communicate the value of the product position them as trusted advisors and advocates for both the customers and the sales teams. FDEs help drive new business opportunities, enhance customer satisfaction, and ultimately contribute to the financial growth of the company.</p>
<blockquote>
<p>FDEs play a crucial role in supporting revenue growth by providing technical support, leading sales discussions, completing RFPs, and developing collateral. Their ability to bridge the gap between technical complexities and customer needs helps build trust, accelerate sales cycles, and unlock new revenue streams. FDEs are instrumental in driving the financial success of the organization.</p>
</blockquote>
<p><a id="leadership-in-customer-success"></a></p>
<h3>Leadership in Customer Success</h3>
<p>Forward Deployed Engineers (FDEs) take on a leadership role in ensuring customer success throughout the implementation and deployment of the product. They act as technical leads and provide critical support to customers, facilitating onboarding, and driving the adoption of new features into customers' production environments.</p>
<p>FDEs serve as the primary point of contact for customers during the implementation phase. They work closely with customer success teams to understand the customers' specific requirements and develop tailored implementation plans. FDEs leverage their technical expertise to guide customers through the integration process, ensuring a smooth and successful onboarding experience.</p>
<p>As technical leads, FDEs provide ongoing support to customers, addressing any technical issues or challenges they may encounter. They troubleshoot and resolve complex technical problems, acting as a bridge between the customers and the engineering team. FDEs leverage their deep understanding of the product to provide timely and effective solutions, ensuring that customers can fully leverage the capabilities of the software.</p>
<p>In addition to technical support, FDEs play a critical role in driving the adoption of new features and enhancements. They collaborate with customers to understand their specific use cases and provide guidance on how to best utilize the product's functionality to achieve their desired outcomes. FDEs conduct training sessions, create documentation, and offer best practices to ensure that customers can maximize the value they derive from the product.</p>
<p>FDEs also act as advocates for customers within the organization. They actively collect feedback, feature requests, and insights from customers and communicate them to the product teams. By representing the customers' voice, FDEs contribute to the continuous improvement of the product, ensuring that it evolves to meet their changing needs.</p>
<p>Building strong relationships with customers is a key aspect of the FDE role. FDEs engage in regular communication, conduct business reviews, and seek opportunities to deepen customer engagement. By understanding the customers' goals, challenges, and aspirations, FDEs can provide personalized recommendations and strategic guidance, ultimately fostering long-term customer satisfaction and loyalty.</p>
<blockquote>
<p>FDEs assume a leadership role in customer success by providing technical guidance, support, and advocacy throughout the implementation and deployment process. Their deep technical expertise, customer-centric approach, and ability to build strong relationships position them as trusted partners for customers. FDEs play a crucial role in driving customer success, ensuring that customers achieve their desired outcomes and maximizing the value they derive from the product.</p>
</blockquote>
<p><a id="why-forward-deployed-engineers-are-in-high-demand"></a></p>
<h2>Why Are Forward Deployed Engineers in High Demand?</h2>
<p>The increasing complexity of enterprise software products and the variability in customer requirements have created a significant demand for FDEs. Here are some reasons why this role is sought after:</p>
<p><a id="technical-expertise-and-customer-focus"></a></p>
<h3>Technical Expertise and Customer Focus</h3>
<p>FDEs possess a unique mix of technical skills and customer-centricity. They understand the intricacies of the product and can effectively communicate its value to both technical and non-technical stakeholders. Their ability to bridge the gap between engineering and customer needs is invaluable in ensuring successful deployments.</p>
<p><a id="agile-problem-solvers"></a></p>
<h3>Agile Problem Solvers</h3>
<p>FDEs exhibit an entrepreneurial mindset, allowing them to adapt quickly to evolving customer requirements. They are adept at identifying challenges, proposing solutions, and iterating on product features. This agility is essential in a rapidly changing technological landscape, where customers' needs evolve at a fast pace.</p>
<p><a id="product-intuition"></a></p>
<h3>Product Intuition</h3>
<p>By working closely with customers, FDEs develop a deep understanding of their pain points and aspirations. This product intuition enables them to provide valuable insights to product teams, helping shape the product roadmap and prioritize features that align with customer needs. FDEs contribute to the development of customer-centric software solutions.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Forward Deployed Engineers play a vital role in enterprise software organizations, acting as the bridge between products and customer implementations. Their broad skill set, technical expertise, entrepreneurial mindset, and product intuition make them invaluable assets in driving customer success, revenue growth, and product evolution. As enterprise software continues to evolve, the demand for FDEs will likely increase, providing software engineers with a customer-facing path that allows them to thrive in both technical and business domains.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>How to Count Tokens - Tokenization With Tiktoken.2023-06-08T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-08:/how-to-count-tokens/<p>Counting tokens is a useful task in natural language processing (NLP) that allows us to measure the length and complexity of a text. The two important use cases for counting the tokens are:</p>
<ul>
<li><strong>controlling the length of the prompt</strong> - models have limits …</li></ul><p>Counting tokens is a useful task in natural language processing (NLP) that allows us to measure the length and complexity of a text. The two important use cases for counting the tokens are:</p>
<ul>
<li><strong>controlling the length of the prompt</strong> - models have a limit on the number of input tokens, so it is good to verify that your prompt does not exceed the limit for the model</li>
<li><strong>cost awareness</strong> - when you know how many tokens you pass as input, you can estimate the cost related to the prompt.</li>
</ul>
<p>In this blog post, we will explore how to count the number of tokens in a given text using OpenAI's tokenizer, called <code>tiktoken</code>. Whether you're a seasoned Python developer or just getting started with NLP, this guide will provide you with a step-by-step process to accurately determine the token count of your text.</p>
<h3>Introduction to <code>tiktoken</code></h3>
<p>To begin with, we need to install the <code>tiktoken</code> library, which is a powerful tokenizer developed by OpenAI. It offers efficient tokenization capabilities and supports a wide range of languages. You can find the library on GitHub at <a href="https://github.com/openai/tiktoken">this link</a>.</p>
<h3>Code Example</h3>
<p>Let's dive into a code example that demonstrates how to count tokens using <code>tiktoken</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tiktoken</span>
<span class="k">def</span> <span class="nf">num_tokens_from_string</span><span class="p">(</span><span class="n">string</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">encoding_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""Returns the number of tokens in a text string."""</span>
    <span class="n">encoding</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="o">.</span><span class="n">get_encoding</span><span class="p">(</span><span class="n">encoding_name</span><span class="p">)</span>
    <span class="n">num_tokens</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">encoding</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">num_tokens</span>
<span class="n">num_tokens_from_string</span><span class="p">(</span><span class="s2">"tiktoken is great!"</span><span class="p">,</span> <span class="s2">"cl100k_base"</span><span class="p">)</span>
</code></pre></div>
<p>In the example above, we import the <code>tiktoken</code> library and define a function called <code>num_tokens_from_string</code>. This function takes a text string and an encoding name as input parameters. It returns the number of tokens in the given text string.</p>
<p>To count the tokens, we first obtain the encoding using <code>tiktoken.get_encoding(encoding_name)</code>. The <code>encoding_name</code> specifies the type of encoding we want to use. In this case, we use the <code>cl100k_base</code> encoding, which is suitable for second-generation embedding models like <code>text-embedding-ada-002</code>.</p>
<p>Next, we encode the input string using <code>encoding.encode(string)</code> and calculate the number of tokens by taking the length of the encoded sequence. The final result is the total number of tokens in the text string.</p>
<p><code>tiktoken</code> supports three encodings used by OpenAI models:</p>
<table>
<thead>
<tr>
<th>Encoding name</th>
<th>OpenAI models</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cl100k_base</code></td>
<td><code>gpt-4</code>, <code>gpt-3.5-turbo</code>, <code>text-embedding-ada-002</code></td>
</tr>
<tr>
<td><code>p50k_base</code></td>
<td>Codex models, <code>text-davinci-002</code>, <code>text-davinci-003</code></td>
</tr>
<tr>
<td><code>r50k_base</code> (or <code>gpt2</code>)</td>
<td>GPT-3 models like <code>davinci</code></td>
</tr>
</tbody>
</table>
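<p>As a rough illustration of the table above, the mapping can be captured in a small lookup helper. The dictionary and function names below are made up for this sketch, hand-written from the table; in practice <code>tiktoken.encoding_for_model()</code> resolves the encoding for you.</p>

```python
# Illustrative mapping transcribed from the table above.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for_model(model: str) -> str:
    """Return the encoding name for a model, defaulting to cl100k_base."""
    return MODEL_TO_ENCODING.get(model, "cl100k_base")


print(encoding_name_for_model("gpt-3.5-turbo"))  # cl100k_base
print(encoding_name_for_model("text-davinci-003"))  # p50k_base
```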
<h3>OpenAI Cookbook Guide</h3>
<p>For a more detailed explanation and additional examples, you can refer to the OpenAI Cookbook guide on <a href="https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb">how to count tokens with tiktoken</a>. The guide provides comprehensive instructions on token counting and offers insights into various use cases.</p>
<h3>Tokenization Sandbox</h3>
<p>If you're looking to experiment with text tokenization, OpenAI provides a convenient web application called the Tokenization Sandbox. You can access it <a href="https://platform.openai.com/tokenizer">here</a>. The sandbox allows you to input text and observe the resulting tokens, helping you better understand the tokenization process.</p>
<h3>Text splitter module</h3>
<p>A Python script for splitting text into parts with a controlled (limited) length in tokens. The script uses the <code>tiktoken</code> library for encoding and decoding text:
<a href="https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39">https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39</a></p>
<h3>Count tokens cli tool</h3>
<p>Check out this simple CLI tool that has one purpose - counting tokens in a text file:</p>
<p><a href="https://github.com/izikeros/count_tokens">izikeros/count_tokens: Count tokens in a text file.</a></p>
<h3>Rule of thumb</h3>
<p>On the <a href="https://platform.openai.com/tokenizer">website</a> hosting the tokenizer sandbox, OpenAI provides a rule of thumb that helps estimate the approximate number of tokens in a given text.</p>
<blockquote>
<p>A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).</p>
</blockquote>
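<p>The rule of thumb translates into a quick back-of-the-envelope estimator. This is only a sketch of the quoted heuristic, with helper names invented for illustration; exact counts still require <code>tiktoken</code>.</p>

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for common English text: ~4 characters per token.

    This is only the heuristic quoted above; use tiktoken for exact counts.
    """
    return max(1, round(len(text) / 4))


def estimate_tokens_from_words(n_words: int) -> int:
    """Roughly 100 tokens per 75 words, i.e. about 4/3 tokens per word."""
    return round(n_words * 100 / 75)


print(estimate_tokens("tiktoken is great!"))  # 4 (an estimate, not an exact count)
print(estimate_tokens_from_words(75))  # 100
```

For short or unusual strings the heuristic can be noticeably off, which is why it is only useful for quick sanity checks on prompt size.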
<h3>References</h3>
<p>To develop this guide, we drew inspiration from the token counting instructions provided by OpenAI. You can find additional information in the <a href="https://platform.openai.com/docs/guides/embeddings/limitations-risks">OpenAI documentation</a>, where they discuss the limitations and risks associated with embeddings.</p>
<p>Token counting is essential when working with NLP, enabling us to analyze and process text effectively. By leveraging OpenAI's <code>tiktoken</code> library and following the guidelines outlined in this blog post, you'll be well-equipped to count tokens accurately and efficiently.</p>
<p>See also: <a href="https://omarkama.li/blog/tokens-the-secret-language-of-ai">Tokens, the secret language of AI | Omar Kamali</a></p>The Best Vector Databases for Storing Embeddings2023-06-05T00:00:00+02:002023-06-05T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-05:/the-best-vector-databases-for-storing-embeddings/<p>Delve into the World of Vector Databases Fueling NLP's Transformative Journey.</p><h2>Best Vector Databases for Storing Embeddings in NLP</h2>
<p>As natural language processing (NLP) continues to advance, the need for efficient storage and retrieval of vector representations, or embeddings, has become paramount.</p>
<blockquote>
<p>Vector databases are purpose-built databases that excel in storing and querying high-dimensional vector data, such as word embeddings or document representations.</p>
</blockquote>
<p>This article explores the best vector databases available, their unique features, and the crucial parameters that differentiate them.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#tldr">TLDR</a></li>
<li><a href="#what-vector-databases-are-and-why-there-is-demand-for-them">What are vector databases, and why is there demand for them?</a></li>
<li><a href="#understanding-tradeoffs-and-identifying-the-specific-requirements-to-choose-the-best-tool">Understanding tradeoffs and identifying the specific requirements to choose the best tool</a></li>
<li><a href="#vector-databases">Vector databases</a><ul>
<li><a href="#chroma">Chroma</a></li>
<li><a href="#haystack-by-deepsetai">Haystack by DeepsetAI</a></li>
<li><a href="#faiss-by-facebook">Faiss by Facebook</a></li>
<li><a href="#milvus">Milvus</a></li>
<li><a href="#pgvector">pgvector</a></li>
<li><a href="#pinecone">Pinecone</a></li>
<li><a href="#supabase">Supabase</a></li>
<li><a href="#qdrant">Qdrant</a></li>
<li><a href="#vespa">Vespa</a></li>
<li><a href="#weaviate">Weaviate</a></li>
<li><a href="#deeplake">DeepLake</a></li>
<li><a href="#vectorstore-from-langchain">VectorStore from LangChain</a></li>
<li><a href="#other-relevant-vector-databases">Other Relevant Vector Databases</a></li>
</ul>
</li>
<li><a href="#tabular-summary-of-the-features">Tabular summary of the features</a></li>
<li><a href="#recommendations">Recommendations</a><ul>
<li><a href="#easy-start-and-user-friendliness---good-for-poc">Easy Start and User-Friendliness - good for PoC</a></li>
<li><a href="#advanced-capabilities-and-performance">Advanced Capabilities and Performance</a></li>
<li><a href="#customization-and-advanced-use-cases">Customization and Advanced Use Cases</a></li>
</ul>
</li>
<li><a href="#related-reading">Related reading</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="tldr"></a></p>
<h2>TLDR</h2>
<p>If you don't want to spend time reading about each solution, you may want to head directly to the <a href="#recommendations">recommendations</a> section, where solutions for various use cases are proposed.
<a id="what-vector-databases-are-and-why-there-is-demand-for-them"></a></p>
<h2>What are vector databases, and why is there demand for them?</h2>
<p>Vector databases are specialized databases designed for efficient storage, retrieval, and manipulation of vector representations, particularly in the context of Natural Language Processing (NLP) and machine learning applications. They are optimized for handling high-dimensional embeddings that represent textual or numerical data in a vectorized format.</p>
<p>While traditional databases like PostgreSQL are versatile and battle-tested, they are not specifically optimized for vector operations. Vector databases, on the other hand, provide a set of features and optimizations tailored to the unique requirements of working with vector embeddings. Here are some reasons why vector databases are in demand despite the existence of other types of databases:</p>
<ol>
<li>
<p><strong>Scalability</strong>: Vector databases are built to handle large-scale datasets and can scale horizontally to accommodate growing data volumes. They distribute the storage and processing of vectors across multiple machines, enabling efficient handling of massive amounts of embedding data.</p>
</li>
<li>
<p><strong>Query Speed</strong>: Vector databases employ advanced indexing structures and search algorithms, such as approximate nearest neighbor (ANN) search, to achieve fast and accurate similarity searches. These optimizations enable rapid retrieval of vectors based on their similarity to a given query vector.</p>
</li>
<li>
<p><strong>Accuracy of Search Results</strong>: Vector databases focus on preserving the accuracy of similarity search results. They leverage techniques like space partitioning, dimensionality reduction, and quantization to ensure that similar vectors are efficiently identified, even in high-dimensional spaces.</p>
</li>
<li>
<p><strong>Flexibility</strong>: Vector databases offer flexibility in terms of supported vector operations and indexing methods. They often provide a range of indexing algorithms, allowing users to choose the one that best suits their specific use case. Additionally, vector databases may support additional functionality like filtering, ranking, and semantic search.</p>
</li>
<li>
<p><strong>Data Persistence and Durability</strong>: Vector databases prioritize data persistence and durability, ensuring that vector embeddings are reliably stored and protected against data loss. They often integrate with existing storage solutions or provide mechanisms for backup and replication.</p>
</li>
<li>
<p><strong>Storage Location</strong>: Vector databases can be deployed either on-premises or in the cloud, providing flexibility in terms of infrastructure choices. Cloud-based vector databases offer the advantage of managed services, offloading the operational overhead of maintaining and scaling the database infrastructure.</p>
</li>
<li>
<p><strong>Direct Library vs. Abstraction</strong>: Vector databases come in two main forms: those that offer a direct library interface for integration into existing systems and those that provide a higher-level abstraction, such as RESTful APIs or query languages. This flexibility allows developers to choose the level of control and integration that best fits their requirements.</p>
</li>
</ol>
<p>While traditional databases like PostgreSQL can handle various data types, including vectors, they may lack the specialized optimizations and features provided by vector databases. Vector databases excel in efficiently storing and querying high-dimensional embeddings, enabling faster similarity search and supporting specific vector-related operations. By leveraging these optimizations, vector databases streamline the development and deployment of NLP and machine learning applications.</p>
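<p>To make concrete what these databases optimize, here is a naive exact nearest-neighbor search in plain Python. It is a toy illustration, not any particular database's implementation: it scores the query against every stored vector, which is the linear-scan baseline that ANN indexes like HNSW and IVF are built to avoid.</p>

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def exact_search(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Brute-force exact top-k search: O(n * d) per query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(exact_search([1.0, 0.05], vectors, k=2))  # [0, 1]
```

Every query touches every vector, so with millions of embeddings this approach becomes the bottleneck; the indexing structures described above exist precisely to avoid this full scan.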
<p><a id="understanding-tradeoffs-and-identifying-the-specific-requirements-to-choose-the-best-tool"></a></p>
<h2>Understanding tradeoffs and identifying the specific requirements to choose the best tool</h2>
<p>When choosing a vector database, there are several tradeoffs and potentially contradicting requirements that developers need to consider. Here are some typical tradeoffs and contradictions related to selecting a vector database:</p>
<ol>
<li>
<p><strong>Scalability vs. Query Speed</strong>: Achieving high scalability often requires distributing data across multiple nodes, which can impact query speed due to network communication. Balancing the need for scalability with the requirement for fast query response times can be a tradeoff when selecting a vector database.</p>
</li>
<li>
<p><strong>Search Accuracy vs. Query Speed</strong>: Algorithms that provide high search accuracy, such as exact nearest neighbor search, can be computationally expensive and impact query speed. Approximate algorithms, while faster, might sacrifice some accuracy. The tradeoff lies in finding the right balance between search accuracy and query speed based on the specific use case.</p>
</li>
<li>
<p><strong>Flexibility vs. Performance</strong>: Some vector databases offer extensive customization options, allowing users to tailor the system to their specific requirements. However, the more flexibility provided, the more overhead might be introduced, potentially impacting overall performance. Balancing the need for flexibility with performance considerations is crucial.</p>
</li>
<li>
<p><strong>Data Persistence and Durability vs. Query Performance</strong>: Ensuring data persistence and durability typically involves additional disk I/O operations, which can impact query performance. The tradeoff here is finding the right level of data persistence and durability while maintaining satisfactory query performance.</p>
</li>
<li>
<p><strong>Storage Location vs. Data Security</strong>: Storing vector embeddings locally provides faster access, but it may introduce data security risks. Cloud-based storage solutions offer scalability and redundancy but may raise concerns about data privacy and compliance. The choice between local and cloud storage involves weighing the benefits of each option against data security requirements.</p>
</li>
<li>
<p><strong>Direct Library vs. Abstraction</strong>: Some vector databases offer direct library interfaces for seamless integration into existing systems, while others provide higher-level abstractions like APIs or query languages for ease of use. The tradeoff here is between the level of control and integration required versus the simplicity of implementation and maintenance.</p>
</li>
<li>
<p><strong>Ease of Use vs. Advanced Features</strong>: Vector databases that prioritize ease of use often sacrifice some advanced features and optimization techniques. Developers must consider the complexity of their use case and weigh the need for advanced features against the simplicity of the database.</p>
</li>
</ol>
<p>Understanding these tradeoffs and identifying the specific requirements of a project is crucial in selecting a vector database that best aligns with the desired tradeoff priorities. It requires carefully evaluating the tradeoffs and making informed decisions based on the unique needs of the application or system being developed.</p>
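<p>The accuracy-versus-speed tradeoff can be felt in a contrived toy experiment: an "approximate" search that probes only a random 20% of the vectors does roughly 20% of the work, but finds the true nearest neighbor only a fraction of the time. Real ANN indexes choose which vectors to probe far more cleverly, which is exactly what the databases below compete on; this sketch is not representative of their actual recall.</p>

```python
import random

random.seed(0)


def l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_nn(query, vectors):
    """Exact nearest neighbor: scan all vectors."""
    return min(range(len(vectors)), key=lambda i: l2(query, vectors[i]))


def approx_nn(query, vectors, fraction=0.2):
    """Toy 'approximate' search: probe only a random subset of the data."""
    n_probe = max(1, int(len(vectors) * fraction))
    sample = random.sample(range(len(vectors)), n_probe)
    return min(sample, key=lambda i: l2(query, vectors[i]))


vectors = [[random.random() for _ in range(8)] for _ in range(500)]
queries = [[random.random() for _ in range(8)] for _ in range(100)]
recall = sum(exact_nn(q, vectors) == approx_nn(q, vectors) for q in queries) / len(queries)
print(f"recall@1 with 20% of the work: {recall:.2f}")  # roughly 0.2 in expectation
```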
<p><a id="vector-databases"></a></p>
<h2>Vector databases</h2>
<p><a id="chroma"></a></p>
<h3>Chroma</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/chroma-core/chroma?logo=github">
<img alt="chroma logo" src="https://user-images.githubusercontent.com/891664/227103090-6624bf7d-9524-4e05-9d2c-c28d5d451481.png">
<a href="https://www.trychroma.com/">Chroma</a> is an open-source vector database developed by Chroma.ai. It focuses on scalability, providing robust support for storing and querying large-scale embedding datasets efficiently. Chroma offers a distributed architecture with horizontal scalability, enabling it to handle massive volumes of vector data. It leverages Apache Cassandra for high availability and fault tolerance, ensuring data persistence and durability.</p>
<p>One unique aspect of Chroma is its <strong>flexible indexing system</strong>. It supports <strong>multiple indexing strategies</strong>, such as <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods">approximate nearest neighbors</a> (ANN) algorithms like <a href="https://arxiv.org/abs/1603.09320">HNSW</a> and <a href="https://towardsdatascience.com/similarity-search-with-ivfpq-9c6348fd4db3">IVFPQ</a>, enabling fast and accurate similarity searches. Chroma also provides comprehensive <strong>Python and RESTful APIs</strong>, making it <strong>easily integratable</strong> into NLP pipelines. With its emphasis on <strong>scalability</strong> and <strong>speed</strong>, Chroma is an excellent choice for applications that require high-performance vector storage and retrieval.</p>
<p>They have <a href="https://colab.research.google.com/drive/1QEzFyqnoFxq7LUGyP1vzR4iLt9PpCDXv?usp=sharing">Colab</a> notebook with the demo.</p>
<p>The core API commands (from the product page)</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">chromadb</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">chromadb</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">create_collection</span><span class="p">(</span><span class="s2">"test"</span><span class="p">)</span>
<span class="c1"># add embeddings and documents</span>
<span class="n">c</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
<span class="c1"># get back similar ones</span>
<span class="n">c</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
</code></pre></div>
<p>Note: there are plugins for LangChain, LlamaIndex, OpenAI and others.
<a id="haystack-by-deepsetai"></a></p>
<h3>Haystack by DeepsetAI</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/deepset-ai/haystack?logo=github"></p>
<p><img alt="haystack logo" src="/images/vectordb/haystack.png">
DeepsetAI's <a href="https://haystack.deepset.ai/">Haystack</a> is another popular vector database designed specifically for NLP applications. It offers a range of features tailored to support end-to-end development of search systems using embeddings. Haystack integrates well with popular transformer models like BERT, allowing users to extract embeddings directly from pre-trained models. It leverages <a href="https://www.elastic.co/what-is/elasticsearch">Elasticsearch</a> as its underlying storage engine, providing powerful indexing and querying capabilities.</p>
<p>Haystack stands out with its <strong>intuitive query language</strong>, which supports complex <strong>semantic searches</strong> and <strong>filtering</strong> based on various parameters. Additionally, it offers a <strong>modular pipeline</strong> architecture for preprocessing, <strong>embedding extraction</strong>, and querying, making it <strong>highly customizable and adaptable</strong> to different NLP use cases. With its <strong>user-friendly interface</strong> and comprehensive functionality, DeepsetAI's Haystack is an excellent choice for developers seeking a flexible and feature-rich vector database for NLP.</p>
<p><a id="faiss-by-facebook"></a></p>
<h3>Faiss by Facebook</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/facebookresearch/faiss?logo=github"></p>
<p><a href="https://faiss.ai/">Faiss</a>, developed by Facebook AI Research, is a widely used vector database renowned for its high-performance similarity search capabilities. It provides a range of indexing methods optimized for efficient retrieval of nearest neighbors, including IVF (Inverted File) and HNSW (Hierarchical Navigable Small World). Faiss also supports GPU acceleration, enabling fast computation on large-scale embeddings.</p>
<p>One of Faiss' notable features is its support for <strong>multi-index search</strong>, which combines different indexing methods to improve search accuracy and speed. Additionally, Faiss offers a <strong>Python interface</strong>, making it easy to integrate with existing NLP pipelines and frameworks. With its focus on <strong>search performance and versatility</strong>, Faiss is a go-to choice for projects demanding fast and accurate similarity <strong>search over vast embedding collections</strong>.</p>
<p><a id="milvus"></a></p>
<h3>Milvus</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/milvus-io/milvus?logo=github"></p>
<p><img alt="Milvus logo" src="/images/vectordb/milvus.png"></p>
<p><a href="https://milvus.io/">Milvus</a> is an open-source vector database developed by Zilliz, designed for efficient storage and retrieval of large-scale embeddings. It provides high scalability and supports distributed deployment across multiple machines, making it suitable for handling massive NLP datasets. Milvus integrates with popular ANN libraries like Faiss, Annoy, and NMSLIB, offering flexible indexing options to achieve high search accuracy.</p>
<p>One key feature of Milvus is its <strong>GPU support</strong>, leveraging NVIDIA GPUs for accelerated computation. This makes Milvus an excellent choice <strong>for deep learning applications</strong> that require fast vector search and similarity calculations. Furthermore, Milvus provides a user-friendly <strong>WebUI</strong> and supports <strong>multiple programming languages</strong>, simplifying development and deployment processes. With its focus on scalability and GPU acceleration, Milvus is an ideal vector database for large-scale NLP projects.</p>
<p><a id="pgvector"></a></p>
<h3>pgvector</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/ankane/pgvector?logo=github"></p>
<p>Open-source vector similarity search for Postgres. pgvector lets you build a vector database on top of PostgreSQL, a popular open-source relational database. It leverages the powerful indexing capabilities of PostgreSQL's extension system to provide efficient storage and retrieval of vector embeddings, supporting both exact and approximate nearest neighbor search.</p>
<p>One key advantage of pgvector is its seamless <strong>integration with the broader PostgreSQL</strong> ecosystem. Users can leverage the rich functionality of PostgreSQL, such as ACID compliance and support for complex queries, while benefiting from vector-specific operations. pgvector provides a PostgreSQL extension that extends the SQL syntax to handle vector operations and offers a Python library for easy integration. With its compatibility with PostgreSQL and efficient vector storage, pgvector is a reliable choice for NLP applications that require a seamless SQL integration.</p>
<p><a id="pinecone"></a></p>
<h3>Pinecone</h3>
<p><img alt="Pinecone logo" src="/images/vectordb/pinecone.png"></p>
<p><a href="https://www.pinecone.io/">Pinecone</a> is a managed vector database built for handling large-scale embeddings in real-time applications. It focuses on low-latency search and high-throughput indexing, making it suitable for latency-sensitive NLP use cases. Pinecone's cloud-native infrastructure handles indexing, storage, and query serving, allowing developers to focus on building their applications.</p>
<p>Pinecone offers a RESTful <strong>API</strong> and client libraries <strong>for various programming languages</strong>, simplifying integration with different NLP frameworks. It supports <strong>dynamic indexing</strong>, allowing incremental updates to embeddings without rebuilding the entire index. Pinecone also provides advanced features like <strong>vector similarity search</strong>, <strong>filtering</strong>, and result ranking. With its <strong>emphasis on real-time performance</strong> and ease of use, Pinecone is an excellent choice for developers seeking a fully managed vector database for NLP applications.</p>
<p><a id="supabase"></a></p>
<h3>Supabase</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/supabase/supabase?logo=github"></p>
<p><img alt="Supabase logo" src="/images/vectordb/supabase.png"></p>
<p><a href="https://supabase.com/">Supabase</a>, known for its open-source data platform, offers a scalable vector storage solution designed for fast and efficient retrieval of embeddings. Supabase leverages PostgreSQL as its underlying storage engine, ensuring data durability and compatibility with standard SQL queries. It provides a range of features such as indexing, querying, and filtering, optimized for vector data.</p>
<p>One distinctive aspect of Supabase is its <strong>real-time capabilities</strong>, enabled by its integration with PostgREST and PostgreSQL's logical decoding feature. This allows developers to build real-time applications that can react to changes in vector data. Supabase also provides a user-friendly <strong>interface</strong> and <strong>client libraries</strong> for <strong>various programming languages</strong>, making it accessible to developers with different skill sets. With its combination of vector storage and real-time capabilities, Supabase is an excellent choice for NLP projects that require both scalability and real-time updates.</p>
<p><a id="qdrant"></a></p>
<h3>Qdrant</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/qdrant/qdrant?logo=github"></p>
<p><img alt="Qdrant logo" src="/images/vectordb/qdrant.png"></p>
<p>Qdrant is an open-source vector database designed for similarity search and efficient storage of high-dimensional embeddings. It leverages an approximate nearest neighbor (ANN) algorithm based on Hierarchical Navigable Small World (HNSW) graphs, enabling fast and accurate similarity searches. Qdrant supports both CPU and GPU inference, allowing users to leverage hardware acceleration for faster computations.</p>
<p>One notable feature of Qdrant is its <strong>RESTful API</strong>, which provides a user-friendly <strong>interface for indexing, searching, and managing vector data</strong>. Qdrant also offers <strong>flexible query options</strong>, allowing users to specify search parameters and control the trade-off between accuracy and speed. With its focus on efficient similarity search and user-friendly API, Qdrant is a powerful vector database for various NLP applications.</p>
<p><a id="vespa"></a></p>
<h3>Vespa</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/vespa-engine/vespa?logo=github"></p>
<p><img alt="vespa logo" src="https://vespa.ai/assets/vespa-logo.png"></p>
<p><a href="https://vespa.ai/">Vespa</a> is an open-source big data processing and serving engine developed by Verizon Media. It provides a distributed, scalable, and high-performance infrastructure for storing and querying vector embeddings. Vespa utilizes an inverted index structure combined with approximate nearest neighbor (ANN) search algorithms for efficient and accurate similarity searches.</p>
<p>One of Vespa's key features is its <strong>built-in ranking framework</strong>, allowing developers to define custom ranking models and apply <strong>complex ranking algorithms to search results</strong>. Vespa also supports <strong>real-time updates</strong>, making it suitable for <strong>dynamic embedding datasets</strong>. Additionally, Vespa provides a <strong>query language</strong> and a user-friendly <strong>WebUI</strong> for managing and monitoring the vector database. With its focus on <strong>distributed processing</strong> and advanced ranking capabilities, Vespa is a powerful tool for NLP applications that require complex ranking models and real-time updates.</p>
<p><a id="weaviate"></a></p>
<h3>Weaviate</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/semi-technologies/weaviate?logo=github"></p>
<p><img alt="Weaviate logo" src="/images/vectordb/weaviate.png"></p>
<p><a href="https://weaviate.io/">Weaviate</a> is an open-source knowledge graph and vector search engine that excels in handling high-dimensional embeddings. It combines the power of graph databases and vector search to provide efficient storage, retrieval, and exploration of vector data. Weaviate offers powerful indexing methods, including approximate nearest neighbor (ANN) algorithms like HNSW, for fast and accurate similarity searches.</p>
<p>One unique aspect of Weaviate is its <strong>focus on semantics and contextual relationships</strong>. It allows users to define <strong>custom schema and relationships between entities</strong>, enabling <strong>complex queries that go beyond simple vector similarity</strong>. Weaviate also provides a <strong>RESTful API</strong>, client libraries, and a user-friendly <strong>WebUI</strong> for easy integration and management. With its combination of <strong>graph database features</strong> and vector search capabilities, Weaviate is an excellent choice <strong>for NLP applications that require semantic understanding and exploration of embeddings</strong>.</p>
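<p>To make this concrete, the sketch below assembles a GraphQL <code>nearVector</code> query of the kind Weaviate's API accepts. The class name <code>Article</code> and the property <code>title</code> are hypothetical schema elements chosen purely for illustration.</p>

```python
def build_near_vector_query(class_name, properties, vector, limit=3):
    """Build a Weaviate GraphQL `nearVector` query string.

    `class_name` and `properties` are placeholders for whatever schema
    your Weaviate instance actually defines.
    """
    vec = ", ".join(str(x) for x in vector)
    props = "\n        ".join(properties)
    return f"""
{{
  Get {{
    {class_name}(nearVector: {{vector: [{vec}]}}, limit: {limit}) {{
        {props}
        _additional {{ distance }}
    }}
  }}
}}"""

query = build_near_vector_query("Article", ["title"], [0.12, 0.34, 0.56])
print(query)
# POST {"query": query} to http://<weaviate-host>/v1/graphql
```

<p>Because queries address schema classes and their properties rather than raw vectors, the same request can mix vector similarity with structured filters, which is the "beyond simple vector similarity" capability described above.</p>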
<p><a id="deeplake"></a></p>
<h3>DeepLake</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/activeloopai/deeplake?logo=github"></p>
<p><img alt="DeepLake logo" src="https://camo.githubusercontent.com/d0c805affb06f5ea9ba767de06b77a04de54a7ef433fad08b2729d5e6b11112c/68747470733a2f2f692e706f7374696d672e63632f72736a63576333532f646565706c616b652d6c6f676f2e706e67">
<a href="https://www.activeloop.ai/">DeepLake</a> is an open-source vector database designed for efficient storage and retrieval of embeddings. It focuses on scalability and speed, making it suitable for handling large-scale NLP datasets. DeepLake provides a distributed architecture with built-in support for horizontal scalability, allowing users to handle massive volumes of vector data.</p>
<p>One unique feature of DeepLake is its support for <strong>distributed vector indexing and querying</strong>. It leverages an <strong>ANN</strong> algorithm based on the Product Quantization (PQ) method, enabling fast and accurate similarity searches. DeepLake also provides a <strong>RESTful API</strong> for easy integration with NLP pipelines and frameworks. With its emphasis on <strong>scalability and distributed processing</strong>, DeepLake is a robust vector database for demanding NLP applications.</p>
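<p>To illustrate the idea behind Product Quantization itself (a generic sketch, not DeepLake's internals), the code below trains a tiny k-means codebook per subspace, encodes each vector as a short tuple of centroid indices, and answers queries with asymmetric distance lookups:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def train_pq(data, n_subspaces=4, n_centroids=16, iters=10):
    """Train one small k-means codebook per subspace (plain Lloyd's)."""
    sub = data.shape[1] // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        chunk = data[:, s * sub:(s + 1) * sub]
        # initialize centroids from random data points
        cent = chunk[rng.choice(len(chunk), n_centroids, replace=False)]
        for _ in range(iters):
            dist = ((chunk[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(n_centroids):
                pts = chunk[assign == k]
                if len(pts):
                    cent[k] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def encode(data, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    sub = data.shape[1] // len(codebooks)
    codes = []
    for s, cent in enumerate(codebooks):
        chunk = data[:, s * sub:(s + 1) * sub]
        dist = ((chunk[:, None, :] - cent[None]) ** 2).sum(-1)
        codes.append(dist.argmin(1))
    return np.stack(codes, axis=1)  # shape: (n_vectors, n_subspaces)

def search(query, codes, codebooks, top_k=3):
    """Asymmetric distance: query stays exact, database points are quantized."""
    sub = len(query) // len(codebooks)
    # precompute the distance from each query subvector to every centroid
    tables = [((query[s * sub:(s + 1) * sub] - cent) ** 2).sum(1)
              for s, cent in enumerate(codebooks)]
    approx = sum(tables[s][codes[:, s]] for s in range(len(codebooks)))
    return np.argsort(approx)[:top_k]

data = rng.normal(size=(200, 16)).astype(np.float32)
books = train_pq(data)
codes = encode(data, books)
print(search(data[0], codes, books))  # data[0] itself should rank near the top
```

<p>The memory win is the point: each 16-dimensional float vector collapses to four one-byte codes, while the distance tables keep query time linear in the number of vectors with only cheap table lookups.</p>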
<p><a id="vectorstore-from-langchain"></a></p>
<h3>VectorStore from LangChain</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/hwchase17/langchain?logo=github"></p>
<p>LangChain <a href="https://docs.langchain.com/docs/components/indexing/vectorstore">VectorStore</a> is an open-source vector database optimized for multilingual NLP applications. It focuses on efficient storage and retrieval of embeddings across multiple languages. VectorStore supports various indexing methods, including approximate nearest neighbor (ANN) algorithms like HNSW and Annoy, for fast similarity searches.</p>
<p>One distinguishing feature of VectorStore is its <strong>language-specific indexing</strong> and <strong>retrieval capabilities</strong>. It provides <strong>language-specific tokenization</strong> and <strong>indexing strategies</strong> to <strong>optimize search accuracy for different languages</strong>. VectorStore also offers a <strong>RESTful API</strong> and client libraries for easy integration with NLP pipelines. With its multilingual support and language-specific indexing, VectorStore is an excellent choice for projects that deal with embeddings across multiple languages.</p>
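<p>Conceptually, a vector store boils down to a small contract: add texts (embedding them on the way in) and return the most similar texts for a query. The toy sketch below mirrors that shape with a deliberately naive character-frequency "embedding"; it is not LangChain's implementation, just an illustration of the interface real backends fulfil:</p>

```python
import math

class TinyVectorStore:
    """Minimal in-memory sketch of a vector-store interface
    (`add_texts` / `similarity_search`). Real implementations
    delegate to FAISS, Chroma, Pinecone, and similar backends."""

    def __init__(self, embed):
        self.embed = embed          # callable: str -> list[float]
        self.texts, self.vectors = [], []

    def add_texts(self, texts):
        for t in texts:
            self.texts.append(t)
            self.vectors.append(self.embed(t))

    def similarity_search(self, query, k=2):
        q = self.embed(query)
        # rank stored texts by cosine similarity to the query vector
        ranked = sorted(zip(self.texts, self.vectors),
                        key=lambda tv: -self._cos(q, tv[1]))
        return [t for t, _ in ranked[:k]]

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

# toy character-frequency "embedding" -- a stand-in for a real model
def char_embed(text):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

store = TinyVectorStore(char_embed)
store.add_texts(["vector databases store embeddings",
                 "kanban boards visualize work",
                 "embeddings capture semantic meaning"])
print(store.similarity_search("storing embeddings", k=1))
```

<p>Swapping <code>char_embed</code> for a real embedding model and the list scan for an ANN index is exactly the substitution an abstraction layer like this makes painless.</p>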
<p><a id="other-relevant-vector-databases"></a></p>
<h3>Other Relevant Vector Databases</h3>
<p>While the above tools represent some of the best vector databases available for storing embeddings in NLP, there are other notable options worth exploring:</p>
<h4>Annoy</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/spotify/annoy?logo=github"></p>
<p>Annoy is a lightweight C++ library for approximate nearest neighbor (ANN) search, offering efficient indexing and querying of high-dimensional embeddings.</p>
<h4>Elasticsearch</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/elastic/elasticsearch?logo=github"></p>
<p>Elasticsearch is a popular distributed search and analytics engine that can be used to store and retrieve vector embeddings efficiently.</p>
<h4>Hnswlib</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/nmslib/hnswlib?logo=github"></p>
<p>Hnswlib is a C++ library for efficient approximate nearest neighbor (ANN) search, providing high-performance indexing and retrieval of embeddings.</p>
<h4>NMSLIB</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/nmslib/nmslib?logo=github"></p>
<p>NMSLIB is an open-source library for similarity search, offering a range of indexing methods and data structures for efficient storage and retrieval of embeddings.</p>
<p>These vector databases provide additional options and features that may suit specific requirements or preferences. Exploring these alternatives can help developers find the best fit for their NLP projects.</p>
<p>To explore more, often lesser-known, libraries, you can use GitHub's topic search: <a href="https://github.com/topics/vector-database">vector-database · GitHub Topics · GitHub</a></p>
<p><a id="tabular-summary-of-the-features"></a></p>
<h2>Tabular summary of the features</h2>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Scalability</th>
<th>Query Speed</th>
<th>Search Accuracy</th>
<th>Flexibility</th>
<th>Persistence</th>
<th>Storage Location</th>
</tr>
</thead>
<tbody>
<tr>
<td>Chroma</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>DeepsetAI</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Faiss</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Medium</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Milvus</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>pgvector</td>
<td>Medium</td>
<td>Medium</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local</td>
</tr>
<tr>
<td>Pinecone</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Cloud</td>
</tr>
<tr>
<td>Supabase</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Cloud</td>
</tr>
<tr>
<td>Qdrant</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Vespa</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Weaviate</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>DeepLake</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>LangChain VectorStore</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Annoy</td>
<td>Medium</td>
<td>Medium</td>
<td>Medium</td>
<td>Medium</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Elasticsearch</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Hnswlib</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>NMSLIB</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
</tbody>
</table>
<p><a id="recommendations"></a></p>
<h2>Recommendations</h2>
<p>Below you will find recommendations for three groups of use cases.
<a id="easy-start-and-user-friendliness---good-for-poc"></a></p>
<h3>Easy Start and User-Friendliness - good for PoC</h3>
<p>In this group, the focus is on vector databases that are easy to start with and user-friendly, even if they may sacrifice some advanced capabilities or performance.</p>
<ol>
<li>
<p><strong>Chroma</strong>: Chroma is an excellent choice for this group due to its simplicity and ease of use. It provides a straightforward API and offers out-of-the-box functionality for vector storage and retrieval. While it may not have the same level of scalability or advanced search algorithms as some other tools, it is ideal for small to medium-sized projects or beginners who want to quickly get started with vector databases.</p>
</li>
<li>
<p><strong>DeepsetAI</strong>: DeepsetAI is another tool that prioritizes user-friendliness without compromising on essential functionalities. It offers a user-friendly interface, powerful search capabilities, and easy integration into existing NLP workflows. DeepsetAI is well-suited for developers who want a simple and efficient solution for storing and querying vector embeddings.</p>
</li>
</ol>
<p><a id="advanced-capabilities-and-performance"></a></p>
<h3>Advanced Capabilities and Performance</h3>
<p>In this group, we consider vector databases that provide advanced capabilities and high-performance, catering to more demanding use cases.</p>
<ol>
<li>
<p><strong>Faiss</strong>: Faiss is a widely used and highly performant vector database that specializes in efficient similarity search. It offers a range of indexing structures and search algorithms, making it suitable for large-scale projects that require fast and accurate retrieval of embeddings. Faiss is an optimal choice when performance and search accuracy are critical.</p>
</li>
<li>
<p><strong>Milvus</strong>: Milvus is another powerful vector database known for its scalability and performance. It provides distributed storage and indexing, allowing for efficient handling of large-scale embedding datasets. Milvus supports various indexing algorithms, including approximate nearest neighbor (ANN) search, enabling fast similarity search. It is a robust solution for projects that demand scalability, high-performance, and flexibility.</p>
</li>
</ol>
<p><a id="customization-and-advanced-use-cases"></a></p>
<h3>Customization and Advanced Use Cases</h3>
<p>In this group, we consider vector databases that offer extensive customization options and cater to advanced use cases with specific requirements.</p>
<ol>
<li>
<p><strong>Pinecone</strong>: Pinecone is a vector database that excels in providing real-time search capabilities and high scalability. It offers advanced features such as dynamic indexing, custom similarity functions, and efficient updates, making it ideal for applications that require real-time embeddings and constant model refinement.</p>
</li>
<li>
<p><strong>Supabase</strong>: Supabase is an open-source database platform that provides a wide range of features, including support for vector storage and retrieval. With its flexibility and customizability, Supabase is suitable for projects that require not only vector database functionality but also the benefits of a comprehensive database platform.</p>
</li>
</ol>
<p>By considering the diverse requirements of each group, we have recommended vector databases that prioritize ease of use, advanced capabilities, and customization. These recommendations aim to assist developers in selecting the most appropriate vector database for their specific use case and level of expertise.
<a id="related-reading"></a></p>
<h2>Related reading</h2>
<ol>
<li><a href="https://lunabrain.com/blog/riding-the-ai-wave-with-vector-databases-how-they-work-and-why-vcs-love-them/">Riding the AI Wave with Vector Databases: How they work (and why VCs love them) - LunaBrain</a></li>
<li><a href="https://harishgarg.com/writing/best-vector-databases-for-ai-apps/">10 Best vector databases for building AI Apps with embeddings - HarishGarg.com</a></li>
<li><a href="https://thenewstack.io/vector-databases-long-term-memory-for-artificial-intelligence/">Vector Databases: Long-Term Memory for Artificial Intelligence - The New Stack</a></li>
<li><a href="https://medium.com/sopmac-ai/vector-databases-as-memory-for-your-ai-agents-986288530443">Vector Databases as Memory for your AI Agents | by Ivan Campos | Sopmac AI | Apr, 2023 | Medium</a></li>
<li><a href="https://venturebeat.com/ai/how-vector-databases-can-revolutionize-our-relationship-with-generative-ai/">How vector databases can revolutionize our relationship with generative AI | VentureBeat</a></li>
<li><a href="https://www.forbes.com/sites/adrianbridgwater/2023/05/19/the-rise-of-vector-databases/">Vector databases provide new ways to enable search and data analytics.</a></li>
<li><a href="https://betterprogramming.pub/openais-embedding-model-with-vector-database-b69014f04433">OpenAI’s Embeddings with Vector Database | Better Programming</a></li>
<li>Vector Databases Demystified series by <a href="https://www.linkedin.com/in/adiekaye/">Adie Kaye</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-1-introduction-world-adie-kaye%3FtrackingId=Rswjt%252BgljDJ9YTjMB08LWw%253D%253D/?trackingId=Rswjt%2BgljDJ9YTjMB08LWw%3D%3D">Part 1 - An Introduction to the World of High-Dimensional Data Storage</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye?trackingId=CRILIdZ0zUFLlj3EZ69gXQ%3D%3D&lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_recent_activity_content_view%3B1s%2FDztmATJWjL%2BLIoqi0XQ%3D%3D">Part 2 - Building Your Own (Very) Simple Vector Database in Python</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-3-build-colour-matching-adie-kaye?trackingId=sS3mR3KmPvSwcPwdMJvbFQ%3D%3D&lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_recent_activity_content_view%3B1s%2FDztmATJWjL%2BLIoqi0XQ%3D%3D">Part 3 - Build a colour matching app with Pinecone</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-4-using-sentence-pinecone-kaye?trackingId=vfLY3dFcGw%2FVygrCCFKZIQ%3D%3D">Part 4 - Using Sentence Transformers with Pinecone</a></li>
</ol>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Mastering the Kanban Method - Unveiling the Hidden Gems of Effective Kanban Board Usage2023-05-26T00:00:00+02:002023-05-26T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-05-26:/mastering-kanban-method/<p>Ever wondered how to supercharge your team's productivity? Say hello to Kanban, the dynamic method that brings clarity and efficiency to your projects.</p><h2>Introduction</h2>
<p>In today's fast-paced and ever-evolving business landscape, organizations are constantly seeking efficient project management methodologies to enhance productivity and streamline workflows. One such approach that has gained significant popularity is the Kanban method. Kanban, originating from the Japanese word for "signboard" or "billboard," is a visual project management system that allows teams to track and manage work effectively. In this comprehensive guide, we will delve into the intricacies of the Kanban method, explore the proper utilization of Kanban boards, and reveal lesser-known tips and tricks to maximize their potential.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#understanding-the-kanban-method-principles">Understanding the Kanban Method Principles</a><ul>
<li><a href="#visualize-your-workflow">Visualize Your Workflow</a></li>
<li><a href="#limit-work-in-progress-wip">Limit Work in Progress (WIP)</a></li>
<li><a href="#collaborate-and-encourage-flow">Collaborate and Encourage Flow</a></li>
<li><a href="#continuously-improve">Continuously Improve</a></li>
</ul>
</li>
<li><a href="#avoiding-common-mistakes">Avoiding Common Mistakes</a><ul>
<li><a href="#neglecting-wip-limits">Neglecting WIP Limits</a></li>
<li><a href="#lack-of-clarity-and-standardization">Lack of Clarity and Standardization</a></li>
<li><a href="#failure-to-prioritize-and-swarm">Failure to Prioritize and Swarm</a></li>
<li><a href="#lack-of-continuous-improvement">Lack of Continuous Improvement</a></li>
</ul>
</li>
<li><a href="#unveiling-lesser-known-tips-and-tricks">Unveiling Lesser-Known Tips and Tricks</a><ul>
<li><a href="#class-of-service">Class of Service</a></li>
<li><a href="#visualizing-blocked-tasks">Visualizing Blocked Tasks</a></li>
<li><a href="#kanban-swimlanes">Kanban Swimlanes</a></li>
<li><a href="#implementing-agile-practices">Implementing Agile Practices</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="understanding-the-kanban-method-principles"></a></p>
<h2>Understanding the Kanban Method Principles</h2>
<p>The Kanban method, which grew out of the scheduling system Taiichi Ohno developed at Toyota, is built on the principles of visualizing work, limiting work in progress (WIP), and focusing on continuous improvement. At its core, Kanban promotes transparency, flexibility, and collaboration, providing teams with a clear overview of their tasks and enabling them to optimize their workflows.</p>
<p><a id="visualize-your-workflow"></a></p>
<h3>Visualize Your Workflow</h3>
<p>The fundamental principle of Kanban lies in visualizing your workflow. By representing each task as a card or sticky note on a Kanban board, teams gain a shared understanding of the work in progress. A typical Kanban board comprises columns that depict different stages of work, such as "To Do," "In Progress," and "Done." Visualizing tasks fosters transparency, enhances communication, and enables team members to identify bottlenecks or inefficiencies quickly.</p>
<p><a id="limit-work-in-progress-wip"></a></p>
<h3>Limit Work in Progress (WIP)</h3>
<p>To maintain a smooth workflow and prevent overburdening team members, it is crucial to limit the number of tasks in progress simultaneously. Setting WIP limits for each column on the Kanban board ensures a manageable workload, promotes focus, and encourages completing tasks before moving on to new ones. WIP limits prevent multitasking, which can lead to reduced productivity and increased lead times.</p>
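<p>The WIP-limit rule is mechanical enough to enforce in software, which is what most Kanban tools do. The hypothetical sketch below models a board that rejects any move that would push a column past its limit:</p>

```python
class KanbanBoard:
    """Toy board that refuses moves violating per-column WIP limits."""

    def __init__(self, wip_limits):
        self.wip_limits = wip_limits                 # column -> max cards (None = unlimited)
        self.columns = {name: [] for name in wip_limits}

    def add(self, column, card):
        self._check(column)
        self.columns[column].append(card)

    def move(self, card, src, dst):
        self._check(dst)
        self.columns[src].remove(card)
        self.columns[dst].append(card)

    def _check(self, column):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise ValueError(f"WIP limit reached for '{column}' ({limit})")

board = KanbanBoard({"To Do": None, "In Progress": 2, "Done": None})
for card in ["spec", "api", "docs"]:
    board.add("To Do", card)
board.move("spec", "To Do", "In Progress")
board.move("api", "To Do", "In Progress")
try:
    board.move("docs", "To Do", "In Progress")   # a third card exceeds the limit
except ValueError as e:
    print(e)
```

<p>Rejecting the move, instead of silently allowing it, is the point: the blocked pull forces the team to finish "spec" or "api" before starting "docs".</p>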
<p><a id="collaborate-and-encourage-flow"></a></p>
<h3>Collaborate and Encourage Flow</h3>
<p>Kanban encourages collaboration and cross-functional teamwork. By eliminating silos and fostering a culture of shared responsibility, teams can achieve a seamless flow of work. Encourage frequent communication, promote knowledge sharing, and embrace a collective ownership mindset to optimize the overall efficiency of your Kanban system.</p>
<p><a id="continuously-improve"></a></p>
<h3>Continuously Improve</h3>
<p>The Kanban method is rooted in the philosophy of continuous improvement. Encourage your team to reflect on their processes, identify areas of improvement, and implement changes accordingly. By regularly reviewing your Kanban board, analyzing cycle times, and seeking feedback from team members, you can refine your workflows, streamline processes, and enhance overall productivity.</p>
<p><a id="avoiding-common-mistakes"></a></p>
<h2>Avoiding Common Mistakes</h2>
<p>While the Kanban method offers numerous benefits, it's important to be aware of common pitfalls that can hinder its effectiveness. By recognizing and avoiding these mistakes, you can ensure your Kanban implementation is successful.</p>
<p><a id="neglecting-wip-limits"></a></p>
<h3>Neglecting WIP Limits</h3>
<p>One common mistake is neglecting WIP limits or setting them too high. Failing to adhere to WIP limits can lead to task overload, reduced focus, and increased lead times. Regularly review and adjust WIP limits based on team capacity and project requirements.</p>
<p><a id="lack-of-clarity-and-standardization"></a></p>
<h3>Lack of Clarity and Standardization</h3>
<p>Without clear guidelines and standardized practices, teams may interpret Kanban differently, leading to confusion and inconsistency. Establish explicit rules for how tasks should be represented on the board, how updates are communicated, and how metrics are measured. Consistency ensures everyone understands the workflow and can collaborate effectively.</p>
<p><a id="failure-to-prioritize-and-swarm"></a></p>
<h3>Failure to Prioritize and Swarm</h3>
<p>In Kanban, it's important to prioritize tasks and encourage the team to focus on completing them one at a time. Neglecting prioritization can lead to cherry-picking tasks or tackling low-value items first. Additionally, encourage swarming, where team members collaborate to complete tasks together, rather than working individually, to maximize efficiency and knowledge sharing.</p>
<p><a id="lack-of-continuous-improvement"></a></p>
<h3>Lack of Continuous Improvement</h3>
<p>One of the main principles of Kanban is continuous improvement. Failing to allocate time for retrospectives, process analysis, and incremental changes can hinder your team's growth and limit the full potential of your Kanban system. Regularly review and refine your workflows to ensure ongoing progress and evolution.</p>
<p><a id="unveiling-lesser-known-tips-and-tricks"></a></p>
<h2>Unveiling Lesser-Known Tips and Tricks</h2>
<p>Now, let's uncover some lesser-known tips and tricks that can take your Kanban practice to the next level, boosting your team's productivity and overall success.</p>
<p><a id="class-of-service"></a></p>
<h3>Class of Service</h3>
<p>Introduce the concept of "Class of Service" to prioritize tasks based on their impact and urgency. By assigning different classes to tasks, such as expedite, standard, or fixed-date, teams can ensure that critical work is appropriately prioritized and expedited, while still maintaining a steady flow.</p>
<p><a id="visualizing-blocked-tasks"></a></p>
<h3>Visualizing Blocked Tasks</h3>
<p>In addition to representing tasks in progress, leverage the Kanban board to highlight blocked or stalled tasks. Use specific indicators or flags to denote issues preventing progress, such as dependencies, resource constraints, or waiting for external feedback. This visual cue helps the team focus on resolving blockers and ensures smoother workflow management.</p>
<p><a id="kanban-swimlanes"></a></p>
<h3>Kanban Swimlanes</h3>
<p>Introduce swimlanes on your Kanban board to categorize tasks based on different criteria, such as priority, team member, or project phase. Swimlanes provide a higher level of organization and enable teams to filter and analyze their work in a more granular manner. This approach can be particularly beneficial for larger teams or complex projects.</p>
<p><a id="implementing-agile-practices"></a></p>
<h3>Implementing Agile Practices</h3>
<p>Combine Kanban with agile practices to amplify its impact. Techniques like daily stand-ups, sprint planning, and retrospectives can complement the visual nature of Kanban, fostering enhanced collaboration, transparency, and adaptability within your team.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>The Kanban method, with its emphasis on visualization, limiting work in progress, and continuous improvement, offers organizations a powerful tool to optimize their workflows and enhance team productivity. By avoiding common mistakes and incorporating lesser-known tips and tricks, teams can unlock the full potential of Kanban, streamline their processes, and achieve remarkable results. Embrace the power of Kanban, and watch your projects flourish in an environment of transparency, collaboration, and continuous improvement.</p>
<p><strong>Credits</strong>: heading image from <a href="https://unsplash.com/photos/OXmym9cuaEY">unsplash</a> by <a href="https://unsplash.com/@edenconstantin0">edenconstantin0</a></p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Getting the User's Home Directory Path in Python - A Cross-Platform Guide2023-04-20T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-20:/python-user-home-directory/<h2>Use <code>os.path.expanduser()</code></h2>
<p>To get the user's home directory in Python, you can use the <code>os.path.expanduser()</code> function. This function expands the initial tilde <code>~</code> character in a file path to the user's home directory path.</p>
<p>Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s2">"~"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This should output the path to the user's home directory, which will be different depending on the operating system.</p>
<p>For example, on a Unix-based system such as macOS or Linux, this will output something like <code>/Users/username</code>. On a Windows system, it will output something like <code>C:\Users\username</code>.</p>
<p>Using <code>os.path.expanduser()</code> is a cross-platform solution because it automatically handles the differences between operating systems in how they represent home directory paths.</p>
<h2>Use <code>Path.home()</code></h2>
<p>You can also use the <code>Path.home()</code> method of the <code>pathlib</code> module to get the user's home directory path in a platform-independent way. Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">home</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This will output the same path to the user's home directory as the previous example, but it uses the <code>Path</code> object instead of the <code>os</code> module.</p>
<p>The <code>Path.home()</code> method is a cross-platform way of getting the user's home directory path. It returns a <code>Path</code> object representing the home directory path, which can be used with other <code>pathlib</code> methods to manipulate file paths in a platform-independent way.</p>
<h2>Other alternatives</h2>
<p>There are a few other ways to get the user's home directory path in Python, some of which are platform-dependent.</p>
<ol>
<li>Using the <code>os.environ</code> dictionary:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'HOME'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This works on Unix-based systems like macOS and Linux, where the <code>HOME</code> environment variable is set to the user's home directory path.</p>
<ol>
<li>Using the <code>os.path.expandvars()</code> function:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">expandvars</span><span class="p">(</span><span class="s1">'$HOME'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This also works on Unix-based systems where the <code>HOME</code> environment variable is set, but it can also work on other systems if the appropriate environment variable is set.</p>
<ol>
<li>Using the <code>winreg</code> module on Windows:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">winreg</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">winreg</span><span class="o">.</span><span class="n">OpenKey</span><span class="p">(</span><span class="n">winreg</span><span class="o">.</span><span class="n">HKEY_CURRENT_USER</span><span class="p">,</span> <span class="sa">r</span><span class="s2">"SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders"</span><span class="p">)</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">winreg</span><span class="o">.</span><span class="n">QueryValueEx</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="s2">"Personal"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This works on Windows systems, but note that the <code>"Personal"</code> registry value actually points to the user's Documents folder rather than the home directory itself. It also requires the <code>winreg</code> module and accesses the Windows Registry, so it is not as platform-independent as the other solutions.</p>
<p>Overall, using either <code>os.path.expanduser()</code> or <code>Path.home()</code> is the most reliable and platform-independent way to get the user's home directory path in Python.</p>Attacking Differential Privacy Using the Correlation Between the Features2023-04-19T00:00:00+02:002023-04-19T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-19:/attacking-differential-privacy-using-the-correlation-between-the-features/<p>Learn how differential privacy works by simulating an attack on data protected with that technique.</p><h2>Introduction</h2>
<p>Differential privacy is a technique that adds random noise to the data to protect individual privacy while still allowing for accurate data analysis. However, differential privacy can still be vulnerable to attacks that can compromise the privacy of individuals. One such attack is through the use of correlation between features. In this blog post, we will discuss how an attacker can use correlation between features to attack differential privacy and how to mitigate this attack.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#background">Background</a></li>
<li><a href="#correlation-between-features">Correlation Between Features</a></li>
<li><a href="#steps-for-the-attack-using-correlation-between-features">Steps for the attack using correlation between features</a></li>
<li><a href="#1-identify-highly-correlated-features">1. Identify highly correlated features</a></li>
<li><a href="#2-compute-expected-values">2. Compute expected values</a></li>
<li><a href="#3-compare-expected-and-observed-values">3. Compare expected and observed values</a></li>
<li><a href="#mitigating-the-attack">Mitigating the Attack</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#tutorial">Tutorial</a></li>
<li><a href="#select-a-dataset-that-requires-privacy-protection">Select a dataset that requires privacy protection</a></li>
<li><a href="#apply-differential-privacy">Apply differential privacy</a></li>
<li><a href="#perform-the-attack---reconstruct-original-data-by-exploiting-correlation-between-features">Perform the attack - reconstruct original data by exploiting correlation between features</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="background"></a></p>
<h2>Background</h2>
<p>Differential privacy adds random noise to the data to protect the privacy of individuals. The amount of noise added depends on a parameter called the privacy budget. The higher the privacy budget, the less noise is added, and the lower the privacy budget, the more noise is added. The privacy budget is usually set based on the desired level of privacy and the size of the data set. A smaller privacy budget leads to better privacy but less accurate data analysis, while a larger privacy budget leads to less privacy but more accurate data analysis.</p>
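<p>The standard way to realize this trade-off is the Laplace mechanism: noise is drawn from a Laplace distribution whose scale equals the query's sensitivity divided by the privacy budget ε, so a smaller budget yields noisier answers. A minimal sketch for a counting query, using NumPy's Laplace sampler:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release `true_value` with Laplace noise of scale sensitivity/epsilon.
    A smaller epsilon (tighter privacy budget) means a larger noise scale."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# counting query: adding or removing one person changes the count by at most 1
true_count = 1234
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=eps)
    print(f"epsilon={eps:>4}: noisy count = {noisy:.1f}")
```

<p>Running this shows the trade-off directly: at ε = 0.1 the released count can be off by tens, while at ε = 10 it is typically within a fraction of a unit of the truth.</p>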
<p><a id="correlation-between-features"></a></p>
<h2>Correlation Between Features</h2>
<p>In many data sets, the features are not independent but are correlated with each other. Correlation between features can be measured using the correlation coefficient. The correlation coefficient between two features x and y is defined as:</p>
<div class="math">$$
\rho_{x,y} = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y}
$$</div>
<p>where <span class="math">\(cov(x,y)\)</span> is the covariance between <span class="math">\(x\)</span> and <span class="math">\(y\)</span>, and <span class="math">\(\sigma_x\)</span> and <span class="math">\(\sigma_y\)</span> are the standard deviations of <span class="math">\(x\)</span> and <span class="math">\(y\)</span>, respectively.</p>
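<p>As a quick check of the formula, the coefficient can be computed directly with NumPy (the numbers below are made up for illustration):</p>

```python
import numpy as np

x = np.array([25, 32, 47, 51, 62], dtype=float)  # e.g. age
y = np.array([10, 12, 14, 15, 16], dtype=float)  # e.g. years of education

# rho = cov(x, y) / (sigma_x * sigma_y); ddof=1 matches np.cov's default
rho = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef computes the same quantity
rho_check = np.corrcoef(x, y)[0, 1]
```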
<p>Correlation between features can be used to attack differential privacy. An attacker can use the correlation between features to infer the presence or absence of an individual's data in the data set. For example, suppose an attacker knows that two features x and y are highly correlated. If the attacker sees that the value of y is very different from what they would expect based on the value of x, they can infer that the individual's data was not included in the data set.</p>
<p><a id="steps-for-the-attack-using-correlation-between-features"></a></p>
<h2>Steps for the attack using correlation between features</h2>
<p>An attacker can use the following steps to attack differential privacy using correlation between features:</p>
<p><a id="1-identify-highly-correlated-features"></a></p>
<h3>1. Identify highly correlated features</h3>
<p>The attacker identifies which features in the data set are highly correlated with each other.</p>
<p><a id="2-compute-expected-values"></a></p>
<h3>2. Compute expected values</h3>
<p>The attacker computes the expected values of the features based on the values of the other features.</p>
<p><a id="3-compare-expected-and-observed-values"></a></p>
<h3>3. Compare expected and observed values</h3>
<p>The attacker compares the expected values with the observed values of the features. If the observed values are significantly different from the expected values, the attacker can infer that the individual's data was not included in the data set.</p>
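<p>The three steps above can be sketched on synthetic data (a toy illustration, not the tutorial dataset; the 3-sigma threshold is an arbitrary choice):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: identify highly correlated features (correlated by construction here)
x = rng.normal(50, 10, size=1000)
y = 0.3 * x + rng.normal(0, 1, size=1000)

# estimate the linear relationship between the two features
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Step 2: compute the expected value of y given an observed value of x
x_obs, y_obs = 60.0, 18.2
y_expected = slope * x_obs + intercept

# Step 3: compare expected and observed values; a large gap suggests the
# record was perturbed or absent from the data set
residual_scale = np.std(y - (slope * x + intercept))
suspicious = abs(y_obs - y_expected) > 3 * residual_scale
```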
<p><a id="mitigating-the-attack"></a></p>
<h2>Mitigating the Attack</h2>
<p>There are several ways to mitigate the attack using correlation between features. One approach is to <strong>decorrelate the features</strong> by transforming the data. For example, principal component analysis (PCA) can be used to decorrelate the features. Another approach is to <strong>add noise to the data</strong> in a way that preserves the correlation between features. This approach is called differentially private PCA (DP-PCA). DP-PCA adds noise to the data in a way that satisfies differential privacy while preserving the correlation between features.</p>
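<p>The decorrelation idea can be illustrated with a plain eigendecomposition-based PCA in NumPy (this is a sketch of decorrelation only; it does not by itself provide differential privacy):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# two strongly correlated features
x = rng.normal(0, 1, size=500)
data = np.column_stack([x, 0.8 * x + rng.normal(0, 0.3, size=500)])

# PCA: project the centered data onto the eigenvectors of its covariance
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
decorrelated = centered @ eigvecs

# after the projection, the off-diagonal covariance is numerically zero
off_diag = np.cov(decorrelated.T)[0, 1]
```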
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>Correlation between features can be used to attack differential privacy. An attacker can use the correlation between features to infer the presence or absence of an individual's data in the data set. To mitigate this attack, the features can be decorrelated or noise can be added to the data using DP-PCA. Data security experts should be aware of this attack and take appropriate measures to mitigate its effects.</p>
<p><a id="tutorial"></a></p>
<h2>Tutorial</h2>
<p>In this tutorial, we will go through the steps of attacking differential privacy by exploiting correlations between features, using Python code to demonstrate each step.</p>
<p>In this tutorial we will use the PyDP Python library, so install it first:</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>python-dp
</code></pre></div>
<p><a id="select-a-dataset-that-requires-privacy-protection"></a></p>
<h3>Select a dataset that requires privacy protection</h3>
<p>For this tutorial, we will use the Adult dataset from the UCI Machine Learning Repository. This dataset contains information about individuals, including their age, education level, marital status, occupation, and more. The goal is to predict whether an individual earns more than $50K per year. We will load this dataset using pandas:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"workclass"</span><span class="p">,</span> <span class="s2">"fnlwgt"</span><span class="p">,</span> <span class="s2">"education"</span><span class="p">,</span> <span class="s2">"education-num"</span><span class="p">,</span> <span class="s2">"marital-status"</span><span class="p">,</span>
<span class="s2">"occupation"</span><span class="p">,</span> <span class="s2">"relationship"</span><span class="p">,</span> <span class="s2">"race"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"capital-gain"</span><span class="p">,</span> <span class="s2">"capital-loss"</span><span class="p">,</span>
<span class="s2">"hours-per-week"</span><span class="p">,</span> <span class="s2">"native-country"</span><span class="p">,</span> <span class="s2">"income"</span><span class="p">])</span>
</code></pre></div>
<p><a id="apply-differential-privacy"></a></p>
<h3>Apply differential privacy</h3>
<p>We will use the PyDP library to apply differential privacy to the dataset. We will add Laplace noise to the age and education-num features, with a privacy budget of 1.0:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pydp.algorithms.laplacian</span> <span class="kn">import</span> <span class="n">BoundedMean</span>
<span class="n">epsilon</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="c1"># apply differential privacy to age</span>
<span class="n">bm</span> <span class="o">=</span> <span class="n">BoundedMean</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="n">epsilon</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"age"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"age"</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">bm</span><span class="o">.</span><span class="n">quick_result</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="c1"># apply differential privacy to education-num</span>
<span class="n">bm</span> <span class="o">=</span> <span class="n">BoundedMean</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="n">epsilon</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"education-num"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"education-num"</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">bm</span><span class="o">.</span><span class="n">quick_result</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div>
<p><a id="perform-the-attack---reconstruct-original-data-by-exploiting-correlation-between-features"></a></p>
<h3>Perform the attack - reconstruct original data by exploiting correlation between features</h3>
<p>Now that we have applied differential privacy to the dataset, we will attempt to reconstruct the original data by exploiting the correlation between features. Specifically, we will use the age and education-num features, which we know are highly correlated, to infer the values of the original data.</p>
<p>First, we will compute the mean vector and covariance matrix of the two (noised) numeric features. These are the statistics an attacker can estimate from the released data; note that they must be computed on numeric columns only, since the categorical columns cannot enter the covariance computation:</p>
<div class="highlight"><pre><code>import numpy as np

# the attack works on the two correlated numeric features
numeric = df[["age", "education-num"]].astype(float).values

# mean vector and 2x2 covariance matrix of the noised features
mean = np.mean(numeric, axis=0)
cov = np.cov(numeric.T)
</code></pre></div>
<p>The mean and covariance matrix can also be used to generate synthetic data with the same first- and second-order statistics:</p>
<div class="highlight"><pre><code># generate synthetic (age, education-num) pairs
synthetic_data = np.random.multivariate_normal(mean, cov, size=df.shape[0])
synthetic_df = pd.DataFrame(synthetic_data, columns=["age", "education-num"])
</code></pre></div>
<p>Finally, we will reconstruct each of the two features from the other, using the linear (least-squares) predictor implied by the estimated covariance:</p>
<div class="highlight"><pre><code># reconstruct age from education-num, and vice versa
reconstructed_age = (df["education-num"].values - mean[1]) / cov[1, 1] * cov[0, 1] + mean[0]
reconstructed_edu_num = (df["age"].values - mean[0]) / cov[0, 0] * cov[0, 1] + mean[1]

# combine the reconstructed features with the remaining original columns
df_attack = df.drop(columns=["age", "education-num"])
reconstructed_df = pd.DataFrame({"age": reconstructed_age, "education-num": reconstructed_edu_num})
df_reconstructed = pd.concat([df_attack, reconstructed_df], axis=1)
</code></pre></div>
<p>We can now compare the reconstructed age and education-num features with the original features to see how well our attack worked:</p>
<div class="highlight"><pre><code># compare reconstructed age and education-num with original features
print("Age:")
print("Original:", df["age"].values[:10])
print("Reconstructed:", reconstructed_age[:10])
print()
print("Education-num:")
print("Original:", df["education-num"].values[:10])
print("Reconstructed:", reconstructed_edu_num[:10])
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Age</span><span class="p">:</span>
<span class="n">Original</span><span class="p">:</span> <span class="p">[</span><span class="mi">39</span> <span class="mi">50</span> <span class="mi">38</span> <span class="mi">53</span> <span class="mi">28</span> <span class="mi">37</span> <span class="mi">49</span> <span class="mi">52</span> <span class="mi">31</span> <span class="mi">42</span><span class="p">]</span>
<span class="n">Reconstructed</span><span class="p">:</span> <span class="p">[</span><span class="mf">39.38640885</span> <span class="mf">49.44619487</span> <span class="mf">38.2757904</span> <span class="mf">52.75103613</span> <span class="mf">26.46121269</span> <span class="mf">37.760824</span>
<span class="mf">47.88143872</span> <span class="mf">52.8530772</span> <span class="mf">30.79760633</span> <span class="mf">42.56495885</span><span class="p">]</span>
<span class="n">Education</span><span class="o">-</span><span class="n">num</span><span class="p">:</span>
<span class="n">Original</span><span class="p">:</span> <span class="p">[</span><span class="mi">13</span> <span class="mi">13</span> <span class="mi">9</span> <span class="mi">7</span> <span class="mi">13</span> <span class="mi">14</span> <span class="mi">5</span> <span class="mi">9</span> <span class="mi">14</span> <span class="mi">13</span><span class="p">]</span>
<span class="n">Reconstructed</span><span class="p">:</span> <span class="p">[</span><span class="mf">13.19164695</span> <span class="mf">13.19406455</span> <span class="mf">9.04750693</span> <span class="mf">6.8549391</span> <span class="mf">13.25155432</span> <span class="mf">13.76664294</span>
<span class="mf">5.45598348</span> <span class="mf">8.72003132</span> <span class="mf">14.14489928</span> <span class="mf">12.9968581</span> <span class="p">]</span>
</code></pre></div>
<p>As we can see, the reconstructed values are quite similar to the original values. This suggests that an attacker could use the correlation between the age and education-num features to infer the original values, even with the protection of differential privacy.</p>
<p><a id="conclusion"></a></p>
<h3>Conclusion</h3>
<p>In this tutorial, we have demonstrated how an attacker can exploit correlations between features to attack differential privacy. We used the PyDP library to apply differential privacy to a dataset, and then showed how an attacker could use the correlation between the age and education-num features to reconstruct the original values. This highlights the importance of considering the correlations between features when applying differential privacy, and suggests that additional protections may be necessary to prevent attacks based on feature correlations.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Python Regex Named Groups2023-04-19T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-19:/python-regex-named-groups/<p>In Python regex, <code>match.groupdict()</code> is a method that returns a dictionary containing all the named groups of a regular expression match.</p>
<p>When you use named capturing groups in a regular expression using the <code>(?P<name>...)</code> syntax, you can access the captured text using the <code>groupdict()</code> method on the match object returned by <code>re.match()</code> or <code>re.search()</code>. The keys of the dictionary correspond to the group names, and the values are the captured text for each group.</p>
<p>Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">re</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="sa">r</span><span class="s1">'(?P<year>\d</span><span class="si">{4}</span><span class="s1">)-(?P<month>\d</span><span class="si">{2}</span><span class="s1">)-(?P<day>\d</span><span class="si">{2}</span><span class="s1">)'</span>
<span class="n">text</span> <span class="o">=</span> <span class="s1">'Today is 2023-04-19'</span>
<span class="n">match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="k">if</span> <span class="n">match</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">match</span><span class="o">.</span><span class="n">groupdict</span><span class="p">())</span>
</code></pre></div>
<p>Output:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'year'</span><span class="p">:</span> <span class="s1">'2023'</span><span class="p">,</span> <span class="s1">'month'</span><span class="p">:</span> <span class="s1">'04'</span><span class="p">,</span> <span class="s1">'day'</span><span class="p">:</span> <span class="s1">'19'</span><span class="p">}</span>
</code></pre></div>
<p>In the above example, the regular expression pattern matches a date string in the format 'yyyy-mm-dd', and each part of the date is captured using named groups. The <code>groupdict()</code> method returns a dictionary with keys 'year', 'month', and 'day', and their corresponding captured values.</p>Are LIME Explanations Any Useful?2023-04-18T00:00:00+02:002023-04-18T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-18:/are-lime-explanations-any-useful/<p>Don't let black box models hold you back. With LIME, you can interpret the predictions of even the most complex machine learning models.</p><p>LIME (Local Interpretable Model-agnostic Explanations) is a method used to interpret black box models. This technique is widely used in the field of data science to explain the predictions of complex machine learning models. By providing local explanations, LIME can help users understand the decision-making process of the models and increase their trust in the models' predictions. However, the question remains, are the local explanations obtained with LIME method useful? And what are the practical use cases when using LIME gave tangible results?</p>
<p>In this article, we will delve into the concept of LIME, its practical applications, and how it can be used to provide interpretable machine learning models.</p>
<h2>What is LIME?</h2>
<p>LIME is a model-agnostic technique used to explain the predictions of machine learning models. The idea behind LIME is to explain the predictions of a black box model by training a local, interpretable model around the data point of interest. The interpretable model is trained to mimic the behavior of the black box model around that data point. Once the local model is trained, it can be used to generate an explanation of the prediction, highlighting the most important features that contributed to the prediction.</p>
<p>The LIME algorithm consists of the following steps:</p>
<ol>
<li>Select a data point of interest.</li>
<li>Generate a dataset of perturbed instances around the selected data point.</li>
<li>Evaluate the black box model on the perturbed instances to obtain a set of weights that indicate the importance of each feature for the prediction.</li>
<li>Train an interpretable model (such as a linear regression model) on the perturbed instances, using the weights obtained in step 3 as feature weights.</li>
<li>Use the trained interpretable model to generate an explanation of the prediction for the selected data point.</li>
</ol>
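<p>The steps above can be sketched in plain NumPy. This toy version explains one prediction of a hypothetical black-box function by fitting a proximity-weighted linear surrogate (weighted least squares); it is a simplified illustration of the idea, not the actual <code>lime</code> package:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    """A hypothetical opaque model: nonlinear in feature 0, weak in feature 1."""
    return np.sin(X[:, 0]) + 0.1 * X[:, 1]

# 1. data point of interest
x0 = np.array([1.0, 2.0])

# 2. perturbed instances around x0
Z = x0 + rng.normal(0.0, 0.5, size=(500, 2))

# 3. proximity weights (RBF kernel): closer samples matter more
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / (2 * 0.5 ** 2))

# 4. weighted linear surrogate fitted on the perturbed instances
A = np.column_stack([np.ones(len(Z)), Z])  # bias column + features
AtW = A.T * weights                        # A^T W without forming the diagonal W
coef = np.linalg.solve(AtW @ A, AtW @ black_box(Z))

# 5. the surrogate's coefficients are the local explanation;
#    near x0 the true local slopes are cos(1) ~ 0.54 and 0.1
local_importance = coef[1:]
```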
<h2>Practical applications of LIME</h2>
<p>LIME has been successfully applied in various domains, including healthcare, finance, and image recognition. Here are some practical use cases where LIME has been used to provide interpretable machine learning models:</p>
<ol>
<li>
<p><strong>Healthcare</strong>: LIME has been used to interpret the predictions of machine learning models that diagnose diseases. For example, in a study conducted by Zech et al., LIME was used to interpret the predictions of a deep learning model that diagnosed pneumonia from chest X-rays. The LIME explanations provided by the study helped radiologists understand the decision-making process of the model and identify areas of the X-rays that contributed the most to the diagnosis.</p>
</li>
<li>
<p><strong>Finance</strong>: LIME has also been used to interpret the predictions of machine learning models that predict financial outcomes. For example, in a study conducted by Chen et al., LIME was used to interpret the predictions of a machine learning model that predicted the credit risk of borrowers. The LIME explanations provided by the study helped lenders understand the factors that contributed to the credit risk prediction and make informed lending decisions.</p>
</li>
<li>
<p><strong>Image recognition</strong>: LIME has been used to interpret the predictions of machine learning models that recognize images. For example, in a study conducted by Selvaraju et al., LIME was used to interpret the predictions of a deep learning model that recognized objects in images. The LIME explanations provided by the study helped users understand which parts of the image were important for the prediction and identify areas of improvement for the model.</p>
</li>
</ol>
<h2>Benefits and limitations of LIME</h2>
<p>LIME provides several <strong>benefits</strong> to data scientists and machine learning practitioners.</p>
<p>First, LIME <strong>can help increase the trust of users in machine learning models by providing interpretable explanations of the models' predictions</strong>. This can be especially useful in high-stakes domains, such as healthcare and finance, where decisions based on machine learning predictions can have significant consequences.</p>
<p>Second, LIME <strong>can help users identify areas of improvement for machine learning models</strong>. By providing explanations of the models' predictions, LIME can help users identify which features were important for the prediction and which ones were not. This can help users refine their feature engineering process and improve the performance of their models.</p>
<p>However, LIME also has some <strong>limitations</strong> that data scientists and machine learning practitioners should be aware of.</p>
<p>First, LIME provides local explanations, which means that <strong>the explanations are only valid for the selected data point of interest</strong>. Therefore, the explanations generated by LIME may not generalize to other data points.</p>
<p>Second, LIME <strong>requires a significant amount of computational resources</strong> to generate the perturbed instances and train the interpretable model. This can be a limitation when working with large datasets or computationally expensive models.</p>
<h2>Conclusion</h2>
<p>LIME is a useful technique for interpreting the predictions of machine learning models. LIME can help increase the trust of users in machine learning models and identify areas of improvement for the models. LIME has been successfully applied in various domains, including healthcare, finance, and image recognition. However, LIME also has some limitations, such as providing local explanations and requiring significant computational resources. Therefore, data scientists and machine learning practitioners should carefully consider the use of LIME and its limitations when interpreting the predictions of their models.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Intrinsic vs. Extrinsic Evaluation - What's the Best Way to Measure Embedding Quality?2023-04-18T00:00:00+02:002023-04-18T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-18:/measure-quality-of-embeddings-intrinsic-vs-extrinsic/<p>Learn how to measure the quality of word and sentence embeddings in natural language processing (NLP), including intrinsic and extrinsic evaluation, and their strengths and limitations.</p><p>X::<a href="https://www.safjan.com/demystifying-perplexity-assessing-dimensionality-reduction-with-pca/">Demystifying Perplexity - Assessing Dimensionality Reduction With PCA</a></p>
<h2>Introduction</h2>
<p>Let's start with the concept of embedding vectors. In natural language processing (NLP), an embedding vector is a mathematical representation of words or phrases. It's a way to convert text data into numerical values that can be processed by machine learning algorithms. Word embeddings and sentence embeddings are widely used in natural language processing (NLP) for a variety of tasks, such as text classification, named entity recognition, machine translation, and sentiment analysis. However, it is not always straightforward to evaluate the quality of embeddings, and different evaluation metrics may be appropriate for different use cases. In this blog post, we will explore several ways to measure the quality of embeddings, including intrinsic and extrinsic evaluation, and discuss their strengths and limitations.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#intrinsic-evaluation">Intrinsic Evaluation</a><ul>
<li><a href="#cosine-similarity">Cosine Similarity</a></li>
<li><a href="#spearman-correlation">Spearman Correlation</a></li>
<li><a href="#accuracy">Accuracy</a></li>
</ul>
</li>
<li><a href="#extrinsic-evaluation">Extrinsic Evaluation</a><ul>
<li><a href="#f1-score">F1 Score</a></li>
<li><a href="#perplexity">Perplexity</a></li>
</ul>
</li>
<li><a href="#limitations">Limitations</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="intrinsic-evaluation"></a></p>
<h2>Intrinsic Evaluation</h2>
<blockquote>
<p><strong>Intrinsic evaluation</strong> - aims to measure the quality of embeddings by assessing their performance on specific NLP tasks that are related to the embedding space itself, such as word similarity, analogy, and classification.</p>
</blockquote>
<p>In this section, we will discuss three commonly used intrinsic evaluation metrics: cosine similarity, Spearman correlation, and accuracy.</p>
<p><a id="cosine-similarity"></a></p>
<h3>Cosine Similarity</h3>
<p>Cosine similarity measures the similarity between two vectors by computing the cosine of the angle between them. In the context of embeddings, cosine similarity is often used to measure the similarity between two words, or between a word and its context. The formula for cosine similarity is as follows:</p>
<div class="math">$$
cosine\_similarity(\textbf{v}_1, \textbf{v}_2) = \frac{\textbf{v}_1 \cdot \textbf{v}_2}{\|\textbf{v}_1\|\|\textbf{v}_2\|}
$$</div>
<p>where <span class="math">\(\textbf{v}_1\)</span> and <span class="math">\(\textbf{v}_2\)</span> are the embeddings of two words, and <span class="math">\(\|\cdot\|\)</span> denotes the Euclidean norm.</p>
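<p>To make this concrete, here is a minimal, dependency-free implementation of the formula (the 3-dimensional toy vectors are invented for illustration; real embeddings come from a trained model and typically have hundreds of dimensions):</p>

```python
import math

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy embeddings, invented for illustration
king = [0.8, 0.65, 0.1]
queen = [0.75, 0.7, 0.15]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, car))    # much lower: unrelated words
```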
<p><a id="spearman-correlation"></a></p>
<h3>Spearman Correlation</h3>
<p>Spearman correlation measures the monotonic relationship between two variables, which can be the similarity scores of two sets of words or phrases computed by humans and by embeddings. A high Spearman correlation indicates that the embeddings are able to capture the semantic relationships between words that humans perceive. The formula for Spearman correlation is as follows:</p>
<div class="math">$$
\text{Spearman's correlation} = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$</div>
<p>where <span class="math">\(d_i\)</span> is the difference between the ranks of the <span class="math">\(i\)</span>-th pair of similarity scores, and <span class="math">\(n\)</span> is the number of pairs.</p>
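<p>For the tie-free case, this rank-difference formula can be sketched directly in Python (the similarity scores below are invented for illustration):</p>

```python
def spearman(x, y):
    """Spearman correlation via the rank-difference formula (assumes no ties)."""
    def ranks(values):
        # Rank 1 = smallest value
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Human similarity judgments vs. model cosine similarities (toy numbers)
human_scores = [9.1, 7.5, 4.2, 1.3]
model_scores = [0.92, 0.80, 0.55, 0.30]
print(spearman(human_scores, model_scores))  # 1.0: identical rankings
```

<p>In practice, <code>scipy.stats.spearmanr</code> is the usual choice, since it also handles ties.</p>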
<p><a id="accuracy"></a></p>
<h3>Accuracy</h3>
<p>Accuracy measures the performance of embeddings on classification tasks, such as sentiment analysis or topic classification. Given a dataset of labeled examples, the embeddings are used to represent each example, and a classifier is trained on these representations. The accuracy of the classifier on a held-out test set is then used as a measure of the quality of the embeddings.</p>
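<p>A minimal sketch of this pipeline, using a nearest-centroid classifier over toy 2-dimensional "embeddings" (all numbers are invented for illustration; a real setup would use pretrained embeddings and a stronger classifier):</p>

```python
import math

# Toy 2-D "embeddings" with sentiment labels (invented for illustration;
# real embeddings come from a pretrained model and have many more dimensions)
train = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
test = [([0.85, 0.15], 1), ([0.15, 0.85], 0)]

# Nearest-centroid classifier: represent each class by its mean embedding
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

centroids = {label: centroid([x for x, y in train if y == label])
             for label in (0, 1)}

def predict(x):
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

# Accuracy on the held-out test set measures embedding quality
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)
```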
<p><a id="extrinsic-evaluation"></a></p>
<h2>Extrinsic Evaluation</h2>
<blockquote>
<p><strong>Extrinsic evaluation</strong> - aims to measure the quality of embeddings by assessing their performance on downstream NLP tasks, such as machine translation or text classification, that are not directly related to the embedding space itself.</p>
</blockquote>
<p>In this section, we will discuss two commonly used extrinsic evaluation metrics: F1 score and perplexity.
<a id="f1-score"></a></p>
<h3>F1 Score</h3>
<p>F1 score is commonly used for classification tasks such as sentiment analysis or named entity recognition, especially when the classes are imbalanced. It combines precision and recall into a single score that ranges from 0 to 1. A high F1 score indicates that the embeddings are able to capture the relevant features of the input data. The formula for F1 score is as follows:</p>
<div class="math">$$
F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$</div>
<p>where precision is the fraction of true positives among the predicted positives, and recall is the fraction of true positives among the actual positives.</p>
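<p>The formula translates directly into code from counts of true positives, false positives, and false negatives (the labels below are invented for illustration):</p>

```python
def f1_score(y_true, y_pred):
    # Count true positives, false positives, and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))  # 2/3: precision = recall = 2/3
```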
<p><a id="perplexity"></a></p>
<h3>Perplexity</h3>
<p>Perplexity is a metric commonly used in language modeling tasks, such as machine translation or text generation. It measures how well a language model can predict a held-out test set of text, given the embeddings as input. A low perplexity indicates that the embeddings are able to capture the semantic and syntactic structures of the language. The formula for perplexity is as follows:</p>
<div class="math">$$
\text{perplexity} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i | \textbf{e}_i)}
$$</div>
<p>where <span class="math">\(N\)</span> is the number of tokens in the test set, <span class="math">\(\textbf{e}_i\)</span> is the embedding of the <span class="math">\(i\)</span>-th token, and <span class="math">\(p(w_i | \textbf{e}_i)\)</span> is the conditional probability of the <span class="math">\(i\)</span>-th token given its embedding.</p>
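<p>Given the per-token probabilities produced by a language model, perplexity can be computed directly (the probabilities below are invented for illustration):</p>

```python
import math

def perplexity(token_probs, base=2):
    # Exponentiated average negative log-probability per token
    n = len(token_probs)
    avg_log_prob = sum(math.log(p, base) for p in token_probs) / n
    return base ** (-avg_log_prob)

# A model that assigns probability 0.25 to every token is as "confused"
# as a uniform choice among 4 options:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```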
<p><a id="limitations"></a></p>
<h2>Limitations</h2>
<p>While intrinsic and extrinsic evaluation metrics can provide useful insights into the quality of embeddings, they also have some limitations. Intrinsic evaluation may not always reflect the performance of embeddings on downstream tasks, and extrinsic evaluation may not always isolate the contribution of embeddings from other factors, such as the choice of model architecture or the quality of the training data. Moreover, the choice of evaluation metrics may depend on the specific use case and the available resources, and there is no one-size-fits-all solution.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Convert HEIC and HEIF to Jpg, Png, BMP With Python2023-04-14T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/convert-heic-and-heif-to-jpg-png-bmp-with-python/<p>HEIF and HEIC image formats are gaining popularity due to their superior image quality and smaller file sizes compared to traditional formats like JPEG and PNG. However, they are not yet widely supported by all devices and software applications. In this blog …</p><p>HEIF and HEIC image formats are gaining popularity due to their superior image quality and smaller file sizes compared to traditional formats like JPEG and PNG. However, they are not yet widely supported by all devices and software applications. In this blog post, we will explore how to convert HEIF and HEIC files to JPEG and other popular image formats using Python.</p>
<!-- MarkdownTOC levels="2,3,4" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#tutorial">Tutorial</a></li>
<li><a href="#use-pillow">Use Pillow</a><ul>
<li><a href="#step-1-installing-required-libraries">Step 1: Installing Required Libraries</a></li>
<li><a href="#step-2-converting-heif-and-heic-files-to-jpeg">Step 2: Converting HEIF and HEIC Files to JPEG</a></li>
<li><a href="#step-3-converting-heif-and-heic-files-to-other-formats">Step 3: Converting HEIF and HEIC Files to Other Formats</a></li>
<li><a href="#step-4-converting-heif-and-heic-files-in-bulk-to-jpeg">Step 4: Converting HEIF and HEIC Files in Bulk to JPEG</a></li>
</ul>
</li>
<li><a href="#use-pyheif-library">Use pyheif library</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="tutorial"></a></p>
<h2>Tutorial</h2>
<p>Python provides several libraries for working with images, including Pillow, OpenCV, and scikit-image. For this tutorial, we will be using the Pillow library, which is a fork of the Python Imaging Library (PIL) and provides a simple and easy-to-use API for image processing.</p>
<p><a id="use-pillow"></a></p>
<h3>Use Pillow</h3>
<p><a id="step-1-installing-required-libraries"></a></p>
<h4>Step 1: Installing Required Libraries</h4>
<p>Before we can begin converting HEIF and HEIC files, we need to make sure we have the necessary libraries installed. Note that Pillow alone cannot decode HEIF/HEIC; the <code>pillow-heif</code> plugin adds that support. To install both, open a terminal or command prompt and run the following command:</p>
<div class="highlight"><pre><span></span><code>pip install Pillow pillow-heif
</code></pre></div>
<p><a id="step-2-converting-heif-and-heic-files-to-jpeg"></a></p>
<h4>Step 2: Converting HEIF and HEIC Files to JPEG</h4>
<p>To convert HEIF and HEIC files to JPEG, we can use the Pillow library's <code>Image</code> module. The <code>Image</code> module provides several methods for opening and saving images in different formats, including <code>JPEG</code>, <code>PNG</code>, and <code>BMP</code>.</p>
<p>Here is a Python code example that shows how to convert a single HEIF or HEIC file to JPEG:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">pillow_heif</span> <span class="kn">import</span> <span class="n">register_heif_opener</span>
<span class="c1"># Register the HEIF/HEIC decoder with Pillow</span>
<span class="n">register_heif_opener</span><span class="p">()</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'example.heic'</span><span class="p">)</span>
<span class="c1"># Convert to JPEG</span>
<span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'RGB'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.jpg'</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we import the <code>Image</code> module from Pillow and register the HEIF/HEIC decoder provided by <code>pillow-heif</code>. We then use the <code>open()</code> method to open the HEIF or HEIC file, the <code>convert()</code> method to convert the image to the RGB color space (which is required for saving in JPEG format), and finally the <code>save()</code> method to save the converted image as a JPEG file.</p>
<p>Note that converting HEIF and HEIC files to JPEG requires moving to the RGB color space, which can lose some of the advanced features of HEIF and HEIC, such as support for high dynamic range (HDR) and wide color gamut (WCG).</p>
<p>If you want to convert multiple HEIF or HEIC files to JPEG, you can use a for loop to iterate over a list of file names:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">pillow_heif</span> <span class="kn">import</span> <span class="n">register_heif_opener</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Register the HEIF/HEIC decoder with Pillow</span>
<span class="n">register_heif_opener</span><span class="p">()</span>
<span class="c1"># Get list of HEIF and HEIC files in directory</span>
<span class="n">directory</span> <span class="o">=</span> <span class="s1">'/path/to/directory'</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heic'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heif'</span><span class="p">)]</span>
<span class="c1"># Convert each file to JPEG</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'RGB'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="s1">'.jpg'</span><span class="p">))</span>
</code></pre></div>
<p>In the code above, we use the <code>os</code> library to get a list of HEIF and HEIC files in a directory. We then use a for loop to iterate over the list of file names, open each file using the <code>Image</code> module, convert it to RGB color space, and save it as a JPEG file with the same name as the original file.</p>
<p><a id="step-3-converting-heif-and-heic-files-to-other-formats"></a></p>
<h4>Step 3: Converting HEIF and HEIC Files to Other Formats</h4>
<p>In addition to converting HEIF and HEIC files to JPEG, we can also convert them to other popular formats like PNG and BMP using the Pillow library. Here is an example that shows how to convert a HEIF or HEIC file to PNG:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'example.heic'</span><span class="p">)</span>
<span class="c1"># Convert to PNG</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.png'</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we use the <code>save()</code> method to save the opened HEIF or HEIC file as a PNG file; Pillow infers the output format from the <code>.png</code> extension. Similarly, we can convert HEIF and HEIC files to BMP by using a <code>.bmp</code> extension (or by passing <code>format='BMP'</code> to <code>save()</code>):</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'example.heic'</span><span class="p">)</span>
<span class="c1"># Convert to BMP </span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.bmp'</span><span class="p">)</span>
</code></pre></div>
<p><a id="step-4-converting-heif-and-heic-files-in-bulk-to-jpeg"></a></p>
<h4>Step 4: Converting HEIF and HEIC Files in Bulk to JPEG</h4>
<p>If you have a large number of HEIF and HEIC files that you need to convert to JPEG, you can use the following Python script:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">pillow_heif</span> <span class="kn">import</span> <span class="n">register_heif_opener</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Register the HEIF/HEIC decoder with Pillow</span>
<span class="n">register_heif_opener</span><span class="p">()</span>
<span class="c1"># Get list of HEIF and HEIC files in directory</span>
<span class="n">directory</span> <span class="o">=</span> <span class="s1">'/path/to/directory'</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heic'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heif'</span><span class="p">)]</span>
<span class="c1"># Create output directory if it does not exist</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">)):</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">))</span>
<span class="c1"># Convert each file to JPEG</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'RGB'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="s1">'.jpg'</span><span class="p">))</span>
</code></pre></div>
<p>In the code above, we use the <code>os</code> library to get a list of HEIF and HEIC files in a directory. We then create an output directory if it does not already exist. Finally, we use a for loop to iterate over the list of file names, open each file using the <code>Image</code> module, convert it to RGB color space, and save it as a JPEG file in the output directory with the same name as the original file.</p>
<p><a id="use-pyheif-library"></a></p>
<h3>Use pyheif library</h3>
<p>Here is an example of how to use the <code>pyheif</code> library to convert HEIF and HEIC files to JPEG:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pyheif</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">heif_file</span> <span class="o">=</span> <span class="n">pyheif</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"example.heic"</span><span class="p">)</span>
<span class="c1"># Extract the image data</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">frombytes</span><span class="p">(</span><span class="n">heif_file</span><span class="o">.</span><span class="n">mode</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># Save as JPEG</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.jpg'</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we use the <code>pyheif</code> library to read in the HEIF or HEIC file, then use the <code>frombytes()</code> method of the <code>PIL.Image</code> module to create a PIL image object from the extracted image data. Finally, we use the <code>save()</code> method to save the image as a JPEG file.</p>
<p>To convert multiple HEIF and HEIC files in bulk using <code>pyheif</code>, you can use the following code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pyheif</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Get list of HEIF and HEIC files in directory</span>
<span class="n">directory</span> <span class="o">=</span> <span class="s1">'/path/to/directory'</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heic'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heif'</span><span class="p">)]</span>
<span class="c1"># Create output directory if it does not exist</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">)):</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">))</span>
<span class="c1"># Convert each file to JPEG</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">heif_file</span> <span class="o">=</span> <span class="n">pyheif</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">frombytes</span><span class="p">(</span><span class="n">heif_file</span><span class="o">.</span><span class="n">mode</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="s1">'.jpg'</span><span class="p">))</span>
</code></pre></div>
<p>In this code, we use the same approach to get a list of HEIF and HEIC files in a directory and create an output directory if it does not already exist. Then, we use a for loop to iterate over the list of file names, read in each HEIF or HEIC file using <code>pyheif</code>, create a PIL image object from the extracted image data, and save it as a JPEG file in the output directory with the same name as the original file.</p>
<p>Using the <code>pyheif</code> library to convert HEIF and HEIC files to JPEG is a simple and effective way to handle image file format conversions in Python.</p>
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>In this blog post, we explored how to convert HEIF and HEIC files to JPEG and other popular image formats using Python and the Pillow and pyheif libraries. We covered how to convert a single file as well as multiple files in bulk. With this knowledge, you can easily convert HEIF and HEIC files to more widely supported formats, enabling you to use them on any device or software application that supports images.</p>
<p>X::<a href="https://www.safjan.com/heif-and-heic-format-for-images-and-video/">Smaller Files, Better Quality - The Advantages of HEIF and HEIC</a></p>Explaining AI - The Key Differences Between LIME and SHAP Methods2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/explaining-ai-the-key-differences-between-lime-and-shap-methods/<p>When it comes to explainable AI, LIME and SHAP are two popular methods for providing insights into the decisions made by machine learning models. What are the key differences between these methods? In this article, we will help you understand which method may be best for your specific use case.</p><p>LIME and SHAP are both popular methods for explainable AI (XAI), but they differ in several key ways.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#model-agnostic-vs-model-specific">Model-agnostic vs model-specific</a></li>
<li><a href="#local-vs-global-explanations">Local vs global explanations</a></li>
<li><a href="#kernel-based-vs-game-theoretic-approach">Kernel-based vs game-theoretic approach</a></li>
<li><a href="#interpretability-vs-accuracy-trade-off">Interpretability vs accuracy trade-off</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="model-agnostic-vs-model-specific"></a></p>
<h2>Model-agnostic vs model-specific</h2>
<p>One of the main differences between LIME and SHAP is that LIME is model-agnostic by design: it can be used to explain the decisions of any machine learning model, regardless of the algorithm used. SHAP does include a model-agnostic Kernel explainer, but in practice it is most often used through its model-specific explainers, in particular TreeSHAP, which is optimized for tree-based models such as decision trees, random forests, and gradient boosting machines.</p>
<p><a id="local-vs-global-explanations"></a></p>
<h2>Local vs global explanations</h2>
<p>Another key difference between LIME and SHAP is the type of explanation they emphasize. LIME generates local explanations, meaning it explains the decision of a complex model for a specific instance or observation. SHAP also attributes each individual prediction to its features, but because Shapley values are additive, they can be aggregated across instances, so SHAP is commonly used to explain the overall, global behavior of the model as well.</p>
<p><a id="kernel-based-vs-game-theoretic-approach"></a></p>
<h2>Kernel-based vs game-theoretic approach</h2>
<p>LIME uses a kernel-based approach to explain the decisions of a complex model. It creates a local, interpretable model that approximates the behavior of the complex model around a specific instance. In contrast, SHAP uses a game-theoretic approach to explain the contribution of each feature to the final prediction. It assigns a "credit" score to each feature based on how much it contributes to the prediction.</p>
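<p>The game-theoretic idea behind SHAP can be illustrated with an exact Shapley-value computation on a toy two-feature model (everything below is invented for illustration; real SHAP implementations approximate or shortcut this computation far more efficiently):</p>

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: each feature's average marginal contribution
    to the payoff, weighted over all coalitions of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# Toy additive "model": prediction = 2*x1 + 1*x2, evaluated at x1 = x2 = 1,
# so the feature credits should come out as 2.0 and 1.0
def value(coalition):
    return 2.0 * ("x1" in coalition) + 1.0 * ("x2" in coalition)

print(shapley_values(["x1", "x2"], value))  # {'x1': 2.0, 'x2': 1.0}
```

<p>Libraries such as <code>shap</code> compute these attributions efficiently; TreeSHAP, for example, exploits the tree structure instead of enumerating coalitions.</p>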
<p><a id="interpretability-vs-accuracy-trade-off"></a></p>
<h2>Interpretability vs accuracy trade-off</h2>
<p>Finally, LIME and SHAP differ in their approach to the interpretability vs accuracy trade-off. LIME sacrifices some accuracy in order to provide more interpretable explanations. It creates a simpler model that may not be as accurate as the complex model, but is easier to understand. In contrast, SHAP aims to provide accurate explanations without sacrificing model accuracy. It uses a more sophisticated approach to explain the contribution of each feature, but this can be more difficult to understand for non-experts.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>LIME and SHAP are both useful methods for XAI, but they differ in their approach to explaining the decisions of complex machine learning models. LIME is model-agnostic and provides local, interpretable explanations, while SHAP uses a game-theoretic approach whose local explanations can also be aggregated into global ones, with efficient model-specific variants such as TreeSHAP for tree-based models. The choice between LIME and SHAP depends on the specific needs of the user and the characteristics of the machine learning model being explained.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Smaller Files, Better Quality - The Advantages of HEIF and HEIC2023-04-14T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/heif-and-heic-format-for-images-and-video/<h2>Overview</h2>
<p>High Efficiency Image Format (HEIF) and High Efficiency Video Coding (HEVC) Image Format (HEIC) are the two latest image file formats introduced by the Moving Picture Experts Group (MPEG). These formats are designed to improve image quality while reducing file size, which is particularly important for mobile devices with limited storage capacity.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#technical-details-of-heif-and-heic">Technical Details of HEIF and HEIC</a></li>
<li><a href="#advantages-of-heif-and-heic">Advantages of HEIF and HEIC</a><ul>
<li><a href="#smaller-file-sizes">Smaller File Sizes</a></li>
<li><a href="#better-image-quality">Better Image Quality</a></li>
<li><a href="#support-for-advanced-features">Support for Advanced Features</a></li>
<li><a href="#compatibility">Compatibility</a></li>
<li><a href="#future-proofing">Future-Proofing</a></li>
</ul>
</li>
</ul>
<!-- /MarkdownTOC -->
<p>HEIF is a container format that can store a variety of image data, including single images, image sequences, and image collections. HEIC, on the other hand, is a specific implementation of HEIF that is used for still images.</p>
<p>HEIF was standardized in 2015 as part of the MPEG-H suite, which also includes the HEVC video coding standard. HEVC is a video compression standard that is designed to provide higher quality video at lower bit rates than previous standards such as H.264. HEIF and HEIC take advantage of the HEVC coding algorithms to provide better image quality at lower file sizes.</p>
<p>One of the key advantages of HEIF and HEIC is their support for advanced image features such as high dynamic range (HDR) and wide color gamut (WCG). HDR images have a greater range of brightness and color than standard images, which can make them look more lifelike. WCG images have a wider range of colors than standard images, which can make them look more vibrant and vivid.</p>
<p>HEIF and HEIC also support multiple images and image sequences in a single file, which can make it easier to manage and share collections of images. This is particularly useful for applications such as live photos, which combine still images and short videos into a single file.</p>
<p><a id="technical-details-of-heif-and-heic"></a></p>
<h2>Technical Details of HEIF and HEIC</h2>
<p>HEIF and HEIC use a container format that is based on the ISO Base Media File Format (ISOBMFF). This format is similar to other container formats such as MP4 and MOV, and it provides a flexible and extensible framework for storing media data.</p>
<p>HEIF and HEIC use a compression algorithm called High Efficiency Video Coding (HEVC), which is also known as H.265. HEVC is a video compression standard that was developed by the Joint Collaborative Team on Video Coding (JCT-VC) and published as the ITU-T H.265 standard. HEVC is designed to provide better compression than previous standards such as H.264, which can lead to smaller file sizes and better image quality.</p>
<p>HEVC achieves better compression by using advanced techniques such as intra prediction, inter prediction, and entropy coding. Intra prediction is used to predict pixels within a single image frame, while inter prediction is used to predict pixels between different frames in a video sequence. Entropy coding is used to further compress the data by removing redundancy and optimizing the data for compression.</p>
<p>HEIF and HEIC also support a variety of image features such as alpha channels, depth maps, and image sequences. Alpha channels are used to store transparency information for images, while depth maps are used to store information about the distance of objects in a scene. Image sequences are used to store multiple images in a single file, which can be useful for applications such as burst mode photography and time-lapse photography.</p>
<p>HEIF and HEIC also support a variety of metadata formats, including Exif, IPTC, and XMP. Exif is a standard format for storing metadata such as camera settings and location information, while IPTC is a standard format for storing news and media metadata. XMP is a metadata format that is used by Adobe products such as Photoshop and Lightroom.</p>
<p><a id="advantages-of-heif-and-heic"></a></p>
<h2>Advantages of HEIF and HEIC</h2>
<p>HEIF and HEIC offer a number of advantages over previous image formats such as JPEG and PNG. Some of the key advantages include:</p>
<p><a id="smaller-file-sizes"></a></p>
<h3>Smaller File Sizes</h3>
<p>HEIF and HEIC can achieve smaller file sizes than previous formats, which can reduce storage requirements and improve download times. This is particularly important for mobile devices, which often have limited storage capacity.</p>
<p><a id="better-image-quality"></a></p>
<h3>Better Image Quality</h3>
<p>HEIF and HEIC can provide better image quality than previous formats, particularly for images with high dynamic range or wide color gamut. This can result in more realistic and vibrant images.</p>
<p><a id="support-for-advanced-features"></a></p>
<h3>Support for Advanced Features</h3>
<p>HEIF and HEIC support advanced features such as alpha channels, depth maps, and image sequences, which can provide greater flexibility and creativity for image processing and manipulation.</p>
<p><a id="compatibility"></a></p>
<h3>Compatibility</h3>
<p>Although HEIF and HEIC are relatively new formats, they are now widely supported by modern operating systems and devices. For example, iOS devices have supported HEIC since iOS 11, macOS has supported it since High Sierra, and Windows 10 and later can read HEIF and HEIC through the HEIF Image Extensions.</p>
<p><a id="future-proofing"></a></p>
<h3>Future-Proofing</h3>
<p>HEIF and HEIC are designed to be flexible and extensible, which means they can support future enhancements and improvements to image processing and storage. This can help ensure that images stored in HEIF and HEIC formats remain compatible and accessible in the future.</p>
<h2>Conclusion</h2>
<p>HEIF and HEIC are the latest image file formats designed to provide better image quality and smaller file sizes than previous formats. They are based on the HEVC video compression standard and use advanced techniques such as intra prediction, inter prediction, and entropy coding to achieve better compression and image quality. HEIF and HEIC also support advanced image features such as high dynamic range and wide color gamut, as well as metadata formats such as Exif, IPTC, and XMP. Although HEIF and HEIC are relatively new formats, they are now widely supported by modern operating systems and devices, and offer a number of advantages over previous formats.</p>
<p>X::<a href="https://www.safjan.com/convert-heic-and-heif-to-jpg-png-bmp-with-python/">Convert HEIC and HEIF to Jpg, Png, BMP With Python</a></p>LIME - Understanding How This Method for Explainable AI Works2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/how-the-lime-method-for-explainable-ai-works/<p>Discover how the LIME method can help you understand the important factors behind your model's predictions in a simple, intuitive way.</p><p>Artificial intelligence (AI) has revolutionized the way we live and work, but it can sometimes be difficult to understand how AI algorithms reach their decisions. This is where explainable AI (XAI) comes in. XAI is the process of making AI models transparent and understandable to humans. One popular XAI method is Local Interpretable Model-Agnostic Explanations (LIME). In this blog post, we will explore how LIME works and why it is an important tool for XAI.</p>
<h2>The need for Explainable AI</h2>
<p>One of the main criticisms of AI is its "black box" nature. Many AI models, such as deep neural networks, are complex and difficult to interpret. When these models are used in high-stakes applications like healthcare or finance, it is important to understand how the AI arrived at its decision. This is where XAI comes in. XAI provides a framework for understanding how an AI model makes decisions, increasing trust and accountability.</p>
<h2>LIME: A Local Interpretable Model-Agnostic Explanation Method</h2>
<p>LIME is a popular XAI method that provides local, interpretable explanations for individual predictions made by any machine learning model. LIME was introduced in 2016 in the paper <a href="https://arxiv.org/abs/1602.04938">"Why Should I Trust You?": Explaining the Predictions of Any Classifier</a> by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, and has since become a widely used method for XAI.</p>
<p>LIME works by creating a simpler, interpretable model that approximates the behavior of the complex model. The simpler model is trained on local data points, and the resulting model is used to explain the decision of the complex model. The process involves the following steps:</p>
<ol>
<li>Selecting an instance to explain</li>
<li>Perturbing the instance to create a dataset of similar instances</li>
<li>Weighting the similar instances based on their similarity to the instance to explain</li>
<li>Training a local, interpretable model on the weighted dataset</li>
<li>Using the local model to generate explanations for the complex model's decision</li>
</ol>
<p>Let's explore each of these steps in more detail.</p>
<h3>Step 1: Selecting an instance to explain</h3>
<p>The first step in the LIME process is selecting an instance to explain, i.e. a single data point whose prediction we want to understand. For example, if we are working with a healthcare AI model, we may want to explain the decision to recommend a certain treatment for a specific patient.</p>
<h3>Step 2: Perturbing the instance to create a dataset of similar instances</h3>
<p>Once we have selected the instance to explain, we perturb it to create a dataset of similar instances. This involves making small changes to the instance (for example, slightly altering feature values or removing words from a text) and querying the complex model for a prediction on each perturbed instance. The purpose of this step is to create a diverse set of labeled instances that are similar to the instance we want to explain.</p>
<h3>Step 3: Weighting the similar instances based on their similarity to the instance to explain</h3>
<p>After we have created a dataset of similar instances, we need to weight them based on their similarity to the instance we want to explain. This is done using a kernel function, which assigns a weight to each instance based on its distance to the instance to explain. The kernel function can be any function that measures similarity, such as the Gaussian kernel.</p>
<h3>Step 4: Training a local, interpretable model on the weighted dataset</h3>
<p>Now that we have a weighted dataset, we can train a local, interpretable model on it. The purpose of this model is to approximate the behavior of the complex model in the local region around the instance we want to explain. The local model should be simple enough to be easily interpretable, but accurate enough to capture the important features of the complex model.</p>
<p>The choice of local model depends on the problem domain and the complexity of the complex model. Some common choices include linear models, decision trees, and rule-based models.</p>
<h3>Step 5: Using the local model to generate explanations for the complex model's decision</h3>
<p>Once we have trained the local model, we can use it to generate explanations for the complex model's decision. This is done by analyzing the coefficients of the local model and identifying the features that contributed the most to the prediction. These features can be presented to the user as a list of important factors that influenced the decision.</p>
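<p>The five steps above can be sketched from scratch in a few lines of Python. This is a minimal illustration rather than the actual <code>lime</code> package: the black-box function, the perturbation scale, and the kernel width are all invented for the example.</p>

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical stand-in for the complex model: only features 0 and 2 matter.
def black_box(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 2]

# Step 1: the instance whose prediction we want to explain
x = np.array([1.0, 1.0, 1.0, 1.0])

# Step 2: perturb the instance and query the complex model for labels
Z = x + rng.normal(scale=0.5, size=(500, x.size))
y = black_box(Z)

# Step 3: weight perturbed samples with a Gaussian kernel on distance to x
dist = np.linalg.norm(Z - x, axis=1)
weights = np.exp(-dist**2 / (2 * 0.75**2))

# Step 4: train a local, interpretable (linear) surrogate on the weighted data
surrogate = Ridge(alpha=1.0)
surrogate.fit(Z, y, sample_weight=weights)

# Step 5: the surrogate's coefficients explain the local decision
print(surrogate.coef_)
```

<p>The recovered coefficients are close to the true local effects (roughly 3 for feature 0, -2 for feature 2, near zero elsewhere), which is exactly the kind of explanation LIME presents to the user.</p>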
<h2>Advantages of LIME</h2>
<p>LIME has several advantages over other XAI methods. One of the main advantages is its model-agnostic nature. LIME can be used to explain the decisions of any machine learning model, regardless of its complexity or the algorithm used. This makes it a versatile tool for XAI.</p>
<p>Another advantage of LIME is its ability to generate local explanations. By creating a local model that approximates the behavior of the complex model, LIME is able to generate explanations that are tailored to specific instances. This can be useful in situations where the explanation for a decision needs to be customized for a particular user or context.</p>
<h2>Limitations of LIME</h2>
<p>Despite its many advantages, LIME also has some limitations. One of the main limitations is the need for human input in the kernel function. The choice of kernel function and its parameters can have a significant impact on the explanations generated by LIME. This means that the user needs to have some domain knowledge and expertise in selecting an appropriate kernel function.</p>
<p>Another limitation of LIME is its sensitivity to perturbations. LIME works by perturbing the instance to create a dataset of similar instances. However, small changes to the instance can result in significantly different explanations. This means that the explanations generated by LIME may not always be robust to changes in the input.</p>
<h2>Conclusion</h2>
<p>LIME is a powerful tool for XAI that provides local, interpretable explanations for individual predictions made by any machine learning model. By creating a simpler, interpretable model that approximates the behavior of the complex model, LIME is able to generate explanations that are tailored to specific instances. However, LIME also has some limitations, such as its sensitivity to perturbations and the need for human input in the kernel function. Despite these limitations, LIME remains an important tool for XAI and is widely used in industry and academia.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>SHAP - Understanding How This Method for Explainable AI Works2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/how-the-shap-method-for-explainable-ai-works/<p>Discover how the SHAP method can help you understand the important factors behind your model's predictions in a simple, intuitive way.</p><p>As a data scientist, one of the biggest challenges in deploying machine learning models is explaining how the model makes its decisions. The need for explainability is not only important for legal and ethical reasons, but it also helps in building trust in the model and making informed decisions. The <strong>SHapley Additive exPlanations</strong> (SHAP) method is a powerful technique that provides a unified framework for interpreting any model. In this blog post, I will explain the SHAP method, its mathematical foundation, and how it can be applied to interpret machine learning models.</p>
<h2>What is SHAP?</h2>
<p>The SHAP method is a game-theoretic approach to explain the output of any machine learning model. It is based on the concept of Shapley values. Here is a timeline for the SHAP method:</p>
<ul>
<li>1953: Introduction of Shapley values by Lloyd Shapley for game theory</li>
<li>2010: First use of Shapley values for explaining machine learning predictions by Strumbelj and Kononenko </li>
<li>2017: SHAP paper + Python package by Lundberg</li>
</ul>
<p>Shapley values were introduced by Lloyd Shapley in 1953 to fairly distribute the gains of a cooperative game among its players. In the context of machine learning, the players are the input features, and the gain is the difference between the actual output of the model and the expected output. The SHAP method provides a way to calculate the Shapley values for each input feature, which gives us a measure of the contribution of each feature towards the model output.</p>
<p>The SHapley Additive exPlanations (SHAP) method we are using today was introduced in a paper titled "A Unified Approach to Interpreting Model Predictions" by Scott Lundberg and Su-In Lee, published in the Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017). The paper is available on the arXiv preprint server at <a href="https://arxiv.org/abs/1705.07874">https://arxiv.org/abs/1705.07874</a>.</p>
<h2>How does SHAP work?</h2>
<p>The SHAP method works by computing the Shapley values for each feature in the input space. The Shapley value for feature i, denoted by <span class="math">\(\phi_i\)</span>, is defined as the average contribution of the feature i across all possible coalitions of features. Mathematically, the Shapley value can be expressed as follows:</p>
<div class="math">$$\phi_i(f,S) = \sum_{T \subseteq S \setminus \{i\}}\frac{|T|!(|S|-|T|-1)!}{|S|!}(f(T \cup \{i\}) - f(T))$$</div>
<p>where <span class="math">\(X\)</span> is the set of all input features, <span class="math">\(S\)</span> is a coalition of features that does not include feature <span class="math">\(i\)</span>, <span class="math">\(|S|\)</span> is the size of the coalition, and <span class="math">\(f(S\cup{i})\)</span> is the output of the model when the features in <span class="math">\(S\)</span> and <span class="math">\(i\)</span> are present. The term <span class="math">\(f(S)\)</span> is the output of the model when only the features in <span class="math">\(S\)</span> are present. The Shapley value represents the average marginal contribution of feature <span class="math">\(i\)</span> over all possible coalitions.</p>
<p>To compute the Shapley values using the above formula, we need to evaluate the model output for all possible coalitions of features, which is computationally infeasible for most machine learning models. The SHAP method therefore estimates the Shapley values from a sample of coalitions: KernelSHAP weights each sampled coalition with the Shapley kernel and solves a weighted linear regression, while TreeSHAP exploits the structure of tree-based models to compute the values exactly and efficiently.</p>
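<p>Before turning to approximations, it is instructive to evaluate the formula exactly for a model with only a handful of features. The sketch below (with a toy model and invented feature values) enumerates every coalition and verifies the efficiency property, i.e. that the Shapley values sum to the difference between the model output at the instance and at the baseline.</p>

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n_features, x, baseline):
    # Exact Shapley values by enumerating all coalitions;
    # feasible only when n_features is small (2^n model evaluations).
    N = list(range(n_features))

    def f_masked(S):
        # features in S take their true value, the rest take the baseline value
        z = [x[j] if j in S else baseline[j] for j in N]
        return f(z)

    phi = []
    for i in N:
        others = [j for j in N if j != i]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                total += weight * (f_masked(set(S) | {i}) - f_masked(set(S)))
        phi.append(total)
    return phi

# Toy model and values (invented for illustration): f(z) = z0 * z1 + z2
model = lambda z: z[0] * z[1] + z[2]
phi = shapley_values(model, 3, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # the attributions sum to f(x) - f(baseline)
```

<p>Note how features 0 and 1 receive equal credit for their joint interaction term, another fairness property of the Shapley formulation.</p>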
<h2>How to apply SHAP?</h2>
<p>To apply the SHAP method, we need to first compute the Shapley values for each feature in the input space. This can be done using the open-source <code>shap</code> Python package, which integrates with models from popular machine learning libraries such as scikit-learn, XGBoost, and TensorFlow. Once we have the Shapley values, we can visualize them using various techniques to gain insights into the model's decision-making process.</p>
<p>One popular technique to visualize the Shapley values is the Shapley value (beeswarm) plot, which shows the contribution of each feature towards the model output for each individual data point. The plot has a horizontal axis representing the feature contribution and a vertical axis listing the features. Each data point is drawn as a dot whose horizontal position is the Shapley value for the corresponding feature. The color of the dot represents the value of the feature, where red represents high feature values and blue represents low feature values. The plot helps in identifying the most important features for each data point and the direction of the relationship between the features and the output.</p>
<p>Another technique to visualize the Shapley values is the summary plot, which shows the average contribution of each feature across all data points. The plot consists of a horizontal axis representing the Shapley value and a vertical axis representing the features. Each feature is represented by a horizontal bar, where the length of the bar represents the magnitude of the average Shapley value. The color of the bar represents the direction of the relationship between the feature and the output, where red represents a positive relationship and blue represents a negative relationship.</p>
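<p>The data behind such a summary plot is simply the mean absolute Shapley value per feature. A minimal sketch, assuming a small, invented matrix of Shapley values:</p>

```python
import numpy as np

# Invented matrix of Shapley values: rows = instances, columns = features
shap_values = np.array([[ 0.8, -0.1,  0.3],
                        [ 1.2,  0.0, -0.4],
                        [-0.5, -0.2,  0.6]])

# The summary plot ranks features by mean absolute Shapley value
mean_abs = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(mean_abs)[::-1]   # most important feature first
print(mean_abs, ranking)
```

<p>In this example feature 0 dominates, followed by feature 2, which is the ordering a summary plot would display from top to bottom.</p>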
<p>In addition to visualizing the Shapley values, the SHAP method can also be used to identify instances where the model makes biased or unfair decisions. The method can be used to quantify the extent to which each feature contributes to the model's bias towards a certain group or class. This helps in identifying the root cause of the bias and taking corrective measures to ensure fairness and equity in the model's decisions.</p>
<h2>Conclusion</h2>
<p>The SHapley Additive exPlanations (SHAP) method provides a powerful framework for interpreting any machine learning model. The method is based on the concept of Shapley values, which provides a fair way to distribute the gain of a cooperative game among its players. The SHAP method provides an efficient way to compute the Shapley values for each feature in the input space, which gives us a measure of the contribution of each feature towards the model output. The method can be applied to visualize the Shapley values, identify the most important features, and quantify the model's bias towards certain groups or classes. By providing a unified framework for interpretability, the SHAP method helps in building trust in the model and making informed decisions.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<ul>
<li>X::<a href="https://www.safjan.com/how-the-lime-method-for-explainable-ai-works/">LIME - Understanding How This Method for Explainable AI Works</a></li>
<li>X::<a href="https://www.safjan.com/lime-tutorial/">LIME Tutorial</a></li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>KernelShap and TreeShap - Two Most Popular Variations of the SHAP Method2023-04-14T00:00:00+02:002023-11-05T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-04-14:/kernelshap-treeshap-two-most-popular-variations-of-the-shap-method/<p>Making sense of AI's inner workings with KernelShap and TreeShap, powerful tools for responsible AI.</p><h2>TLDR</h2>
<p>The <strong>original SHAP does not scale well</strong> with high-dimensional data due to the exponential complexity of exact Shapley value calculation. KernelSHAP and TreeSHAP are two specific implementations of the SHAP method, developed to address the shortcomings of the original framework and to optimize it for different types of machine learning models, but they achieve this in different ways.
<strong>KernelSHAP</strong> uses a model-agnostic method to interpret the impact of features in a model. This means it can provide explanations <strong>for any model</strong>, but <strong>at the cost of computational efficiency</strong>, making it <strong>less suitable for complex</strong>, high-dimensional situations or when real-time explanations are needed. In contrast, <strong>TreeSHAP</strong> is designed specifically <strong>for tree-based models</strong> (such as decision trees, random forests, and gradient boosting machines). It is <strong>computationally efficient</strong>, exploits the tree structure for <strong>faster calculations</strong>, and can therefore handle <strong>more complex scenarios</strong>. Moreover, TreeSHAP guarantees consistency, a helpful property for feature attribution methods which ensures that if a model relies more on a feature, the attributed importance of that feature does not decrease. However, it cannot be used for non-tree models.</p>
<h2>Introduction</h2>
<p>Responsible AI is an approach to artificial intelligence that ensures fairness, transparency, and accountability in the development, deployment, and management of AI systems. In the era of increasing reliance on AI-driven decision-making, understanding and explaining the predictions made by these models is essential. The interpretability of AI models helps build trust, enables better decision-making, and allows us to mitigate biases.</p>
<p>Two popular methods for explaining AI models are KernelShap and TreeShap. These techniques are part of the SHAP (SHapley Additive exPlanations) family, which is based on cooperative game theory. In this blog post, we will delve into the details of KernelShap and TreeShap, exploring their underlying principles, advantages, and use cases.</p>
<blockquote>
<p><strong>SHAP (SHapley Additive exPlanations)</strong> is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (see <a href="https://github.com/shap/shap#citations">papers</a> for details and citations).
<em>(from <a href="https://shap.readthedocs.io/en/latest/index.html">SHAP documentation</a>)</em></p>
</blockquote>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#kernelshap">KernelShap</a><ul>
<li><a href="#steps">Steps</a></li>
<li><a href="#kernelshap-advantages-and-limitations">KernelShap advantages and limitations</a></li>
</ul>
</li>
<li><a href="#treeshap">TreeShap</a><ul>
<li><a href="#steps-1">Steps</a></li>
<li><a href="#treeshap-advantages-and-limitations">TreeShap advantages and limitations</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="kernelshap"></a></p>
<h2>KernelShap</h2>
<p>KernelShap is a model-agnostic explanation method that provides interpretable explanations for any black-box model. It uses the concept of Shapley values from cooperative game theory to attribute feature importance to individual features in the context of a specific prediction.</p>
<p>The Shapley value for feature <span class="math">\(i\)</span> in a model <span class="math">\(f\)</span> can be calculated using the following formula:</p>
<div class="math">$$ϕ_i(f) = \sum_{S ⊆ N \setminus {i}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [f(S ∪ {i}) - f(S)]$$</div>
<p>Here, <span class="math">\(S\)</span> is a subset of features excluding <span class="math">\(i\)</span>, and <span class="math">\(N\)</span> is the total number of features. The term <span class="math">\(|S|!\)</span> represents the factorial of the number of features in subset <span class="math">\(S\)</span>, while <span class="math">\(|N|-|S|-1!\)</span> represents the factorial of the remaining features outside of the subset. The denominator <span class="math">\(|N|!\)</span> is the factorial of the total number of features.</p>
<p>Shapley values, in the context of AI, are used to distribute the contribution of each feature to the final prediction. It ensures that the contribution of each feature is fairly allocated in a way that is efficient, symmetric, and additive.</p>
<p>KernelShap approximates the Shapley values by solving a weighted linear regression problem. It samples instances from the feature space and estimates the Shapley values using the Lasso regression model. The Lasso model is a linear model with an L1 penalty term, which helps in feature selection and makes the explanation sparse.</p>
<p>Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression technique that includes an L1 penalty term to shrink the coefficients of less important features towards zero. This allows for both regularization and feature selection, resulting in a more interpretable and parsimonious model.</p>
<p>The equation for Lasso regression is given by:</p>
<div class="math">$$L(\beta) = \sum_{i=1}^{n}(y_i - X_i\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$</div>
<p>In this equation:</p>
<ul>
<li><span class="math">\(L(\beta)\)</span> represents the objective function to be minimized,</li>
<li><span class="math">\(y_i\)</span> is the actual response (outcome) for the <span class="math">\(i^{th}\)</span> observation,</li>
<li><span class="math">\(X_i\)</span> is the feature vector for the <span class="math">\(i^{th}\)</span> observation,</li>
<li><span class="math">\(\beta\)</span> is the vector of coefficients to be estimated,</li>
<li><span class="math">\(n\)</span> is the total number of observations,</li>
<li><span class="math">\(p\)</span> is the total number of features,</li>
<li><span class="math">\(\lambda\)</span> is a non-negative regularization parameter, and</li>
<li><span class="math">\(|\beta_j|\)</span> is the absolute value of the <span class="math">\(j^{th}\)</span> coefficient.</li>
</ul>
<p>The first term, <span class="math">\(\sum_{i=1}^{n}(y_i - X_i\beta)^2\)</span>, is the sum of squared residuals, which represents the difference between the actual and predicted responses. Minimizing this term alone would result in an ordinary least squares regression.</p>
<p>The second term, <span class="math">\(\lambda\sum_{j=1}^{p}|\beta_j|\)</span>, is the L1 penalty term that adds the absolute values of the coefficients multiplied by the regularization parameter <span class="math">\(\lambda\)</span>. By increasing <span class="math">\(\lambda\)</span>, the penalty term forces some coefficients to be exactly zero, effectively selecting a subset of features for the final model. The optimal value of <span class="math">\(\lambda\)</span> is usually determined through cross-validation.</p>
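<p>The shrinkage and selection behavior described above can be seen directly in a small scikit-learn experiment. The data and coefficient values below are illustrative only; note also that scikit-learn's <code>Lasso</code> scales the squared-error term by <span class="math">\(1/(2n)\)</span>, so its <code>alpha</code> plays the role of <span class="math">\(\lambda\)</span> up to that constant:</p>

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# 100 samples, 5 features, but only the first two actually drive y.
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Increasing alpha (the role of lambda in the equation above) shrinks
# coefficients toward zero; large enough alpha sets the irrelevant
# ones exactly to zero, performing feature selection.
for alpha in (0.01, 0.5):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: {np.round(coef, 2)}")
```

<p>With the larger penalty, the three irrelevant coefficients are driven exactly to zero while the two informative ones survive in shrunken form.</p>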
<p><a id="steps"></a></p>
<h3>Steps</h3>
<p>The KernelShap algorithm involves the following steps:</p>
<ol>
<li>Generate a dataset of binary-masked instances by randomly selecting feature combinations.</li>
<li>Compute the output of the black-box model for each instance.</li>
<li>Fit a weighted linear regression model on the generated dataset, where the weights are determined by the similarity between the instance and the instance of interest.</li>
<li>The coefficients of the linear regression model represent the approximate Shapley values.</li>
</ol>
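<p>The four steps above can be sketched end-to-end with NumPy and scikit-learn. This is a simplified illustration rather than the production SHAP implementation: the toy linear model, the sample counts, and the use of plain weighted least squares (instead of Lasso) are all choices made for brevity:</p>

```python
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# A toy black-box model: any callable mapping feature rows to scalars works.
def model(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2]

background = rng.normal(size=(100, 3))  # reference data for "absent" features
x = np.array([1.0, 2.0, 3.0])           # instance to explain
n_features = x.shape[0]

# Step 1: sample binary masks marking which features are "present";
# drop the all-absent and all-present coalitions (infinite kernel weight).
masks = rng.integers(0, 2, size=(2000, n_features))
masks = masks[(masks.sum(axis=1) > 0) & (masks.sum(axis=1) < n_features)]

# Shapley kernel weight for a coalition of size s out of M features.
def kernel_weight(M, s):
    return (M - 1) / (comb(M, s) * s * (M - s))

weights = np.array([kernel_weight(n_features, int(m.sum())) for m in masks])

# Step 2: evaluate the model on each masked instance; absent features are
# replaced by background samples and the outputs averaged.
preds = np.empty(len(masks))
for i, m in enumerate(masks):
    samples = background.copy()
    samples[:, m == 1] = x[m == 1]
    preds[i] = model(samples).mean()

# Steps 3-4: fit a weighted linear regression on the masks; its
# coefficients approximate the Shapley values.
base = model(background).mean()
reg = LinearRegression(fit_intercept=False).fit(
    masks, preds - base, sample_weight=weights
)
phi = reg.coef_

# Efficiency: base value plus contributions recovers the prediction for x.
print(np.round(phi, 3), base + phi.sum(), model(x[None, :])[0])
```

<p>Because the toy model is linear, the recovered contributions are exact: each <code>phi[j]</code> equals the model coefficient times the feature's deviation from the background mean, and they sum (with the base value) to the prediction being explained.</p>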
<p><a id="kernelshap-advantages-and-limitations"></a></p>
<h3>KernelShap advantages and limitations</h3>
<p>KernelShap has several <strong>advantages</strong>:</p>
<ul>
<li>It can be applied to any black-box model, regardless of its architecture or training algorithm.</li>
<li>It provides a unified measure of feature importance that is consistent across different models.</li>
<li>It takes into account the interactions between features.</li>
</ul>
<p>However, KernelShap also has some <strong>limitations</strong>:</p>
<ul>
<li>It can be computationally expensive, especially for high-dimensional data or complex models.</li>
<li>It requires a large number of samples to provide accurate estimates of the Shapley values.</li>
</ul>
<p><a id="treeshap"></a></p>
<h2>TreeShap</h2>
<p>TreeShap is a model-specific explanation method designed for tree-based models, such as decision trees, random forests, and gradient boosting machines. Like KernelShap, it is based on Shapley values, but it exploits the structure of tree-based models to compute the values efficiently.</p>
<p>TreeShap <strong>computes the exact Shapley values for each feature by recursively traversing the decision tree</strong>, attributing contributions to each feature as it moves down the tree. It uses a dynamic programming approach to avoid redundant calculations and reduce the computational complexity.</p>
<p><a id="steps-1"></a></p>
<h3>Steps</h3>
<p>The TreeShap algorithm involves the following steps:</p>
<ol>
<li>Traverse the tree from the root to the leaf nodes, recording the decision path for the instance of interest.</li>
<li>Attribute contributions to each feature encountered along the path, taking into account the number of possible feature combinations and the probability of each combination.</li>
<li>Repeat the process for all trees in the ensemble, if applicable.</li>
<li>Average the contributions across all trees to obtain the final Shapley values.</li>
</ol>
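<p>TreeShap's recursive traversal is intricate to implement, but the quantity it computes efficiently can be checked against the brute-force Shapley definition, which is feasible here only because the iris data has just four features. The background sample, explained class, and forest size below are arbitrary choices for illustration:</p>

```python
import numpy as np
from itertools import combinations
from math import factorial
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

background = X[:50]   # reference data standing in for "absent" features
x = X[100]            # instance to explain
M = X.shape[1]

def f(data):
    # Model output to explain: probability of class 2, averaged over rows.
    return model.predict_proba(data)[:, 2].mean()

def value(S):
    # Value of coalition S: expected output with features in S fixed to x.
    data = background.copy()
    data[:, list(S)] = x[list(S)]
    return f(data)

# Brute-force Shapley values: average marginal contribution of each
# feature over every coalition of the remaining features.
phi = np.zeros(M)
for j in range(M):
    others = [k for k in range(M) if k != j]
    for r in range(M):
        for S in combinations(others, r):
            w = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
            phi[j] += w * (value(S + (j,)) - value(S))

# Efficiency: the contributions sum to f(x) minus the base value.
print(np.round(phi, 3), phi.sum())
```

<p>This enumeration costs <span class="math">\(O(2^M)\)</span> model evaluations per feature, which is exactly the blow-up TreeShap's dynamic programming over tree paths avoids.</p>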
<p><a id="treeshap-advantages-and-limitations"></a></p>
<h3>TreeShap advantages and limitations</h3>
<p>TreeShap has several advantages:</p>
<ul>
<li>It computes the exact Shapley values without the need for sampling or approximations.</li>
<li>It is computationally efficient due to its dynamic programming approach.</li>
<li>It is specifically designed for tree-based models, which are widely used in practice.</li>
</ul>
<p>However, TreeShap is limited to tree-based models and cannot be applied to other types of models, such as deep learning or support vector machines.
<a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>KernelShap and TreeShap are powerful methods for explaining AI models in the context of responsible AI. Both techniques leverage the concept of Shapley values to provide interpretable and fair attributions of feature importance. While KernelShap is a model-agnostic approach that can be applied to any black-box model, TreeShap is tailored for tree-based models and offers exact Shapley values with computational efficiency.</p>
<p>Understanding and implementing these methods is crucial for AI practitioners who aim to build transparent, accountable, and trustworthy AI systems. By providing insights into the inner workings of AI models, KernelShap and TreeShap enable developers to identify potential biases, improve the decision-making process, and ultimately foster trust in AI-driven technologies.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-11-05: Added TLDR section, minor edits</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>LIME Tutorial2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/lime-tutorial/<p>Unveiling the mysteries of AI decisions? Let us dive into LIME, the tool that sheds light on the black box.</p><p>In this tutorial, we'll be exploring how to use the LIME (Local Interpretable Model-Agnostic Explanations) library for explainable AI. We'll start by discussing what LIME is and why it's useful for explainable AI, and then we'll dive into the code.</p>
<h2>What is LIME?</h2>
<p>LIME is a library for explaining the predictions of machine learning models. It works by creating "local" surrogate models that approximate the behavior of the original model in the vicinity of a particular prediction. The idea behind LIME is that these surrogate models can be used to provide human-understandable explanations for how the original model arrived at its decision.</p>
<p>Why is LIME useful for explainable AI? There are a few reasons:</p>
<ol>
<li>
<p><strong>Transparency:</strong> LIME allows us to peek "under the hood" of a black box model and see how it's making its decisions.</p>
</li>
<li>
<p><strong>Trust:</strong> By providing human-understandable explanations, LIME can increase our trust in the model's decisions.</p>
</li>
<li>
<p><strong>Debugging:</strong> LIME can help us identify problems with our model by highlighting areas where the model is making incorrect or unexpected predictions.</p>
</li>
</ol>
<p>Now that we understand why LIME is useful, let's dive into the code.</p>
<h2>Selecting a Dataset</h2>
<p>For this tutorial, we'll be using the classic "Iris" dataset, which is a popular dataset for classification tasks. The Iris dataset consists of 150 samples, each with four features (sepal length, sepal width, petal length, and petal width), and each sample belongs to one of three classes (setosa, versicolor, or virginica). The goal is to build a machine learning model that can predict the class of a new sample based on its features.</p>
<p>To start, we'll load the Iris dataset using scikit-learn:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_iris</span>
<span class="n">iris</span> <span class="o">=</span> <span class="n">load_iris</span><span class="p">()</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">iris</span><span class="o">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">iris</span><span class="o">.</span><span class="n">target</span>
</code></pre></div>
<p>Next, we'll split the dataset into training and testing sets:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div>
<p>We'll use the training set to train our machine learning model, and the testing set to evaluate its performance.</p>
<h2>Training a Machine Learning Model</h2>
<p>For this tutorial, we'll use a random forest classifier as our machine learning model. The random forest algorithm is an ensemble learning method that combines multiple decision trees to improve the accuracy and robustness of the predictions.</p>
<p>We'll start by importing the necessary libraries and creating the classifier:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>
<span class="n">rfc</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div>
<p>We're using 100 decision trees in our random forest classifier, and setting the random state to 42 for reproducibility.</p>
<p>Next, we'll fit the classifier to the training data:</p>
<div class="highlight"><pre><span></span><code><span class="n">rfc</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>Finally, we'll evaluate the performance of the classifier on the testing data:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">rfc</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Accuracy: </span><span class="si">{</span><span class="n">accuracy</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>When we run this code, we should see an accuracy of around 0.97, which means our model is doing a pretty good job of predicting the class of new samples.</p>
<h2>Explaining Model Predictions with LIME</h2>
<p>Now that we have a trained machine learning model, we can start using LIME to explain its predictions.</p>
<p>First, we need to create an explainer object:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">lime</span>
<span class="kn">import</span> <span class="nn">lime.lime_tabular</span>
<span class="n">explainer</span> <span class="o">=</span> <span class="n">lime</span><span class="o">.</span><span class="n">lime_tabular</span><span class="o">.</span><span class="n">LimeTabularExplainer</span><span class="p">(</span>
<span class="n">X_train</span><span class="p">,</span>
<span class="n">feature_names</span><span class="o">=</span><span class="n">iris</span><span class="o">.</span><span class="n">feature_names</span><span class="p">,</span>
<span class="n">class_names</span><span class="o">=</span><span class="n">iris</span><span class="o">.</span><span class="n">target_names</span><span class="p">,</span>
<span class="n">discretize_continuous</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div>
<p>Here, we're creating a <code>LimeTabularExplainer</code> object and passing in the training data, feature names, class names, and setting <code>discretize_continuous</code> to <code>True</code> to discretize any continuous features.</p>
<p>Next, we'll pick a sample from the testing data that we want to explain:</p>
<div class="highlight"><pre><span></span><code><span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># index of the sample we want to explain</span>
<span class="n">exp</span> <span class="o">=</span> <span class="n">explainer</span><span class="o">.</span><span class="n">explain_instance</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">rfc</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">)</span>
</code></pre></div>
<p>Here, we're using the <code>explain_instance</code> method to generate an explanation for the sample at index <code>idx</code>. We're passing in the sample data and the <code>predict_proba</code> method of the random forest classifier, which is used to predict the probabilities of each class for the given sample.</p>
<p>Now, we can print out the top three features that are contributing to the prediction:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">exp</span><span class="o">.</span><span class="n">as_list</span><span class="p">()[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">exp</span><span class="o">.</span><span class="n">as_list</span><span class="p">()[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>This will give us something like:</p>
<div class="highlight"><pre><span></span><code><span class="mf">4.25</span> <span class="o"><</span> <span class="n">petal</span> <span class="n">length</span> <span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="o"><=</span> <span class="mf">5.10</span><span class="p">:</span> <span class="mf">0.21</span>
<span class="mf">0.30</span> <span class="o"><</span> <span class="n">petal</span> <span class="n">width</span> <span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="o"><=</span> <span class="mf">1.30</span><span class="p">:</span> <span class="mf">0.16</span>
<span class="n">sepal</span> <span class="n">width</span> <span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="o"><=</span> <span class="mf">2.80</span><span class="p">:</span> <span class="o">-</span><span class="mf">0.03</span>
</code></pre></div>
<p>This tells us that the most important feature for this prediction is petal length (cm): values between 4.25 and 5.10 contribute 0.21 toward the predicted class, followed by petal width (cm) with a contribution of 0.16, while a small sepal width pushes slightly against it.</p>
<p>We can also visualize the explanation using a bar chart:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">lime</span> <span class="kn">import</span> <span class="n">lime_tabular</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">exp</span><span class="o">.</span><span class="n">as_pyplot_figure</span><span class="p">()</span>
</code></pre></div>
<p>This will create a bar chart that shows the contribution of each feature to the prediction, with the most important features at the top:</p>
<p><img alt="LIME bar chart" src="/images/lime_tutorial/lime_bar_chart.png"></p>
<h2>Visualizing Model Decisions</h2>
<p>In addition to explaining individual predictions, LIME can also be used to visualize how the model is making decisions more generally. We can do this by generating multiple explanations for different samples and visualizing the patterns that emerge.</p>
<p>To start, we'll generate an explanation for a test data point; then we'll render it as an interactive visualization in the notebook:</p>
<div class="highlight"><pre><span></span><code><span class="n">exp</span> <span class="o">=</span> <span class="n">explainer</span><span class="o">.</span><span class="n">explain_instance</span><span class="p">(</span>
<span class="n">data_row</span><span class="o">=</span><span class="n">X_test</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span>
<span class="n">predict_fn</span><span class="o">=</span><span class="n">rfc</span><span class="o">.</span><span class="n">predict_proba</span>
<span class="p">)</span>
<span class="n">exp</span><span class="o">.</span><span class="n">show_in_notebook</span><span class="p">(</span><span class="n">show_table</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p><img alt="LIME - explanation visualization" src="/images/lime_tutorial/lime_explanation.png"></p>
<h2>Conclusion</h2>
<p>In this tutorial, we learned how to use the LIME library for explainable AI. We started by importing the necessary libraries and loading the Iris dataset. Then, we trained a random forest classifier on the dataset and used LIME to explain individual predictions and visualize model decisions.</p>
<p>We saw how LIME can be used to identify the most important features for a prediction, and how these features can be visualized using a bar chart. We also saw how LIME can be used to visualize how the model is making decisions more generally, using a decision plot.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<h2>Related</h2>
<ul>
<li><a href="https://stackoverflow.com/questions/63937620/how-to-plot-lime-report-when-there-is-a-lot-of-features-in-data-set">python - How to plot Lime report when there is a lot of features in data-set - Stack Overflow</a></li>
<li><a href="https://betterdatascience.com/lime/">LIME: How to Interpret Machine Learning Models With Python | Better Data Science</a></li>
<li><a href="https://coderzcolumn.com/tutorials/machine-learning/how-to-use-lime-to-understand-sklearn-models-predictions">How to Use LIME to Interpret Predictions of ML Models [Python]?</a></li>
<li><a href="https://shiring.github.io/machine_learning/2017/04/23/lime">Explaining complex machine learning models with LIME</a> (in R)</li>
</ul>How to Deploy FreshRSS in the Cloud for Free on Azure?2023-04-11T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-11:/how-to-deploy-freshrss-in-the-cloud-for-free-on-azure/<p>FreshRSS is a free and open-source RSS feed aggregator that allows you to easily follow your favorite websites and blogs in one place. By deploying FreshRSS in the cloud, you can access your feeds from anywhere and enjoy the benefits of cloud …</p><p>FreshRSS is a free and open-source RSS feed aggregator that allows you to easily follow your favorite websites and blogs in one place. By deploying FreshRSS in the cloud, you can access your feeds from anywhere and enjoy the benefits of cloud computing, such as scalability, reliability, and cost-effectiveness. Microsoft Azure is a popular cloud platform that offers a wide range of services for building, deploying, and managing applications in the cloud. In this tutorial, we'll show you how to deploy FreshRSS in the cloud for free on Azure.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#prerequisites">Prerequisites:</a></li>
<li><a href="#step-by-step-guide">Step-by-step guide</a></li>
<li><a href="#step-1-create-a-new-azure-web-app">Step 1: Create a new Azure web app</a></li>
<li><a href="#step-2-configure-your-web-app">Step 2: Configure your web app</a></li>
<li><a href="#step-3-deploy-freshrss">Step 3: Deploy FreshRSS</a></li>
<li><a href="#step-4-configure-ssl-optional">Step 4: Configure SSL (optional)</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="prerequisites"></a></p>
<h2>Prerequisites</h2>
<ul>
<li>A Microsoft Azure account</li>
<li>A basic understanding of Azure services and concepts</li>
<li>A web browser</li>
<li>An SSH client (optional)</li>
</ul>
<p><a id="step-by-step-guide"></a></p>
<h2>Step-by-step guide</h2>
<p><a id="step-1-create-a-new-azure-web-app"></a></p>
<h3>Step 1: Create a new Azure web app</h3>
<p>The first step is to create a new Azure web app, which will host your FreshRSS installation. Follow these steps to create a new web app:</p>
<ol>
<li>
<p>Log in to the Azure portal (<a href="https://portal.azure.com/">https://portal.azure.com</a>) using your Azure account credentials.</p>
</li>
<li>
<p>Click on the "Create a resource" button in the left-hand menu and search for "Web App".</p>
</li>
<li>
<p>Select the "Web App" service and click on the "Create" button.</p>
</li>
<li>
<p>Fill in the required details for your web app, such as the name, subscription, resource group, and runtime stack.</p>
</li>
<li>
<p>Choose the "Free" pricing tier, which provides up to 10 web, mobile, or API apps with shared compute resources and 1 GB storage per app.</p>
</li>
<li>
<p>Click on the "Review + create" button to review your settings and then click on the "Create" button to create your web app.</p>
</li>
<li>
<p>Wait for the deployment to complete, which may take a few minutes.</p>
</li>
</ol>
<p><a id="step-2-configure-your-web-app"></a></p>
<h3>Step 2: Configure your web app</h3>
<p>The next step is to configure your web app with the necessary settings and dependencies. Follow these steps to configure your web app:</p>
<ol>
<li>
<p>Navigate to your web app in the Azure portal and click on the "Configuration" tab.</p>
</li>
<li>
<p>Under the "General settings" section, set the "Linux container" option to "On".</p>
</li>
<li>
<p>Under the "Stack settings" section, set the "Runtime stack" option to "PHP 7.3".</p>
</li>
<li>
<p>Under the "Application settings" section, add the following key-value pairs:</p>
<ul>
<li>Key: WEBSITE_TIME_ZONE, Value: Your timezone (e.g., "America/Los_Angeles")</li>
<li>Key: WEBSITE_AUTH_ENABLED, Value: False</li>
<li>Key: WEBSITE_NODE_DEFAULT_VERSION, Value: 10.14.2</li>
</ul>
<p>Click on the "Save" button to save your changes.</p>
</li>
</ol>
<p><a id="step-3-deploy-freshrss"></a></p>
<h3>Step 3: Deploy FreshRSS</h3>
<p>The next step is to deploy FreshRSS to your web app. Follow these steps to deploy FreshRSS:</p>
<ol>
<li>
<p>Download the latest version of FreshRSS from the official website (<a href="https://freshrss.org/">https://freshrss.org</a>) and extract the files to a local directory.</p>
</li>
<li>
<p>Open a command prompt or terminal window and navigate to the local directory where you extracted the FreshRSS files.</p>
</li>
<li>
<p>Use the following command to create a ZIP archive of the FreshRSS files:</p>
<p><code>zip -r freshrss.zip .</code></p>
</li>
<li>
<p>Return to the Azure portal and navigate to your web app.</p>
</li>
<li>
<p>Click on the "Deployment Center" tab and select the "Local Git" option.</p>
</li>
<li>
<p>Follow the on-screen instructions to create a new deployment user and download the deployment credentials.</p>
</li>
<li>
<p>Use the following commands to add the Azure Git remote to your local Git repository and push your changes to the Azure web app:</p>
<p><code>git remote add azure &lt;deployment-endpoint&gt;</code><br>
<code>git push azure master</code></p>
</li>
<li>
<p>When prompted, enter the deployment username and password that you created earlier.</p>
</li>
<li>
<p>Wait for the deployment to complete, which may take a few minutes.</p>
</li>
<li>
<p>Once the deployment is complete, open a web browser and navigate to your web app's URL to access FreshRSS. You should see the FreshRSS installation page.</p>
</li>
<li>
<p>Follow the on-screen instructions to complete the FreshRSS installation. Make sure to set the database type to "SQLite" and the database path to "/home/site/wwwroot/data/freshrss.db".</p>
</li>
<li>
<p>Once the installation is complete, you should be able to access FreshRSS and start adding your favorite feeds.</p>
</li>
</ol>
<p><a id="step-4-configure-ssl-optional"></a></p>
<h3>Step 4: Configure SSL (optional)</h3>
<p>If you want to secure your FreshRSS installation with SSL, you can do so by configuring a custom domain and adding an SSL certificate. Follow these steps to configure SSL:</p>
<ol>
<li>
<p>Purchase a custom domain from a domain registrar, such as GoDaddy or Namecheap.</p>
</li>
<li>
<p>Navigate to your web app in the Azure portal and click on the "Custom domains" tab.</p>
</li>
<li>
<p>Add your custom domain and follow the on-screen instructions to configure DNS settings.</p>
</li>
<li>
<p>Once your domain is configured, navigate to the "SSL certificates" tab and click on the "Create App Service Managed Certificate" button.</p>
</li>
<li>
<p>Follow the on-screen instructions to create a new SSL certificate for your custom domain.</p>
</li>
<li>
<p>Once the certificate is created, navigate back to the "Custom domains" tab and click on the "Add binding" button.</p>
</li>
<li>
<p>Select your custom domain and the newly created SSL certificate and click on the "Add binding" button.</p>
</li>
<li>
<p>Wait for the SSL certificate to be provisioned, which may take a few minutes.</p>
</li>
<li>
<p>Once the SSL certificate is provisioned, you should be able to access FreshRSS securely using your custom domain.</p>
</li>
</ol>
<p>X::<a href="https://www.safjan.com/how-to-deploy-freshrss-in-the-cloud-for-free-on-gcp/">How to Deploy FreshRSS in the Cloud for Free on GCP?</a></p>How to Deploy FreshRSS in the Cloud for Free on GCP?2023-04-11T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-11:/how-to-deploy-freshrss-in-the-cloud-for-free-on-gcp/<p>To deploy FreshRSS in the cloud for free on Google Cloud Platform (GCP), you can follow these steps:</p>
<ol>
<li>
<p>Create a new project on GCP and enable billing. FreshRSS requires a web server and a database, and GCP provides free usage limits for …</p></li></ol><p>To deploy FreshRSS in the cloud for free on Google Cloud Platform (GCP), you can follow these steps:</p>
<ol>
<li>
<p>Create a new project on GCP and enable billing. FreshRSS requires a web server and a database, and GCP provides free usage limits for these services for a limited time. You will need to provide billing information to verify your account and enable these services.</p>
</li>
<li>
<p>Launch a Compute Engine instance. FreshRSS can run on any Linux-based server, so you can choose an instance that meets your needs. For this example, we'll use a micro instance with Debian 10.</p>
</li>
<li>
<p>Connect to the instance using SSH. You can use the SSH button in the GCP Console or connect from your terminal using the external IP address.</p>
</li>
<li>
<p>Install the necessary packages. Run the following command to update the package index and install the required packages:</p>
</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>apt<span class="w"> </span>update
sudo<span class="w"> </span>apt<span class="w"> </span>install<span class="w"> </span>apache2<span class="w"> </span>mariadb-server<span class="w"> </span>php7.3<span class="w"> </span>php7.3-mysql<span class="w"> </span>php7.3-curl<span class="w"> </span>php7.3-xml
</code></pre></div>
<ol>
<li>Configure the database. Follow these steps to create a new database and user for FreshRSS:</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>mysql<span class="w"> </span>-u<span class="w"> </span>root
CREATE<span class="w"> </span>DATABASE<span class="w"> </span>freshrss<span class="p">;</span>
GRANT<span class="w"> </span>ALL<span class="w"> </span>PRIVILEGES<span class="w"> </span>ON<span class="w"> </span>freshrss.*<span class="w"> </span>TO<span class="w"> </span><span class="s1">'freshrssuser'</span>@<span class="s1">'localhost'</span><span class="w"> </span>IDENTIFIED<span class="w"> </span>BY<span class="w"> </span><span class="s1">'password'</span><span class="p">;</span>
FLUSH<span class="w"> </span>PRIVILEGES<span class="p">;</span>
EXIT<span class="p">;</span>
</code></pre></div>
<ol>
<li>Download and install FreshRSS. Run the following commands to download the latest version of FreshRSS and extract it to the web root:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>/var/www/html
sudo<span class="w"> </span>wget<span class="w"> </span>https://github.com/FreshRSS/FreshRSS/archive/master.tar.gz
sudo<span class="w"> </span>tar<span class="w"> </span>-xzf<span class="w"> </span>master.tar.gz<span class="w"> </span>--strip-components<span class="o">=</span><span class="m">1</span>
sudo<span class="w"> </span>chown<span class="w"> </span>-R<span class="w"> </span>www-data:www-data<span class="w"> </span>.
</code></pre></div>
<ol start="4">
<li>Configure the web server. Edit the default Apache configuration file to enable URL rewriting:</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>nano<span class="w"> </span>/etc/apache2/sites-enabled/000-default.conf
</code></pre></div>
<div class="highlight"><pre><span></span><code>Add the following lines inside the `<VirtualHost>` block:
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="nt"><Directory</span><span class="w"> </span><span class="err">/var/www/html</span><span class="nt">></span>
<span class="w"> </span>AllowOverride<span class="w"> </span>All
<span class="nt"></Directory></span>
</code></pre></div>
<ol start="5">
<li>Restart the web server. Run the following command to apply the changes:</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>apache2
</code></pre></div>
<ol start="6">
<li>Complete the FreshRSS setup. Open your web browser and navigate to the external IP address of your instance. Follow the on-screen instructions to configure FreshRSS.</li>
</ol>
<p>Congratulations, you have successfully deployed FreshRSS in the cloud for free on GCP!</p>
<p>X::<a href="https://www.safjan.com/how-to-deploy-freshrss-in-the-cloud-for-free-on-azure/">How to Deploy FreshRSS in the Cloud for Free on Azure?</a></p>Zero-Knowledge Explained Like to 5 Years Old2023-04-06T00:00:00+02:002023-04-06T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-06:/zero-knowledge-for-5yo/<p>Imagine being able to prove something without actually revealing it. That is the power of zero-knowledge proofs, the technology that keeps your crypto safe.</p><p>Zero-knowledge proofs (ZKPs) are a key technology that underpins the security and privacy of many modern cryptocurrencies. In essence, ZKPs allow parties to prove that they know a piece of information, without revealing that information itself. But what does that mean, exactly? In this blog post, we'll explain ZKPs in a way that even a 5-year-old can understand.</p>
<h2>Helper example</h2>
<p>Let's start with a basic example. Imagine you have a secret toy that you don't want anyone else to know about. Your friend wants to prove to you that they know what the toy is, without actually telling you what it is. How can they do that?</p>
<p>One way to do it is to play a guessing game. Your friend can ask you a series of questions about the toy, such as "Is it blue?" or "Does it have wheels?" Based on your answers, your friend can narrow down the possibilities until they have a pretty good idea of what the toy is. This is a bit like a multiple-choice test: by eliminating the wrong answers, you can eventually arrive at the right one.</p>
<p>But what if your friend wants to prove that they know the toy, without giving you any clues about what it is? That's where zero-knowledge proofs come in.</p>
<p>Imagine your friend has a magic wand that can tell them whether a particular guess is right or wrong, without actually revealing what the correct answer is. So they can make a guess, wave the wand, and get a "yes" or "no" answer. If the answer is "no", they can make another guess and try again. If the answer is "yes", they've proven that they know the toy, without actually revealing what it is.</p>
<p>This is a bit like playing "20 questions", but with a magical yes-or-no answer that doesn't give away any information. Your friend doesn't need to ask you any questions about the toy, they just need to make a series of guesses and use the magic wand to check if they're right or wrong. And because the wand doesn't reveal anything about the toy itself, you still don't know what it is.</p>
<h2>Zero-knowledge in Cryptocurrency</h2>
<p>Now, let's apply this idea to cryptocurrency. In a blockchain system like Bitcoin, transactions are recorded on a public ledger that anyone can see. But the ledger doesn't reveal who the parties involved in the transaction are. Instead, it uses cryptographic techniques to obscure their identities.</p>
<p>For example, imagine you want to send some Bitcoin to a friend. You create a transaction that says "send X amount of Bitcoin to this address". But instead of using your real name and address, you use a pseudonymous address that's associated with your public key.</p>
<p>The public key is a string of characters that's generated using a complex mathematical algorithm. It's unique to you, and it's used to encrypt and decrypt messages that are sent to and from your address. But it doesn't reveal your actual identity.</p>
<p>So when you send the Bitcoin, the transaction is broadcast to the network and added to the blockchain. But nobody knows who the parties involved are, because they're identified only by their public keys.</p>
<p>This is where zero-knowledge proofs come in. Imagine you want to prove to someone that you own a particular address, without revealing what that address is. You could use a zero-knowledge proof to demonstrate that you know the private key associated with that address, without actually showing the key itself.</p>
<blockquote>
<p><strong>Zero-knowledge proof (ZKP)</strong>
The proof works by having the verifier issue a random "challenge". You then compute a response, using your private key, that demonstrates you know the key, without revealing what it is.</p>
</blockquote>
<p>This is a bit like the guessing game we talked about earlier. The challenge is like a question that's designed to test whether you know the private key, and the response is like an answer that proves that you do, without revealing what the key is. This allows you to prove ownership of the address, without revealing any sensitive information.</p>
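<p>To make the challenge-response idea concrete, here is a toy sketch of one round of a Schnorr-style identification protocol in Python. The parameters below are tiny and chosen purely for illustration (real systems use carefully selected large groups), but the shape is the same: the verifier becomes convinced that the prover knows the secret exponent, without learning the exponent itself.</p>

```python
import secrets

# Toy Schnorr-style identification: prove knowledge of x with y = g^x mod p,
# without revealing x. Illustrative parameters only -- NOT secure.
p = 2**127 - 1            # a prime modulus (a Mersenne prime, for the demo)
q = p - 1                 # order of the multiplicative group mod p
g = 3                     # public base

x = secrets.randbelow(q - 1) + 1   # prover's secret key
y = pow(g, x, p)                   # prover's public key

# One round of the interactive protocol:
r = secrets.randbelow(q - 1) + 1   # prover picks a random nonce
t = pow(g, r, p)                   # ...and sends the commitment t
c = secrets.randbelow(q - 1) + 1   # verifier sends a random challenge
s = (r + c * x) % q                # prover answers with the response s

# Verifier accepts iff g^s == t * y^c (mod p); this reveals nothing about x.
print(pow(g, s, p) == (t * pow(y, c, p)) % p)  # True
```

<p>Note how this mirrors the magic-wand game: the commitment and response together act as a "yes" answer that checks out mathematically, while the secret itself never leaves the prover.</p>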
<p>This is important for privacy and security in cryptocurrency, because it means that you can prove ownership of an address without revealing your identity or any other sensitive information. It also makes it much harder for hackers or other bad actors to steal your cryptocurrency, because they would need to know your private key in order to access your funds.</p>
<p>So there you have it, zero-knowledge proofs explained like you're 5 years old! They're a clever way of proving that you know something, without actually revealing what it is. And in the world of cryptocurrency, they're a key technology that helps to ensure the security and privacy of your transactions.</p>
<blockquote>
<p><strong>ZKP Origin</strong>
Zero-knowledge proofs were first introduced by researchers Shafi Goldwasser, Silvio Micali, and Charles Rackoff in 1985. Their groundbreaking paper, <a href="https://dl.acm.org/doi/10.1145/22145.22178">"The Knowledge Complexity of Interactive Proof-Systems,"</a> laid the foundation for zero-knowledge proof systems.
Silvio Micali won the <a href="https://amturing.acm.org/award_winners/micali_9954407.cfm">Turing Award</a> for his work on cryptography, including the invention of zero-knowledge (ZK) proofs.</p>
</blockquote>
<h2>Related reading</h2>
<ul>
<li>The Reddit user <a href="https://www.reddit.com/user/busterrulezzz/">busterrulezzz (u/busterrulezzz) - Reddit</a> proposed another ELI5 explanation of how ZKPs work: <a href="https://www.reddit.com/r/CryptoCurrency/comments/rwpfkx/zeroknowledge_proof_explained_like_you_are_5/">Zero-knowledge proof explained like you are 5 years old : r/CryptoCurrency</a></li>
<li><a href="https://hackernoon.com/eli5-zero-knowledge-proof-78a276db9eff">Zero Knowledge Proof: Explain it Like I’m 5 (Halloween Edition) | HackerNoon</a></li>
</ul>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Python - Named Tuples or Dictionaries to Store Structured Data?2023-04-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-03:/named-tuples-vs-dictionaries/<p>Let's assume that, in Python, we have a long list of pairs to store. In this note, we will discuss the pros and cons of using a named tuple vs. a dict to store a single pair.</p>
<p>Both named tuples and dictionaries are useful …</p><p>Let's assume that, in Python, we have a long list of pairs to store. In this note, we will discuss the pros and cons of using a named tuple vs. a dict to store a single pair.</p>
<p>Both named tuples and dictionaries are useful data structures for storing key-value pairs in Python, but they have different pros and cons depending on the situation.</p>
<h2>Named tuples</h2>
<p>Here are some pros and cons of using named tuples:</p>
<h3>Pros</h3>
<ul>
<li>Named tuples are immutable, so they are safer to <strong>use in multithreaded</strong> environments where multiple threads might try to modify the same data at the same time.</li>
<li>Named tuples can be <strong>more memory-efficient</strong> than dictionaries, especially if you have a large number of instances with the same fields.</li>
<li>Named tuples are more readable than dictionaries when you have a fixed set of fields and you want to give them meaningful names.</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Named tuples are less flexible than dictionaries because you can't add or remove fields once they are defined.</li>
<li>Named tuples can be less convenient to use than dictionaries if you need to access fields by key rather than by attribute name.</li>
</ul>
<h2>Dictionaries</h2>
<p>Here are some pros and cons of using dictionaries:</p>
<h3>Pros</h3>
<ul>
<li>Dictionaries are more flexible than named tuples because you can add or remove fields at any time.</li>
<li>Dictionaries are more convenient to use than named tuples if you need to access fields by key rather than by attribute name.</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Dictionaries are mutable, so you need to be careful when using them in multithreaded environments.</li>
<li>Dictionaries can be less memory-efficient than named tuples, especially if you have a large number of instances with the same fields.</li>
<li>Dictionaries are less readable than named tuples when you have a fixed set of fields and you want to give them meaningful names.</li>
</ul>
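<p>The trade-offs above can be seen directly in a short snippet (the field names here are just examples):</p>

```python
import sys
from collections import namedtuple

Pair = namedtuple("Pair", ["key", "value"])

nt = Pair(key="color", value="blue")   # access by attribute
d = {"key": "color", "value": "blue"}  # access by key

print(nt.value, d["value"])  # blue blue

# Immutability: a named tuple rejects in-place modification...
try:
    nt.value = "red"
except AttributeError:
    print("named tuple is immutable")

# ...while a dict happily accepts changed or brand-new fields.
d["value"] = "red"
d["extra"] = 42

# Named tuples are also typically smaller per instance than dicts.
print(sys.getsizeof(nt) < sys.getsizeof(d))  # True on CPython
```

<p>The exact byte sizes vary between Python versions, but on CPython the per-instance overhead of the dict is consistently larger than that of the named tuple.</p>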
<h2>Conclusion</h2>
<p>If you have a fixed set of fields with meaningful names, and you don't need to add or remove fields at runtime, a named tuple is a good choice. If you need more flexibility, or you need to access fields by key rather than by attribute name, a dictionary is a better choice.</p>Python - How to Make Type Hint for the Tuple With Undetermined Number of Strings?2023-04-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-03:/type-hint-for-undetermined-number-of-elements/<p>To make a type hint for a tuple with an undetermined number of strings in Python, you can use the <code>Tuple</code> type from the <code>typing</code> module together with an ellipsis (<code>...</code>). Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">process_strings</span><span class="p">(</span><span class="n">strings</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">str …</span></code></pre></div><p>To make a type hint for a tuple with an undetermined number of strings in Python, you can use the <code>Tuple</code> type from the <code>typing</code> module together with an ellipsis (<code>...</code>). Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">process_strings</span><span class="p">(</span><span class="n">strings</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="o">...</span><span class="p">])</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">", "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">strings</span><span class="p">)</span>
<span class="n">strings1</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"hello"</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">)</span>
<span class="n">strings2</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"foo"</span><span class="p">,</span> <span class="s2">"bar"</span><span class="p">,</span> <span class="s2">"baz"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">process_strings</span><span class="p">(</span><span class="n">strings1</span><span class="p">))</span> <span class="c1"># Output: "hello, world"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">process_strings</span><span class="p">(</span><span class="n">strings2</span><span class="p">))</span> <span class="c1"># Output: "foo, bar, baz"</span>
</code></pre></div>
<p>In the type hint above, <code>Tuple</code> indicates that we are working with a tuple, and the ellipsis (<code>...</code>) indicates that the tuple can contain an undetermined number of <code>str</code> elements. Note that <code>Union</code> is not needed here: <code>Tuple[str, ...]</code> is the standard way to annotate a homogeneous, variable-length tuple, whereas <code>Union[str, ...]</code> is not valid typing syntax.</p>
<p>X::<a href="https://www.safjan.com/type-hints-elypsis-for-arbitrary-number-of-elements/">How to Use Elypsis in Type Hints to Indicate Arbitrary Number of Elements</a></p>How to Use Elypsis in Type Hints to Indicate Arbitrary Number of Elements2023-04-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-03:/type-hints-elypsis-for-arbitrary-number-of-elements/<p>In type hints, <code>...</code> (ellipsis) is used to indicate that a function parameter or return value can have an arbitrary number of arguments or elements.</p>
<p>For example, if you have a function that takes an arbitrary number of integers as arguments, you can …</p><p>In type hints, <code>...</code> (ellipsis) is used to indicate that a function parameter or return value can have an arbitrary number of arguments or elements.</p>
<p>For example, if you have a function that takes an arbitrary number of integers as arguments, you can use <code>...</code> in the function signature to indicate that:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
<span class="k">return</span> <span class="p">[</span><span class="n">x</span> <span class="o">*</span> <span class="mi">2</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">args</span><span class="p">]</span>
</code></pre></div>
<p>Here, <code>*args</code> is used to indicate that the function can take any number of arguments, and the <code>int</code> type hint indicates that each argument must be an integer. The return type is a list of integers.</p>
<p>Similarly, you can use <code>...</code> in a type hint for a tuple to indicate that the tuple can have an arbitrary number of elements of a given type. For example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">bar</span><span class="p">(</span><span class="n">t</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="o">...</span><span class="p">])</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
<span class="n">t1</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"hello"</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">)</span>
<span class="n">t2</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"foo"</span><span class="p">,</span> <span class="s2">"bar"</span><span class="p">,</span> <span class="s2">"baz"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">bar</span><span class="p">(</span><span class="n">t1</span><span class="p">))</span> <span class="c1"># Output: "hello world"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">bar</span><span class="p">(</span><span class="n">t2</span><span class="p">))</span> <span class="c1"># Output: "foo bar baz"</span>
</code></pre></div>
<p>Here, <code>Tuple[str, ...]</code> is used to indicate that <code>t</code> is a tuple of strings, and the <code>...</code> indicates that the tuple can have an arbitrary number of elements.</p>
<p>X::<a href="https://www.safjan.com/type-hint-for-undetermined-number-of-elements/">Python - How to Make Type Hint for the Tuple With Undetermined Number of Strings?</a></p>
<p>X::<a href="https://www.safjan.com/use-python-typeddict-to-type-hint-dictionaries/">Use Python TypedDict to Type Hint Dictionaries</a></p>Git - Annotated vs. Lightweight Tags2023-03-31T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-31:/git-annotated-vs-lightweight-tags/<p>In Git, tags are used to mark specific points in the history of a repository. They serve as a reference point for developers to easily identify and navigate to important milestones, such as releases or significant commits. There are two types of …</p><p>In Git, tags are used to mark specific points in the history of a repository. They serve as a reference point for developers to easily identify and navigate to important milestones, such as releases or significant commits. There are two types of tags in Git: annotated tags and lightweight tags.</p>
<h2>Annotated tags</h2>
<p>Annotated tags are more informative than lightweight tags. When creating an annotated tag, Git stores a full object in the repository that contains the tagger name, email, and date, a tagging message, and a SHA-1 checksum of the commit being tagged. Annotated tags are essentially Git objects that are separate from the commit objects they reference, whereas lightweight tags are simply pointers to specific commits.</p>
<p>The additional information stored in an annotated tag makes them useful for documenting significant events in the project's history. The tagging message can provide context about why the tag was created and what it represents. Additionally, annotated tags can be signed and verified to ensure their authenticity. Signed tags provide assurance that the tag was created by an authorized person and that the commit being tagged has not been tampered with.</p>
<p><strong>Example:</strong></p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>tag<span class="w"> </span>-a<span class="w"> </span>v1.2<span class="w"> </span>-m<span class="w"> </span><span class="s2">"my version 1.4"</span>
</code></pre></div>
<h2>Lightweight tags</h2>
<p>Lightweight tags, on the other hand, are simply references to specific commits. They do not store any additional information beyond the tag name and the commit ID. Lightweight tags are useful for marking temporary or internal points in the repository history, such as to label a specific commit for testing or debugging purposes. Lightweight tags are created with the <code>git tag</code> command without the <code>-a</code> or <code>-m</code> options.</p>
<p><strong>Example:</strong></p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>tag<span class="w"> </span>v1.2
</code></pre></div>
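<p>One way to see the difference between the two kinds of tags is to ask Git what type of object each tag name points to. A quick sketch in a throwaway repository (the paths, tag names, and identity flags below are just for the demo):</p>

```shell
# Create a throwaway repo with one empty commit
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Annotated tag: stores a full tag object with message and tagger
git tag -a v1.2 -m "my version 1.2"
# Lightweight tag: just a ref pointing directly at the commit
git tag v1.2-light

git cat-file -t v1.2          # prints: tag
git cat-file -t v1.2-light    # prints: commit
```

<p>The annotated tag resolves to its own <code>tag</code> object, while the lightweight tag resolves straight to the <code>commit</code> it labels.</p>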
<h2>When to use Annotated and when Lightweight tags?</h2>
<p>So when should you use annotated tags versus lightweight tags? Annotated tags are ideal for marking significant events in the project's history, such as releases, milestones, or important changes. They are also useful for documenting the context and reasoning behind a particular tag. Lightweight tags, on the other hand, are useful for temporary or internal purposes, such as marking specific commits for debugging or testing purposes.</p>
<blockquote>
<p>In general, it is a good practice to <strong>use annotated</strong> tags for any <strong>official releases</strong> or <strong>milestones</strong>, as they provide a clear and detailed record of the project's progress. <strong>Lightweight</strong> tags can be used for more informal purposes, such as to <strong>mark experimental</strong> or <strong>intermediate points</strong> in the project's history.</p>
</blockquote>Contextual Understanding in Automated Speech-to-Text Transcription - Machine Learning Techniques and Challenges2023-03-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-30:/contextual_understanding-speech-to-text/<p>Automated speech-to-text transcription has come a long way in recent years, with advances in artificial intelligence and natural language processing enabling machines to transcribe human speech with increasing accuracy. However, there are still several challenges that remain unsolved, and which continue to …</p><p>Automated speech-to-text transcription has come a long way in recent years, with advances in artificial intelligence and natural language processing enabling machines to transcribe human speech with increasing accuracy. However, there are still several challenges that remain unsolved, and which continue to limit the capabilities of automated speech recognition technology. In this blog post, we will explore some of the biggest unsolved problems in automated speech-to-text transcription.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#challenges">Challenges</a><ul>
<li><a href="#1-accurate-transcription-of-spontaneous-speech">1. Accurate transcription of spontaneous speech</a></li>
<li><a href="#2-handling-multiple-speakers">2. Handling multiple speakers</a></li>
<li><a href="#3-handling-accents-and-dialects">3. Handling accents and dialects</a></li>
<li><a href="#4-contextual-understanding">4. Contextual understanding</a></li>
<li><a href="#5-real-time-transcription">5. Real-time transcription</a></li>
<li><a href="#6-data-privacy-and-security">6. Data privacy and security</a></li>
</ul>
</li>
<li><a href="#contextual-understanding">Contextual understanding</a><ul>
<li><a href="#importance">Importance</a></li>
<li><a href="#approaches">Approaches</a></li>
</ul>
</li>
<li><a href="#machine-learning-techniques-for-contextual-understanding">Machine Learning techniques for Contextual understanding</a><ul>
<li><a href="#disambiguation">Disambiguation</a></li>
<li><a href="#hybrid-approaches">hybrid approaches</a></li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="challenges"></a></p>
<h2>Challenges</h2>
<p><a id="1-accurate-transcription-of-spontaneous-speech"></a></p>
<h3>1. Accurate transcription of spontaneous speech</h3>
<p>One of the biggest challenges in automated speech-to-text transcription is accurately transcribing spontaneous speech. Spontaneous speech is characterized by its lack of structure and tendency to contain many disfluencies, such as repetitions, false starts, and filled pauses. This type of speech is particularly challenging for machines to transcribe accurately, as it can be difficult to distinguish between disfluencies and actual words. This can lead to errors in the transcribed text, which can be frustrating for users and limit the usefulness of the technology.</p>
<p><a id="2-handling-multiple-speakers"></a></p>
<h3>2. Handling multiple speakers</h3>
<p>Another major challenge in automated speech-to-text transcription is handling multiple speakers. When there are multiple speakers involved, it can be difficult for machines to distinguish between them and accurately attribute the words to the correct speaker. This can lead to confusion and errors in the transcribed text, which can be particularly problematic in applications where it is important to know who said what. There has been some progress in this area, with some automated transcription services now able to recognize multiple speakers, but there is still room for improvement.</p>
<p><a id="3-handling-accents-and-dialects"></a></p>
<h3>3. Handling accents and dialects</h3>
<p>Accents and dialects can also pose a significant challenge for automated speech-to-text transcription. Different accents and dialects can vary greatly in terms of pronunciation, intonation, and grammar, which can make it difficult for machines to accurately transcribe speech from speakers with different accents or dialects. This is particularly problematic in applications where it is important to accurately capture the nuances of the speaker's speech, such as in legal or medical settings.</p>
<p><a id="4-contextual-understanding"></a></p>
<h3>4. Contextual understanding</h3>
<p>Another major challenge in automated speech-to-text transcription is contextual understanding. Machines are able to transcribe speech accurately based on the words that are spoken, but they may not always be able to understand the context in which those words are being used. For example, machines may struggle to accurately transcribe a sentence that contains homophones, such as "I saw the bear" versus "I saw the bare". In order to accurately transcribe speech, machines need to be able to understand the context in which the words are being used.</p>
<p><a id="5-real-time-transcription"></a></p>
<h3>5. Real-time transcription</h3>
<p>Real-time transcription is another major challenge for automated speech-to-text transcription. Real-time transcription involves transcribing speech as it is being spoken, rather than after the fact. This can be particularly challenging, as machines need to be able to transcribe speech quickly and accurately, without the benefit of being able to go back and review what was said. Real-time transcription is becoming increasingly important in a number of applications, such as live captioning of video content, but there is still room for improvement in this area.</p>
<p><a id="6-data-privacy-and-security"></a></p>
<h3>6. Data privacy and security</h3>
<p>Finally, data privacy and security is a major concern in automated speech-to-text transcription. In order to transcribe speech accurately, machines need to be trained on large amounts of data, which may contain sensitive information. This raises concerns about how that data is collected, stored, and used, and whether appropriate safeguards are in place to protect user privacy. As the use of automated speech-to-text transcription continues to grow, it will be important to ensure that user data is handled in a responsible and secure manner.</p>
<p><a id="contextual-understanding"></a></p>
<h2>Contextual understanding</h2>
<p>Contextual understanding is one of the biggest challenges facing automated speech-to-text transcription. Machines are able to transcribe speech accurately based on the words that are spoken, but they may not always be able to understand the context in which those words are being used. In order to accurately transcribe speech, machines need to be able to understand the context in which the words are being used, including the speaker's tone, mood, and intent.</p>
<p><a id="importance"></a></p>
<h3>Importance</h3>
<p>Contextual understanding is important for a number of reasons. First, it can help to reduce errors in automated speech-to-text transcription. When machines are able to understand the context in which words are being used, they are less likely to make mistakes or misinterpret the speaker's meaning. This can improve the accuracy of the transcribed text and make it more useful for a variety of applications.</p>
<p>Second, contextual understanding can help to improve the quality of the transcribed text. When machines are able to understand the context in which words are being used, they can more accurately transcribe the speaker's tone and mood. This can be particularly important in applications such as customer service or support, where it is important to accurately capture the speaker's emotions in order to provide an appropriate response.</p>
<p>Finally, contextual understanding can help to improve the overall user experience. When machines are able to accurately transcribe speech and understand the context in which words are being used, users are more likely to have a positive experience with the technology. This can help to increase adoption and usage of automated speech-to-text transcription in a variety of applications.</p>
<p><a id="approaches"></a></p>
<h3>Approaches</h3>
<p>There are several approaches that can be used to improve contextual understanding in automated speech-to-text transcription. One approach is to use machine learning algorithms to analyze the context in which words are being used. Machine learning algorithms can be trained on large datasets of speech and text data to learn how to identify patterns in the way that words are used in different contexts. This can help machines to more accurately transcribe speech and understand the context in which words are being used.</p>
<p>Another approach is to incorporate additional information into the transcription process. For example, machines can be programmed to recognize certain words or phrases that are commonly used in specific contexts, such as in a medical setting or in a legal deposition. This can help to improve the accuracy of the transcribed text and ensure that the context in which words are being used is correctly understood.</p>
<p>Contextual understanding is an important area of research in automated speech-to-text transcription, and there is still much work to be done in this area. As the technology continues to evolve and improve, it is likely that machines will become increasingly capable of understanding the context in which words are being used. This will help to improve the accuracy and quality of the transcribed text, and make automated speech-to-text transcription a more valuable tool for a variety of applications.</p>
<p>However, there are also important ethical considerations when it comes to contextual understanding. Machines that can accurately understand the context in which words are being used may also be able to infer personal information about the speaker, such as their emotions, intent, or political beliefs. This raises important questions about data privacy and security, and highlights the need for responsible use and handling of user data in automated speech-to-text transcription. As the technology continues to evolve, it will be important to ensure that user data is protected and used in a responsible and ethical manner.</p>
<p>In addition to machine learning and incorporating additional information, another approach to improving contextual understanding in automated speech-to-text transcription is to incorporate other types of data into the transcription process. For example, machines can be programmed to recognize the speaker's accent or dialect, which can provide important contextual information about the way that words are being used.</p>
<p>Similarly, machines can be programmed to recognize the speaker's gender, age, or other demographic characteristics. This demographic information provides further context that can help machines transcribe speech more accurately and interpret how words are being used.</p>
<p>There are also challenges associated with contextual understanding in automated speech-to-text transcription. Word usage varies considerably from one context to another, which makes it hard for machines to transcribe speech accurately, and cultural or regional differences in usage complicate the task further.</p>
<p>Another challenge is that context can be dynamic and change rapidly over the course of a conversation. Machines need to be able to adapt to changes in context in real time in order to accurately transcribe speech and understand the context in which words are being used.</p>
<p><a id="machine-learning-techniques-for-contextual-understanding"></a></p>
<h2>Machine Learning Techniques for Contextual Understanding</h2>
<p>Machine learning techniques are commonly used to improve contextual understanding in automated speech-to-text transcription. In this post, we will discuss some of the key machine learning techniques used for this purpose.</p>
<p><a id="disambiguation"></a></p>
<h3>Disambiguation</h3>
<p>One of the most widely used approaches for contextual understanding is natural language processing (NLP), a field at the intersection of linguistics and machine learning that focuses on analyzing and understanding human language. NLP algorithms are trained on large datasets of text data and are used to analyze the context in which words are being used in speech.</p>
<p>One of the key challenges in NLP is disambiguation, or the process of determining the correct meaning of a word based on its context. For example, the word "bank" can refer to a financial institution or the side of a river. To accurately transcribe speech, machines need to be able to accurately disambiguate words based on their context.</p>
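<p>As an illustration, a crude form of this context-based disambiguation can be sketched with a hand-built sense inventory. The cue-word sets below are invented for the example; real systems learn such associations from large corpora:</p>

```python
# Toy word-sense disambiguation: pick the sense of an ambiguous word by
# counting how many of each sense's cue words appear in the context.
SENSE_CUES = {
    "bank": {
        "financial institution": {"money", "loan", "deposit", "account"},
        "river side": {"river", "water", "fishing", "shore"},
    }
}

def disambiguate(word, context_tokens):
    """Return the sense whose cue words overlap the context the most."""
    senses = SENSE_CUES.get(word, {})
    context = {t.lower().strip(".,") for t in context_tokens}
    return max(senses, key=lambda s: len(senses[s] & context), default=None)

print(disambiguate("bank", "I paid the loan at the bank".split()))
# financial institution
```

A production system would replace the hand-written cue sets with sense representations learned from data, but the principle of scoring senses against the surrounding words is the same.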
<h4>Part-of-speech (POS) tagging</h4>
<p>One technique for disambiguation is part-of-speech (POS) tagging. POS tagging involves analyzing each word in a sentence and assigning it a part-of-speech tag, such as noun, verb, adjective, or adverb. By analyzing the parts of speech used in a sentence, machines can gain a better understanding of the context in which words are being used.</p>
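<p>A toy tagger illustrates the idea: look each word up in a small hand-made lexicon and fall back on suffix heuristics. The lexicon and rules below are placeholders chosen for the example; real taggers are trained on annotated corpora:</p>

```python
# Minimal, illustrative POS tagger: lexicon lookup with suffix fallbacks.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "runs": "VERB", "happy": "ADJ"}

def toy_pos_tag(tokens):
    tags = []
    for tok in tokens:
        word = tok.lower()
        if word in LEXICON:
            tags.append((tok, LEXICON[word]))      # known word: direct lookup
        elif word.endswith("ly"):
            tags.append((tok, "ADV"))              # suffix heuristic
        elif word.endswith(("ing", "ed")):
            tags.append((tok, "VERB"))
        else:
            tags.append((tok, "NOUN"))             # fallback guess
    return tags

print(toy_pos_tag("the dog runs quickly".split()))
# [('the', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB'), ('quickly', 'ADV')]
```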
<h4>Named entity recognition (NER)</h4>
<p>Another NLP technique used for contextual understanding is named entity recognition (NER). NER involves identifying and classifying named entities in text data, such as people, organizations, and locations. By identifying named entities in speech, machines can gain a better understanding of the context in which words are being used.</p>
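<p>A minimal NER sketch combines a gazetteer (a list of known entities) with a capitalization rule. The gazetteer entries and the fallback label below are invented for the example; real NER systems use trained sequence models:</p>

```python
# Toy NER: gazetteer lookup plus a capitalization heuristic.
GAZETTEER = {"london": "LOCATION", "paris": "LOCATION", "google": "ORGANIZATION"}

def toy_ner(tokens):
    entities = []
    for tok in tokens:
        word = tok.lower().strip(".,")
        if word in GAZETTEER:
            entities.append((tok, GAZETTEER[word]))
        elif tok[:1].isupper():
            entities.append((tok, "ENTITY"))   # capitalized but unknown
    return entities

print(toy_ner(["we", "met", "Alice", "in", "London"]))
# [('Alice', 'ENTITY'), ('London', 'LOCATION')]
```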
<h4>Sentiment analysis</h4>
<p>Another machine learning technique used for contextual understanding is sentiment analysis. Sentiment analysis involves analyzing the emotional tone of a piece of text data, such as whether it is positive, negative, or neutral. By analyzing the sentiment of speech, machines can gain a better understanding of the speaker's emotions and intent.</p>
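<p>The simplest form of sentiment analysis is lexicon-based: count positive and negative words and compare. The word lists below are tiny placeholders; practical systems use much larger lexicons or trained classifiers:</p>

```python
# Minimal lexicon-based sentiment scorer (illustrative word lists).
POSITIVE = {"good", "great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "sad", "awful", "hate"}

def sentiment(text):
    words = text.lower().split()
    # Positive hits minus negative hits gives a crude polarity score
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("what a great and happy day"))  # positive
```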
<h4>Deep learning</h4>
<p>Deep learning is another machine learning technique that is commonly used for contextual understanding. Deep learning algorithms are designed to learn complex patterns in data, and are often used for tasks such as speech recognition and image recognition.</p>
<h5>Recurrent neural networks (RNNs)</h5>
<p>One common type of deep learning algorithm used for contextual understanding is the recurrent neural network (RNN). RNNs are designed to analyze sequences of data, such as sentences or audio clips. By analyzing the sequence of words or sounds in speech, RNNs can gain a better understanding of the context in which words are being used.</p>
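<p>The core of the idea can be sketched with a single scalar RNN cell in plain Python: the new hidden state depends on both the current input and the previous hidden state, which is how an RNN carries context along a sequence. The weights here are fixed, made-up numbers; in a real network they are learned:</p>

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One scalar RNN cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence):
    h = 0.0                    # initial hidden state
    for x in sequence:         # the hidden state accumulates context
        h = rnn_step(x, h)
    return h
```

Because the hidden state is threaded through every step, the output for an input depends on everything seen before it, unlike a model that looks at each token in isolation.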
<h5>Convolutional neural networks (CNNs)</h5>
<p>Another type of deep learning algorithm used for contextual understanding is the convolutional neural network (CNN). CNNs are often used for image recognition tasks, but can also be used for speech recognition. By analyzing time-frequency representations of the audio, such as spectrograms, CNNs can gain a better understanding of the context in which words are being used.</p>
<p><a id="hybrid-approaches"></a></p>
<h3>Hybrid approaches</h3>
<p>In addition to these machine learning techniques, there are also hybrid approaches that combine multiple techniques to improve contextual understanding. For example, some systems use a combination of NLP techniques and deep learning algorithms to transcribe speech with high accuracy and understanding of the context in which words are being used.</p>
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>Machine learning techniques are critical for improving contextual understanding in automated speech-to-text transcription. NLP techniques, such as POS tagging and NER, can help machines to better understand the context in which words are being used. Deep learning algorithms, such as RNNs and CNNs, can help machines to learn complex patterns in speech and improve accuracy. As the technology continues to evolve, it is likely that new machine learning techniques will be developed to further improve contextual understanding and accuracy in automated speech-to-text transcription.</p>How to Prepare Python Project to Pass It Over to Another Developer2023-03-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-30:/how_to_prepare_python_project_to_pass_it_over_to_another_developer/<p>Preparing a Python project to be passed on to another developer requires attention to detail and documentation to ensure that the new developer can understand the project and make modifications with ease. Here is a long and detailed guide on how to …</p><p>Preparing a Python project to be passed on to another developer requires attention to detail and documentation to ensure that the new developer can understand the project and make modifications with ease. Here is a long and detailed guide on how to prepare a Python project to be handed over to another developer:</p>
<p><strong>Organize your files</strong>
Make sure that your project files are organized logically and in a way that is easy to navigate. Create folders for each major section of your project, such as source code, data, and documentation.</p>
<p><strong>Use a version control system</strong>
Use a version control system such as Git to keep track of changes to your code. This will make it easier for the new developer to understand the history of the project and any changes that have been made.</p>
<p><strong>Document your code</strong>
Document your code using comments and docstrings. Comments should explain the purpose of each section of code, while docstrings should explain the purpose of each function and class. This will make it easier for the new developer to understand how your code works.</p>
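<p>As a minimal sketch, a function documented this way might look like the following (the function itself is just an example):</p>

```python
def moving_average(values, window):
    """Return the simple moving average of `values`.

    Args:
        values: sequence of numbers to average.
        window: number of trailing items in each average.

    Returns:
        A list with one average per full window.
    """
    # The docstring says what the function does and how to call it;
    # inline comments like this one explain the why of a specific line.
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]
```

Tools such as <code>help()</code> and most IDEs surface the docstring to the next developer automatically, which is exactly why docstrings pay off during a handover.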
<p><strong>Write a README file</strong>
Create a README file that explains the purpose of your project, how to run it, and any dependencies that are required. This should also include instructions on how to set up a development environment and how to run tests.</p>
<p><strong>Include requirements files</strong>
Include a requirements.txt file that lists all of the dependencies required to run your project. This will make it easier for the new developer to set up a development environment.</p>
<p><strong>Add instructions on how to recreate a virtual environment</strong>
Use a virtual environment to create an isolated environment for your project, and document the exact commands needed to recreate it. This ensures the new developer can reproduce the same environment you used while developing the project.</p>
<p><strong>Use consistent coding style</strong>
Use a consistent coding style throughout your project to make it easier to read and understand. Use a tool such as flake8 or black to check your code for compliance with the PEP 8 style guidelines.</p>
<p><strong>Include test cases</strong>
Include test cases that cover all major functionality of your project. This will make it easier for the new developer to ensure that any modifications they make do not break existing functionality.</p>
<p><strong>Include a license</strong>
Include a license file that specifies the terms under which your project can be used, modified, and distributed. This will protect your project and ensure that the new developer understands the legal implications of using and modifying your code.</p>
<p><strong>Provide ongoing support</strong>
Provide ongoing support to the new developer as they take over the project. This may involve answering questions, providing documentation, or even offering training.</p>
<p>Preparing a Python project to be passed on to another developer requires attention to detail and documentation. By following these guidelines, you can ensure that the new developer can understand your project and make modifications with ease.</p>
<p><strong>NOTE:</strong> There is an interesting write-up proposing an approach that keeps projects always ready to hand over to another person: <a href="https://jmmv.dev/2021/04/always-be-quitting.html">Always be quitting - Julio Merino (jmmv.dev)</a></p>DCA Investing Strategy Variants2023-03-26T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-26:/dca_investing_strategy_variants/<p>Investing can be a daunting task, especially for those new to the game. The world of finance is full of complicated terminology and sophisticated techniques, making it difficult for the average person to know where to start. Fortunately, there are a few …</p><p>Investing can be a daunting task, especially for those new to the game. The world of finance is full of complicated terminology and sophisticated techniques, making it difficult for the average person to know where to start. Fortunately, there are a few simple strategies that can help novice investors get started with building their portfolio. One of the most popular and effective strategies is the Dollar-Cost Averaging (DCA) method.</p>
<p>Dollar-Cost Averaging (DCA) is a strategy where an investor purchases a fixed amount of a particular asset, such as stocks or bonds, at regular intervals over a period of time, regardless of the price of the asset. The idea behind DCA is to reduce the impact of market fluctuations by buying more shares when prices are low and fewer shares when prices are high. This can help investors avoid the temptation to buy a large amount of an asset all at once, only to see the price drop shortly after.</p>
<p>There are several variants of the DCA strategy that investors can use to tailor their investment approach to their individual needs and preferences. Here are some of the most common variants of the DCA strategy:</p>
<h3>Traditional DCA</h3>
<p>The traditional DCA strategy involves investing a fixed amount of money at regular intervals, such as monthly or quarterly, into the same asset or fund. This is the simplest and most common form of DCA, as it involves consistent, automatic investments over a long period of time.</p>
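<p>A short calculation shows the mechanism with made-up prices: the same fixed contribution buys more shares in cheap months, so the average cost per share lands below the simple average of the prices (it is their harmonic mean):</p>

```python
def dca_average_cost(prices, amount_per_period=100.0):
    """Average cost per share when investing a fixed amount at each price."""
    shares = sum(amount_per_period / p for p in prices)  # cheap months buy more shares
    invested = amount_per_period * len(prices)
    return invested / shares

prices = [10.0, 8.0, 12.5, 10.0]           # hypothetical monthly prices
print(round(dca_average_cost(prices), 2))  # 9.88, below the 10.125 mean price
```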
<h3>Value Averaging</h3>
<p>Value averaging is a more dynamic form of DCA that involves adjusting the amount invested based on the performance of the asset. With value averaging, the investor sets a target value for the investment and adjusts the amount invested each period to maintain the target value. For example, if the value of the investment increases, the investor will invest less money in the next period, whereas if the value decreases, the investor will invest more money to bring the value back up to the target level.</p>
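<p>A small sketch with hypothetical numbers makes the adjustment concrete: the contribution each period is simply the gap between the predetermined target value and the current holdings, so it shrinks after gains and grows after losses:</p>

```python
def value_averaging_contribution(current_value, target_value):
    """Cash to add this period (a negative result means selling the excess)."""
    return target_value - current_value

# Target path grows by 1000 per period; market returns are hypothetical.
holdings = 0.0
for target, market_return in zip([1000, 2000, 3000], [1.10, 0.90, 1.05]):
    cash = value_averaging_contribution(holdings, target)
    print(f"contribute {cash:.0f} to reach {target}")
    holdings = target * market_return      # value drifts after topping up
# contribute 1000 to reach 1000
# contribute 900 to reach 2000   (less, because the market rose)
# contribute 1200 to reach 3000  (more, because the market fell)
```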
<h3>Constant Proportion Portfolio Insurance (CPPI)</h3>
<p>CPPI is a more complex form of DCA that involves setting a floor value for the investment and adjusting the allocation of the portfolio between a risk-free asset and a risky asset to maintain the floor value. The risk-free asset acts as a cushion to prevent the portfolio from falling below the floor value, while the risky asset provides potential upside. This strategy can be particularly useful for investors who want to limit their downside risk while still having exposure to the potential upside of the market.</p>
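<p>The core CPPI rule can be sketched in a few lines: the risky allocation is a multiple of the "cushion" (portfolio value above the floor), and the remainder sits in the risk-free asset. The floor and multiplier below are hypothetical values:</p>

```python
def cppi_allocation(portfolio_value, floor, multiplier=3.0):
    """Split a portfolio between risky and risk-free assets under CPPI."""
    cushion = max(portfolio_value - floor, 0.0)         # value above the floor
    risky = min(multiplier * cushion, portfolio_value)  # never exceed the total
    return risky, portfolio_value - risky

print(cppi_allocation(10_000, 8_000))  # (6000.0, 4000.0)
```

As the portfolio falls toward the floor the cushion shrinks, so the rule automatically moves money out of the risky asset, which is how CPPI limits downside while keeping upside exposure.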
<h3>Asset Allocation DCA</h3>
<p>Asset allocation DCA involves investing a fixed amount of money at regular intervals into multiple assets or funds, rather than just one. This approach can help investors diversify their portfolio and reduce the risk of having all their eggs in one basket. The investor sets a target allocation for each asset class, and the DCA strategy is used to maintain the target allocation over time.</p>
<h3>Reverse DCA</h3>
<p>Reverse DCA is a strategy that involves selling a fixed amount of an asset at regular intervals, rather than buying it. This strategy is often used by retirees or investors who want to draw down their portfolio gradually over time. Reverse DCA can help investors avoid selling all their assets at once and potentially locking in losses during a market downturn.</p>
<h3>Step-Up DCA</h3>
<p>With step-up DCA, the investor starts with a small investment and gradually increases the amount invested over time. This strategy can be particularly useful for investors who are just starting out and want to ease into investing, or for investors who want to build up their investments gradually.</p>
<h3>Seasonal DCA</h3>
<p>Seasonal DCA involves investing in an asset only during a specific season or time of year. For example, an investor might choose to invest in a particular stock or fund only during the summer months when the company typically experiences higher sales or during a specific quarter when the company releases its earnings report.</p>
<h3>Dynamic DCA</h3>
<p>Dynamic DCA is a strategy that adjusts the investment amount based on market conditions or other factors, rather than investing a fixed amount at regular intervals. For example, an investor might increase their investment during a market dip or decrease their investment during a market rally.</p>
<h3>Fixed Period DCA</h3>
<p>Fixed period DCA involves investing a fixed amount of money over a set period of time, rather than indefinitely. For example, an investor might choose to invest $1,000 per month for a year, after which they reassess their investment strategy.</p>
<h3>Dividend Reinvestment DCA</h3>
<p>With dividend reinvestment DCA, investors use the dividends earned from an investment to purchase additional shares of the same asset or fund. This strategy can help investors increase their investment over time without having to contribute additional funds from their own pockets.</p>
<h3>Fund Transfer DCA</h3>
<p>Fund transfer DCA involves transferring a fixed amount of money from one asset or fund to another at regular intervals, rather than investing a fixed amount into a single asset. This strategy can be useful for investors who want to diversify their portfolio across multiple assets.</p>
<h3>Progressive DCA</h3>
<p>Progressive DCA involves gradually increasing the investment amount over time, typically by a fixed percentage or dollar amount. For example, an investor might start with a $100 investment and increase it by $10 each month.</p>
<h3>Threshold DCA</h3>
<p>Threshold DCA involves investing a fixed amount only when the price of an asset falls below a certain threshold. This strategy can be useful for investors who want to take advantage of buying opportunities during market dips.</p>
<h3>Momentum DCA</h3>
<p>Momentum DCA involves investing in assets that have been performing well over a recent period of time. For example, an investor might choose to invest in stocks that have been experiencing a positive trend in their price or earnings.</p>
<h3>Tax-Loss Harvesting DCA</h3>
<p>With tax-loss harvesting DCA, investors sell losing assets to realize a tax deduction, and use the proceeds to invest in new assets. This strategy can help investors offset capital gains and reduce their tax liability.</p>
<h2>Conclusion</h2>
<p>The Dollar-Cost Averaging strategy can be a powerful tool for investors who want to build a diversified portfolio over time. While the traditional DCA approach is the simplest and most common form of the strategy, there are several variants that investors can use to tailor the approach to their individual needs and preferences. By understanding the different variants of the DCA strategy, investors can choose the one that best suits their investment goals and risk tolerance.</p>Punctuation Restoration2023-03-15T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-15:/punctuation-restoration/<p>Punctuation restoration using machine learning (ML) is a process of predicting the appropriate punctuation marks in a text that is missing or poorly punctuated. This technique has become increasingly popular in recent years due to the growing volume of unstructured text data …</p><p>Punctuation restoration using machine learning (ML) is a process of predicting the appropriate punctuation marks in a text that is missing or poorly punctuated. This technique has become increasingly popular in recent years due to the growing volume of unstructured text data available in digital form, such as social media posts, online articles, and chat logs.</p>
<p>Punctuation plays a crucial role in the comprehension of text. Proper punctuation helps to convey the meaning, tone, and structure of a sentence. However, punctuation can be subjective and inconsistent, and the lack of punctuation can lead to ambiguity and misinterpretation. Therefore, restoring punctuation in a text is an essential task that can improve the readability and accuracy of the text.</p>
<p>Punctuation restoration using ML involves the use of algorithms and statistical models to predict the correct punctuation marks in a given text. The process typically involves three main steps: data preparation, feature extraction, and model training.</p>
<h2>Punctuation restoration steps</h2>
<h3>Data preparation</h3>
<p>In the data preparation step, the text data is collected and preprocessed. This may involve removing unnecessary characters, such as HTML tags, and converting the text to a standard format. The text data is then segmented into individual sentences or phrases.</p>
<h3>Feature extraction</h3>
<p>In the feature extraction step, the text data is converted into a set of numerical features that can be used by the ML model. Common features used in punctuation restoration include word frequency, part-of-speech (POS) tags, and context information. These features are extracted using NLP techniques such as tokenization, stemming, and syntactic parsing.</p>
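<p>A minimal sketch of such per-token features follows; the feature names and the sentence markers <code>&lt;s&gt;</code>/<code>&lt;/s&gt;</code> are invented for the example, and a real pipeline would feed dicts like these into a trained classifier:</p>

```python
def extract_features(tokens, i):
    """Toy feature dict for token i of a segment."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "is_last": i == len(tokens) - 1,   # segment end often needs a period
    }

tokens = "hello world how are you".split()
print(extract_features(tokens, 0))
```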
<h3>Model training</h3>
<p>In the model training step, the ML model is trained using a labeled dataset of punctuated text. The model learns to predict the appropriate punctuation marks based on the extracted features. Various ML algorithms can be used for this task, including decision trees, random forests, and deep neural networks.</p>
<h3>Punctuation restoration</h3>
<p>Once the model is trained, it can be used to restore punctuation in new text data. The input text is segmented into sentences or phrases, and the extracted features are fed into the model to predict the appropriate punctuation marks. The output text is then post-processed to ensure that the punctuation marks are correctly placed.</p>
<h2>Challenges</h2>
<p>There are several challenges associated with punctuation restoration using ML. One of the main challenges is dealing with the subjective nature of punctuation. Punctuation rules can vary depending on the context and language, making it difficult to develop a universal model. Another challenge is dealing with the noise and errors in the text data, which can affect the accuracy of the model.</p>
<p>Despite these challenges, punctuation restoration using ML has shown promising results in various applications. For example, it can be used to improve the accuracy of speech recognition systems, enhance the readability of machine-generated text, and improve the quality of automatic translations.</p>
<h2>References</h2>
<ul>
<li><a href="https://github.com/topics/punctuation">punctuation</a> - GitHub Topic</li>
<li><a href="https://github.com/notAI-tech/deepsegment">deepsegment</a> - A sentence segmenter that actually works!</li>
<li><a href="https://github.com/notAI-tech/fastPunct">fastPunct</a> - Punctuation restoration and spell correction experiments.</li>
<li><a href="https://github.com/bedapudi6788/deepcorrect">deepcorrect</a> - Text and Punctuation correction with Deep Learning</li>
<li><a href="https://github.com/kaituoxu/X-Punctuator">X-Punctuator</a> - A PyTorch implementation of a punctuation prediction system using (B)LSTM, which automatically adds suitable punctuation into text without punctuation.</li>
</ul>Salt and Pepper in the Context of Hashing/Obfuscation2023-03-14T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-14:/salt-and-pepper-for-hashing/<p>In the context of hashing/obfuscation, "salt and pepper" refer to two different techniques used to enhance the security of hash functions.</p>
<h2>Salt</h2>
<p>Salt is a random value that is added to the input before it is hashed. This makes it much …</p><p>In the context of hashing/obfuscation, "salt and pepper" refer to two different techniques used to enhance the security of hash functions.</p>
<h2>Salt</h2>
<p>Salt is a random value that is added to the input before it is hashed. This makes it much more difficult for attackers to use precomputed hash tables or rainbow tables to attack the hash. By using a unique salt for each input, even if two inputs have the same value, their hashes will be different, making it much harder for attackers to determine the original input value.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">hash_with_salt</span><span class="p">(</span><span class="n">password</span><span class="p">):</span>
    <span class="c1"># Generate a random salt</span>
    <span class="n">salt</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">urandom</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
    <span class="c1"># Add the salt to the password and hash it using SHA256</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">salt</span> <span class="o">+</span> <span class="n">password</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="c1"># Return the salt and hashed password as a tuple</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span><span class="p">)</span>
<span class="c1"># Example usage</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"mysecurepassword"</span>
<span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hash_with_salt</span><span class="p">(</span><span class="n">password</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Salt: </span><span class="si">{</span><span class="n">salt</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hashed Password: </span><span class="si">{</span><span class="n">hashed_password</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<h2>Pepper</h2>
<p>Pepper, on the other hand, is a secret key that is used to further obscure the hash output. Unlike a salt, which is stored alongside the hash, the pepper is kept secret and never stored. This makes it much harder for attackers to reverse-engineer the original input value from the hash output.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hmac</span>
<span class="kn">import</span> <span class="nn">hashlib</span>
<span class="k">def</span> <span class="nf">hash_with_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">):</span>
    <span class="c1"># Hash the password using HMAC-SHA256 with the secret pepper</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hmac</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">pepper</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">),</span> <span class="n">password</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">),</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">)</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="c1"># Return the hashed password</span>
    <span class="k">return</span> <span class="n">hashed_password</span>
<span class="c1"># Example usage</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"mysecurepassword"</span>
<span class="n">pepper</span> <span class="o">=</span> <span class="s2">"mysecretpepper"</span>
<span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hash_with_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hashed Password: </span><span class="si">{</span><span class="n">hashed_password</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<h2>Using salt and pepper jointly</h2>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">import</span> <span class="nn">hmac</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">hash_with_salt_and_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">):</span>
    <span class="c1"># Generate a random salt</span>
    <span class="n">salt</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">urandom</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
    <span class="c1"># Add the salt to the password and hash it using SHA256</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">salt</span> <span class="o">+</span> <span class="n">password</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span><span class="o">.</span><span class="n">digest</span><span class="p">()</span>
    <span class="c1"># Hash the hashed password using HMAC-SHA256 with the secret pepper</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hmac</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">pepper</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">),</span> <span class="n">hashed_password</span><span class="p">,</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">)</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="c1"># Return the salt and hashed password as a tuple</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span><span class="p">)</span>
<span class="c1"># Example usage</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"mysecurepassword"</span>
<span class="n">pepper</span> <span class="o">=</span> <span class="s2">"mysecretpepper"</span>
<span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hash_with_salt_and_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Salt: </span><span class="si">{</span><span class="n">salt</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hashed Password: </span><span class="si">{</span><span class="n">hashed_password</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>Python - Is There Any Difference Between Attribute and Property?2023-03-09T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-09:/python-difference-betwee-attribute-and-property/<p>X::<a href="https://www.safjan.com/the-difference-between-class-attribute-or-property-and-the-class-variable/">The Difference Between Class Attribute or Property and the Class Variable</a></p>
<p>In Python there is a difference between an attribute and a property, although they are often used interchangeably.</p>
<p>An attribute is a variable that belongs to an instance of a …</p><p>X::<a href="https://www.safjan.com/the-difference-between-class-attribute-or-property-and-the-class-variable/">The Difference Between Class Attribute or Property and the Class Variable</a></p>
<p>In Python there is a difference between an attribute and a property, although they are often used interchangeably.</p>
<p>An attribute is a variable that belongs to an instance of a class. It is defined within the class, and its value can be accessed or modified using dot notation on the instance.</p>
<p>A property, on the other hand, is a special kind of attribute that is accessed or modified using getter and setter methods. The getter method is used to retrieve the value of the property, and the setter method is used to set the value of the property.</p>
<p>Here is an example that demonstrates the difference between an attribute and a property:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span> <span class="c1"># This is an attribute</span>
    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_name</span> <span class="c1"># This is a property getter method</span>
    <span class="nd">@name</span><span class="o">.</span><span class="n">setter</span>
    <span class="k">def</span> <span class="nf">name</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_name</span> <span class="o">=</span> <span class="n">value</span> <span class="c1"># This is a property setter method</span>
</code></pre></div>
<p>In the example above, the <code>Person</code> class defines <code>name</code> as a property using the <code>@property</code> and <code>@name.setter</code> decorators. The getter returns the value of the underlying <code>_name</code> attribute, and the setter assigns to it. Note that because <code>name</code> is a property, the assignment <code>self.name = name</code> in <code>__init__</code> invokes the setter rather than creating a plain attribute.</p>
<p>With the <code>name</code> property defined in this way, you can get and set the <code>name</code> attribute using the property methods, like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">person</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"John"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="c1"># Output: John</span>
<span class="n">person</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s2">"Jane"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="c1"># Output: Jane</span>
</code></pre></div>
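<p>Getter and setter methods earn their keep when you add logic to them, such as validation. Below is a minimal sketch of the same <code>Person</code> class extended with an illustrative non-empty-name check (the validation rule is an assumption for the example, not part of the original article):</p>

```python
class Person:
    def __init__(self, name):
        self.name = name  # goes through the property setter below

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Validation is the typical reason to turn an attribute into a property
        if not value:
            raise ValueError("name must not be empty")
        self._name = value


person = Person("John")
try:
    person.name = ""  # the setter raises, so the old value is kept
except ValueError:
    pass
print(person.name)  # Output: John
```

<p>Callers still use plain dot notation, so an attribute can be upgraded to a property later without changing any calling code.</p>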
<blockquote>
<p>An <strong>attribute</strong> is a <strong>simple variable</strong> that <strong>belongs to an instance of a class</strong>, while a <strong>property</strong> is a <strong>special kind of attribute</strong> that is accessed or modified using <strong>getter and setter methods</strong>.</p>
</blockquote>The Difference Between Class Attribute or Property and the Class Variable2023-03-09T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-09:/the-difference-between-class-attribute-or-property-and-the-class-variable/<p>In Python, you can store data within a class using properties/attributes or class variables.</p>
<h2>Properties/Attributes</h2>
<p>Properties, also called attributes, are variables that store data within a class instance. They are defined within the class, but outside of any methods. Here's …</p><p>In Python, you can store data within a class using properties/attributes or class variables.</p>
<h2>Properties/Attributes</h2>
<p>Properties, also called attributes, are variables that store data within a class instance. They are defined within the class, but outside of any methods. Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
</code></pre></div>
<p>In the example above, <code>name</code> and <code>age</code> are attributes of the <code>Person</code> class. They are created and assigned values within the <code>__init__</code> method using the <code>self</code> keyword. You can access and modify these attributes using dot notation, like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">person</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"John"</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="c1"># Output: John</span>
<span class="n">person</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="mi">31</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">age</span><span class="p">)</span> <span class="c1"># Output: 31</span>
</code></pre></div>
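<p>Because each attribute lives on its own instance, modifying one instance never affects another. A quick sketch using the same <code>Person</code> class:</p>

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age


p1 = Person("John", 30)
p2 = Person("Jane", 28)
p1.age = 31           # only p1's attribute changes
print(p1.age)  # Output: 31
print(p2.age)  # Output: 28
```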
<h2>Class Variables</h2>
<p>Class variables are variables that are shared among all instances of a class. They are defined directly in the class body, outside of any methods, and are typically accessed using the class name. Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
        <span class="n">Person</span><span class="o">.</span><span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>In the example above, <code>count</code> is a class variable that is used to keep track of the number of <code>Person</code> instances that have been created. It is incremented every time a new instance is created within the <code>__init__</code> method. You can access class variables using the class name, like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">person1</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"John"</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span>
<span class="n">person2</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"Jane"</span><span class="p">,</span> <span class="mi">28</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">Person</span><span class="o">.</span><span class="n">count</span><span class="p">)</span> <span class="c1"># Output: 2</span>
</code></pre></div>
<h3>Difference</h3>
<blockquote>
<p>The main difference between attributes/properties and class variables is that <strong>attributes are specific to each instance</strong> of a class, while <strong>class variables are shared among all instances</strong>.</p>
</blockquote>
<p>Attributes are defined within the <code>__init__</code> method and can be different for each instance. Class variables are defined outside of any methods and are shared by all instances.</p>
<p>Another difference is that attributes/properties are accessed and modified using dot notation on an instance of a class, while class variables are typically accessed using the class name. Reading a class variable through an instance also works, as long as no instance attribute of the same name shadows it.</p>
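<p>Assigning to a class variable through an instance is a classic pitfall: instead of changing the shared variable, it creates a new instance attribute that shadows it. A minimal sketch:</p>

```python
class Person:
    count = 0


p = Person()
p.count = 10            # creates an instance attribute on p, shadowing the class variable
print(p.count)          # Output: 10 (the instance attribute)
print(Person.count)     # Output: 0  (the class variable is unchanged)
```

<p>To update the shared value, always assign through the class name, as the <code>Person.count += 1</code> line in the example above does.</p>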
<p>In general, if you need to store data that is specific to each instance of a class, use attributes/properties. If you need to store data that is shared among all instances of a class, use class variables.</p>
<h2>References</h2>
<p><a href="https://stackoverflow.com/questions/22822710/difference-between-class-variable-and-class-attribute">python - difference between class variable and class attribute - Stack Overflow</a></p>