## How to parameterize a DBX Python Notebook #841
Edit: I did not realise you had specified a notebook task; this answer has been updated, with the original comment left underneath.

To pass a value from a local environment variable into a notebook, define the environment variable in the cluster configuration and read it inside the notebook, e.g.:

```yaml
basic-cluster: &basic-cluster
  new_cluster:
    spark_version: "10.4.x-cpu-ml-scala2.12"
    spark_conf:
      <<: *basic-spark-conf
      spark.databricks.passthrough.enabled: false
    spark_env_vars:
      DATABASE_NAME: "{{ env['DATABASE_NAME'] }}"
```
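Inside the notebook the value then arrives through the process environment. A minimal sketch, assuming the cluster configuration above (only `DATABASE_NAME` comes from that snippet; the fallback value is illustrative):

```python
import os

# DATABASE_NAME is injected by the cluster's spark_env_vars, so it is visible
# to any process on the cluster, including the notebook.
database_name = os.environ.get("DATABASE_NAME", "dev")  # the "dev" fallback is illustrative

print(f"Using database: {database_name}")
```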
See the original comment below for how to use Jinja with the deployment file.

**Original comment**

It is probably better practice to deploy separate workflows for separate environments, but to answer your question you can use the Jinja support functionality (Jinja Support) combined with environment variables. Also see Passing Parameters. Your deployment file should look something like this:

```yaml
build:
  python: "pip"
environments:
  default:
    workflows:
      - name: "my-workflow"
        tasks:
          - task_key: "task1"
            python_wheel_task:
              package_name: "some-pkg"
              entry_point: "some-ep"
              parameters: ["database_name", "{{ env['DATABASE_NAME'] }}"]
```

**Deploy via CLI**

```bash
export DATABASE_NAME=dev
dbx deploy --environment default --deployment-file conf/deployment.yml.j2 "my-workflow"
```

**Launch via CLI**

```bash
dbx launch --environment default --parameters='{"python_params":["database_name","${DATABASE_NAME}"]}' "my-workflow"
```

Note that you will need to append the `.j2` extension to the deployment file so that the Jinja templating is applied.
I tried to follow your steps with a `spark_python_task`. Now I am trying to access this database name in my `name_of_python_notebook_converted_to_job.py`, and I am calling the dbx CLI like: `dbx deploy --deployment-file conf/deployment.yaml.j2 "name_of_my_work_flow"`. It looks like my job can't read from `sys.argv`; I am getting the error `JSONDecodeError: Expecting value: line 1 column 1 (char 0)` on the line `db_name = json.loads(sys.argv[1]).get('python_params', [])[1]`.
If I use `export DATABASE_NAME=dev` …
**JSONDecodeError**

Notebooks use *widgets* to pass parameters, so you cannot pass parameters to a notebook task the way you would for an entrypoint in a Python wheel. You either need to use widgets, or define environment variables *on the cluster* using `spark_env_vars`. That way the environment variables will be available to the notebook through `os.environ`.

**Environment Not Found Error**

For the error `environment dev not found in the project file .dbx/project.json`, the environments defined in your deployment YAML must match those in your `project.json` file:

```yaml
environments:
  default:
```

You can use the `dbx configure` command to set up new environments in your project if you need more than one. If not, simply remove the `-e` / `--environment` option from your CLI commands and the "default" environment will be used instead.

dbx configure docs: https://dbx.readthedocs.io/en/latest/reference/cli/#dbx-configure
project.json docs: https://dbx.readthedocs.io/en/latest/reference/project/?h=environment

*FYI for next time, this kind of question is probably more appropriate for Stack Overflow than a GitHub issue.*
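If you go the widget route, a minimal sketch of the notebook side (the widget name `db_name` is only an example taken from this thread):

```python
# Inside a Databricks notebook task, base_parameters arrive as widgets.
# Note: dbutils is only defined in the Databricks notebook context.
dbutils.widgets.text("db_name", "dev")     # declare the widget with a default value
db_name = dbutils.widgets.get("db_name")   # value supplied by the job, or the default

print(f"Running against database: {db_name}")
```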
Thanks for your reply. Well, I converted the notebook to a pure Python file; no magic commands, no widgets, and no `dbutils` can or should be used, as we need to run unit tests locally. Hence, I was expecting this plain Python file to be able to take the argument value from the CLI. It looks like it can't parse `dbx launch --job "my_job_name" --parameter='{"db_name": "my_db_name"}'`. My question is: why is the parameter's first field (the key) `db_name` not parsed into my `sys.argv`?

`db_name = json.loads(sys.argv[1]).get('python_params', [])[1]`
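If dbx forwards `python_params` to the script as plain positional arguments, which the `Expecting value: line 1 column 1 (char 0)` error suggests (since `sys.argv[1]` would then be the literal string `database_name` rather than JSON), reading them could look like the sketch below; the parameter names are the ones used earlier in this thread:

```python
import sys

# Hypothetical argv for a task deployed with
#   parameters: ["database_name", "{{ env['DATABASE_NAME'] }}"]
# e.g. sys.argv == ["<script>", "database_name", "dev"]
args = sys.argv[1:]

# Pair up key/value-style parameters instead of calling json.loads(sys.argv[1]).
params = dict(zip(args[::2], args[1::2]))
db_name = params.get("database_name", "dev")
print(f"db_name resolved to: {db_name}")
```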
The overall goal is to make the database name (prod/dev/test) dynamic for each notebook in the dbx job and to pass that database name directly from Jenkins, without modifying the notebook files or the deployment.yaml for each environment.

I am creating a dbx job with a few Databricks notebooks, and I want to pass the database name dynamically into each Python notebook without using Databricks widgets (assuming I use `sys.argv` to read the dbx CLI parameter). I want to run my job something like:

`dbx launch --job "my_job_name" --parameter='{"db_name": "my_db_name"}'`

and it should send that information to my job and all associated notebooks, which will read it via conf/deployment.yaml. In deployment.yaml I will have something like this (a notebook-side sketch follows the snippet):
```yaml
notebook_task:
  notebook_path: "/Reposs/My_github_repo/blala/notebookname"
  base_parameters:
    db_name: "{{env.db_name_from_env}}"
```