
Xebia

XUS_IN_Data Engineer

Sorry, this job was removed at 04:42 p.m. (IST) on Tuesday, Sep 24, 2024
In-Office
IN

About Xebia

Xebia is a trusted advisor in the modern era of digital transformation, serving hundreds of leading brands worldwide with end-to-end IT solutions. The company has experts specializing in technology consulting, software engineering, AI, digital products and platforms, data, cloud, intelligent automation, agile transformation, and industry digitization. In addition to providing high-quality digital consulting and state-of-the-art software development, Xebia has a host of standardized solutions that substantially reduce the time-to-market for businesses.

Xebia also offers a diverse portfolio of training courses to help forward-thinking organizations upskill and educate their workforce to capitalize on the latest digital capabilities. The company has a strong presence across 16 countries, with development centres in the US, Latin America, Western Europe, Poland, the Nordics, the Middle East, and Asia Pacific.


Responsibilities

  • Establish scalable, efficient, automated processes for data analysis, data model development, validation, and implementation.
  • Work closely with analysts and data scientists to understand the impact on downstream data models.
  • Write efficient, well-organized software to ship products in an iterative, continual-release environment.
  • Contribute to and promote good software engineering practices across the team.
  • Communicate clearly and effectively to technical and non-technical audiences.


Minimum Qualifications:

  • University or advanced degree in engineering, computer science, mathematics, or a related field.
  • Strong hands-on experience in Databricks using PySpark and Spark SQL (Unity Catalog, workflows, optimization techniques).
  • Experience with at least one cloud provider (GCP preferred).
  • Strong experience working with relational SQL databases.
  • Strong experience with an object-oriented/functional scripting language: Python.
  • Working knowledge of transformation tools (dbt preferred).
  • Ability to work on the Linux platform.
  • Strong knowledge of data pipeline and workflow management tools (Airflow).
  • Working knowledge of GitHub/Git toolkit.
  • Expertise in standard software engineering methodology, e.g., unit testing, code reviews, design documentation.
  • Experience creating data pipelines that prepare data appropriately for ingestion and consumption.
  • Experience maintaining and optimizing databases/filesystems for production use in reporting and analytics.
  • Ability to work in a collaborative environment and interact effectively with both technical and non-technical team members. Good verbal and written communication skills.


Questionnaire

Scenario 1: Data Pipeline Design on GCP


You are tasked with designing a data pipeline to process and analyze log data generated by a web application. The log data is stored in Google Cloud Storage (GCS) and needs to be ingested, transformed, and loaded into BigQuery for reporting and analysis.


Requirements:

Ingestion: The log data should be ingested from GCS to a staging area in BigQuery.


Transformation: Apply necessary transformations such as parsing JSON logs, filtering out irrelevant data, and aggregating metrics.


Loading: Load the transformed data into a final table in BigQuery for analysis.


Orchestration: The entire pipeline should be orchestrated to run daily.


Monitoring and Alerting: Set up monitoring and alerting to ensure the pipeline runs successfully and errors are detected promptly.



Questions:


1) Ingestion:


What GCP services would you use to ingest the log data from GCS to BigQuery, and why?

Provide an example of how you would configure this ingestion process.
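
As one illustration, a batch load from GCS into a BigQuery staging table could be configured with the `bq` CLI (the bucket, dataset, table, schema file, and partitioning field names below are all hypothetical):

```shell
# Load newline-delimited JSON logs from GCS into a staging table,
# partitioned by the record timestamp for cheaper downstream queries.
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --time_partitioning_field=timestamp \
  my_dataset.staging_logs \
  "gs://my-app-logs/2024-09-24/*.json" \
  ./logs_schema.json
```

The same load can be expressed as a BigQuery load job through the API, or replaced by an external table over GCS when a physical copy in the staging area is not required.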



2) Transformation:


Describe how you would implement the transformation step. What tools or services would you use?

Provide an example transformation you might perform on the log data.
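
For instance, the parse-filter-aggregate step could be prototyped in plain Python before porting it to Dataflow or BigQuery SQL — a minimal sketch in which the log fields `endpoint`, `status`, and `latency_ms` are hypothetical:

```python
import json
from collections import defaultdict

RAW_LOGS = [
    '{"endpoint": "/home", "status": 200, "latency_ms": 41}',
    '{"endpoint": "/home", "status": 500, "latency_ms": 903}',
    '{"endpoint": "/healthz", "status": 200, "latency_ms": 2}',
    'not-json',  # malformed line, should be dropped
]

def transform(lines):
    """Parse JSON log lines, filter out irrelevant rows, aggregate per endpoint."""
    agg = defaultdict(lambda: {"requests": 0, "errors": 0, "total_latency_ms": 0})
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # filter out unparseable rows
        if rec.get("endpoint") == "/healthz":
            continue  # filter out health checks as irrelevant data
        bucket = agg[rec["endpoint"]]
        bucket["requests"] += 1
        bucket["errors"] += int(rec["status"] >= 500)
        bucket["total_latency_ms"] += rec["latency_ms"]
    return dict(agg)

print(transform(RAW_LOGS))
# → {'/home': {'requests': 2, 'errors': 1, 'total_latency_ms': 944}}
```

In the pipeline itself the same logic would typically live in a Dataflow pipeline or a scheduled BigQuery SQL statement rather than a standalone script.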


3) Loading:


How would you design the schema for the final BigQuery table to ensure efficient querying?

What considerations would you take into account when loading data into BigQuery?


4) Orchestration:


Which GCP service would you use to orchestrate the data pipeline, and why?

Outline a high-level workflow for the daily orchestration of the pipeline.


5) Monitoring and Alerting:


What strategies would you use to monitor the pipeline's performance?

How would you set up alerts to notify you of any issues?



Scenario 2: Optimizing BigQuery Queries


You are responsible for optimizing BigQuery queries to improve performance and reduce costs. You notice that a frequently run query is taking longer than expected and is costly.



Questions:


1) Performance Analysis:


How would you analyze the performance of a BigQuery query?

What specific metrics or logs would you look at to identify inefficiencies?


2) Optimization Techniques:

List at least three techniques you would use to optimize a BigQuery query.

Explain how each technique improves performance or reduces costs.



3) Partitioning and Clustering:

Describe how you would use partitioning and clustering in BigQuery to optimize query performance.

Provide an example scenario where each technique would be beneficial.
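
As a sketch, a daily-partitioned, clustered log table might be declared like this (the table and column names are hypothetical):

```sql
CREATE TABLE my_dataset.web_logs (
  event_ts   TIMESTAMP,
  endpoint   STRING,
  status     INT64,
  latency_ms INT64
)
PARTITION BY DATE(event_ts)   -- prunes scans for date-bounded queries
CLUSTER BY endpoint, status;  -- co-locates rows commonly filtered together
```

Queries that filter on `DATE(event_ts)` then scan only the matching partitions, while filters on `endpoint` or `status` benefit from clustering — both reduce bytes scanned, and therefore cost.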





Scenario 3: Data Migration to GCP


Your organization is migrating its on-premises data warehouse to Google Cloud Platform. You need to design and implement a migration strategy.


Questions:

1) Planning and Assessment:


What factors would you consider when planning the migration of an on-premises data warehouse to GCP?

How would you assess the readiness of your existing data warehouse for migration?


2) Migration Strategy:


Describe the steps you would take to migrate data from an on-premises data warehouse to BigQuery.

What tools or services would you use to facilitate the migration?


3) Post-Migration Optimization:


After migrating the data, how would you optimize the new BigQuery data warehouse for performance and cost-efficiency?

What best practices would you follow to ensure the migrated data is accurate and queryable?




Scenario 4: Real-time Data Processing on GCP


Your company requires real-time data processing to analyze streaming data from IoT devices. The data needs to be ingested, processed, and stored for further analysis.


Questions:


1) Ingestion:


What GCP service(s) would you use to ingest real-time streaming data from IoT devices?

Explain the benefits of using these services for real-time data ingestion.


2) Processing:


Describe how you would implement real-time data processing on GCP.

Which GCP services would you use, and why?


3) Storage:


How would you store the processed real-time data for efficient querying and analysis?

What considerations would you take into account when choosing a storage solution?




One-liners for GCP:


How do you secure data in Google Cloud Storage?


What is the difference between Google BigQuery and Google Cloud SQL?


How do you implement data pipeline automation in Google Cloud?


Can you explain the role of Google Cloud Pub/Sub in data processing?


What strategies do you use for cost optimization in Google Cloud?


How do you handle schema changes in Google BigQuery?


What is the purpose of Google Dataflow, and when would you use it?


How do you monitor and troubleshoot performance issues in Google Cloud Dataproc?


Explain the difference between managed and unmanaged instance groups in GCP.


How would you design a data warehouse architecture on GCP?

Some useful links:

Xebia | Creating Digital Leaders.

https://www.linkedin.com/company/xebia/mycompany/

http://twitter.com/xebiaindia

https://www.instagram.com/life_at_xebia/

http://www.youtube.com/XebiaIndia

